Experimenting with GPT-3

I recently got access to GPT-3 and had been playing around with it. My initial use case was to help summarize the video lectures in the GANs specialization for my blog, but the results weren’t satisfactory. Instead, I turned my attention to SQL generation.

Part of my work at Adobe involves querying for information from our databases to service business and product stakeholders, so I was curious about its sql generation possibilities. Given a natural language instruction, was it possible to generate relatively sophisticated SQL? Most examples and blogs deal with very simple SQL with a few fields and one table. For multiple tables and joins, you’ll have to describe the tables in the prompt and GPT-3 will figure out a join phrase to get the data you need. I was more interested in how GPT-3 behaved given very few or a single example.

Our queries are large as they span lots of tables, so the first roadblock to using GPT-3 for SQL generation was the token size it could generate. GPT-3 ( as materialized in their playground app ) only takes a total of 2048 tokens, which includes the prompt and the generated text. We have queries that are larger than that alone.

That being said, there is a beta feature called fine-tuning which allows you to pass in larger training sets to train the model. I’ve yet to try it, so stay tuned.

To accommodate the limited token length, I eliminated the number of fields and table joins to come up with a simplified representation of our queries. The queries were also anonymized for security reasons.

Off the bat, I wanted to see if GPT-3 recognized fields that were parameterized. In this case we’re varying the account ids and the date range.

Following was done with the Davinci ( temp=0.2, topP=0.2 , freqP=0, prescenceP=0, bestOf=1) with parameters similar to OpenAI’s sql generation example https://beta.openai.com/playground/p/default-sql-request

Most of the prompt is taken up by examples, delineated by the ### marks. The last prompt/ask is what we’re interested in.

Get the number of impressions, total ad clicks, total net spend,and completions for the period between May 1 2021 and July 19 2021 for campaigns 716449, 716323, 716445, 716321, 716448, and 716322

select account_advertiser_name as advertiser_name , campaign_name , package_name, campaign_placement_name as placement_name, ad_template as ad_type , sum(coalesce(promo_impressions,0)) as impressions, sum(coalesce(clicks_banner,0)) as total_ad_clicks , sum(coalesce(tubemogul_media_fees,0) + coalesce(tubemogul_data_fees,0) + coalesce(tubemogul_other_fees,0) + coalesce(account_media_fees,0) + coalesce(account_data_fees,0) + coalesce(account_other_fees,0)) / cast(1000000 as double) as total_net_spend , sum(coalesce(complete100,0)) as complete100 , logdate from log_info where logdate >= '2021-05-01' and logdate < '2021-07-19' and campaign_id in (716449, 716323, 716445, 716321, 716448, 716322) 
###
Get the number of impressions, for the period between Jan 1 2021 and Feb 19 2021 for campaigns 6124, 5555, 0987, and 1234:

select account_advertiser_name as advertiser_name, campaign_name, package_name, campaign_placement_name as placement_name, ad_template as ad_type, sum(coalesce(promo_impressions,0)) as impressions, logdate from log_info where logdate >= '2021-01-01' and logdate < '2021-02-19' and campaign_id in (6124, 5555, 0987, 1234)

Pretty good. It recognized impressions, didn’t give me the other aggregate fields and substituted the date and campaign ids with new values.

Second, how about the fields queried? Here I abbreviated the fields in the prompt so they were different from what was in the query ( spend <-> total_net_spend )

Generate sql to get the number of ad_clks, spend, cmpl , imps for the period between May 1 2021 and July 19 2021 for campaigns 716449, 716323, 716445, 716321, 716448, and 716322

select account_advertiser_name as advertiser_name , campaign_name , package_name, campaign_placement_name as placement_name, ad_template as ad_type, sum(coalesce(clicks_banner,0)) as total_ad_clicks , sum(coalesce(tubemogul_media_fees,0) + coalesce(tubemogul_data_fees,0) + coalesce(tubemogul_other_fees,0) + coalesce(account_media_fees,0) + coalesce(account_data_fees,0) + coalesce(account_other_fees,0)) / cast(1000000 as double) as total_net_spend , sum(coalesce(complete100,0)) as complete100 , sum(coalesce(promo_impressions,0)) as impressions , logdate from log_info where logdate >= '2021-05-01' and logdate < '2021-07-19' and campaign_id in (716449, 716323, 716445, 716321, 716448, 716322) 
###
Get the number of spend for the period between Jan 1 2021 and Feb 19 2021 for campaigns 6124, 5555, 0987, and 1234:

select account_advertiser_name as advertiser_name, campaign_name, package_name, campaign_placement_name as placement_name, ad_template as ad_type, sum(coalesce(spend,0)) as total_spend from log_info where logdate >= '2021-01-01' and logdate < '2021-02-19' and campaign_id in (6124, 5555, 0987, 1234)

Not bad, considering completions wasn’t a perfect match. Did it recognize the fields by the text or the order? How about if I scrambled the order in the query from the ask?

Generate sql to get the number of ad_clks, spend, cmpl , imps, logdate for the period between May 1 2021 and July 19 2021 for campaigns 716449, 716323, 716445, 716321, 716448, and 716322

select account_advertiser_name as advertiser_name , campaign_name , package_name, campaign_placement_name as placement_name, ad_template as ad_type, sum(coalesce(clicks_banner,0)) as total_ad_clicks , sum(coalesce(tubemogul_media_fees,0) + coalesce(tubemogul_data_fees,0) + coalesce(tubemogul_other_fees,0) + coalesce(account_media_fees,0) + coalesce(account_data_fees,0) + coalesce(account_other_fees,0)) / cast(1000000 as double) as total_net_spend , sum(coalesce(complete100,0)) as complete100 , sum(coalesce(promo_impressions,0)) as impressions , logdate from log_info where logdate >= '2021-05-01' and logdate < '2021-07-19' and campaign_id in (716449, 716323, 716445, 716321, 716448, 716322) 
### 
Get only the cmpl for the period between Jan 1 2021 and Feb 19 2021 for campaigns 6124, 5555, 0987, and 1234: select account_advertiser_name as advertiser_name, campaign_name, package_name, campaign_placement_name as placement_name, ad_template as ad_type, sum(coalesce(complete100,0)) as complete100 from log_info where logdate >= '2021-01-01' and logdate < '2021-02-19' and campaign_id in (6124, 5555, 0987, 1234) 
### 
Get imps, cmpl, spend for the period between Jan 1 2020 and March 23 2021 for campaigns 6124, 5555, 0987, and 1234: 

select account_advertiser_name as advertiser_name, campaign_name, package_name, campaign_placement_name as placement_name, ad_template as ad_type, sum(coalesce(impressions,0)) as impressions, sum(coalesce(complete100,0)) as complete100, sum(coalesce(promo_impressions,0)) as promo_impressions, sum(coalesce(clicks_banner,0)) as clicks_banner, sum(coalesce(tubemogul_media_fees,0) + coalesce(tubemogul_data_fees,0) + coalesce(tubemogul_other_fees,0) + coalesce(account_media_fees,0) + coalesce(account_data_fees,0) + coalesce(account_other_fees,0)) / cast(1000000 as double) as spend from log_info where logdate >= '2020-01-01' and logdate < '2021-03-23' and campaign_id in (6124, 5555, 0987, 1234)

Looks like it recognized the text. Not bad at all. Notice in the prompt how I gave a second example for it to understand when I wanted a single field. It was around this time that GPT3 started giving back all the fields instead of just a single field ( not sure why ), so this second example made the results more predictable.

How about if I totally scrambled the field names in the ask. I substituted ad clicks, spend, completions, and impressions with bluegrass, nimrod, beeblebox, and groot

Generate sql to get the number of bluegrass, nimrod, beeblebox, groot, and logdate for the period between May 1 2021 and July 19 2021 for campaigns 716449, 716323, 716445, 716321, 716448, and 716322 

select account_advertiser_name as advertiser_name , campaign_name , package_name, campaign_placement_name as placement_name, ad_template as ad_type, sum(coalesce(clicks_banner,0)) as total_ad_clicks , sum(coalesce(tubemogul_media_fees,0) + coalesce(tubemogul_data_fees,0) + coalesce(tubemogul_other_fees,0) + coalesce(account_media_fees,0) + coalesce(account_data_fees,0) + coalesce(account_other_fees,0)) / cast(1000000 as double) as total_net_spend , sum(coalesce(complete100,0)) as complete100 , sum(coalesce(promo_impressions,0)) as impressions , logdate from log_info where logdate >= '2021-05-01' and logdate < '2021-07-19' and campaign_id in (716449, 716323, 716445, 716321, 716448, 716322) 
### 
Get only the beeblebox for the period between Jan 1 2021 and Feb 19 2021 for campaigns 6124, 5555, 0987, and 1234: 

select account_advertiser_name as advertiser_name, campaign_name, package_name, campaign_placement_name as placement_name, ad_template as ad_type, sum(coalesce(complete100,0)) as complete100 from log_info where logdate >= '2021-01-01' and logdate < '2021-02-19' and campaign_id in (6124, 5555, 0987, 1234) 
### 
Get only the groot for the period between Jan 1 2020 and March 23 2021 for campaigns 6124, 5555, 0987, and 1234: 

select account_advertiser_name as advertiser_name, campaign_name, package_name, campaign_placement_name as placement_name, ad_template as ad_type, sum(coalesce(complete100,0)) as complete100 from log_info where logdate >= '2020-01-01' and logdate < '2021-03-23' and campaign_id in (6124, 5555, 0987, 1234)

OK, it knew to give me only one field, probably because it saw the word ‘only’. However, it gave me completions instead of impressions.

What if I messed up the dates – giving it a time range of Jan 615 20212 and March 231 20212. This was interesting.

Generate sql to get the number of bluegrass, nimrod, beeblebox, groot, and logdate for the period between May 1 2021 and July 19 2021 for campaigns 716449, 716323, 716445, 716321, 716448, and 716322 

select account_advertiser_name as advertiser_name , campaign_name , package_name, campaign_placement_name as placement_name, ad_template as ad_type, sum(coalesce(clicks_banner,0)) as total_ad_clicks , sum(coalesce(tubemogul_media_fees,0) + coalesce(tubemogul_data_fees,0) + coalesce(tubemogul_other_fees,0) + coalesce(account_media_fees,0) + coalesce(account_data_fees,0) + coalesce(account_other_fees,0)) / cast(1000000 as double) as total_net_spend , sum(coalesce(complete100,0)) as complete100 , sum(coalesce(promo_impressions,0)) as impressions , logdate from log_info where logdate >= '2021-05-01' and logdate < '2021-07-19' and campaign_id in (716449, 716323, 716445, 716321, 716448, 716322) 
### 
Get only the beeblebox for the period between Jan 1 2021 and Feb 19 2021 for campaigns 6124, 5555, 0987, and 1234: 

select account_advertiser_name as advertiser_name, campaign_name, package_name, campaign_placement_name as placement_name, ad_template as ad_type, sum(coalesce(complete100,0)) as complete100 from log_info where logdate >= '2021-01-01' and logdate < '2021-02-19' and campaign_id in (6124, 5555, 0987, 1234) 
### 

Get only the groot for the period between Jan 615 20212 and March 231 20212 for campaigns 6124, 5555, 0987, and 1234: 

select account_advertiser_name as advertiser_name, campaign_name, package_name, campaign_placement_name as placement_name, ad_template as ad_type, sum(coalesce(complete100,0)) as complete100 from log_info where logdate >= '2021-01-06' and logdate < '2021-03-23' and campaign_id in (6124, 5555, 0987, 1234)

It still didn’t give me impressions, but it decided to provide its own interpretation of what the date was supposed to be. I have many other examples to show, but suffice it to say that as the examples became more complicated and verbose, the more confused it became, which makes sense given its limited training examples.

I’m sure if the prompt was expanded to accomodate more training examples, GPT-3 will be more intelligent about SQL generation, but suffice it to say it for now with the range of fields spanning multiple tables and keys typical in our data store, it won’t be able to take a user’s natural language input and generate the SQL to pull the data needed.

I would like to say I’ll publish more about this, but I just got access to OpenAI’s codex, so I’ll probably be dark for a while.