What's The Big Deal About GPT-3 (And What Does It Mean For Startups)?

Around a year ago, Twitter was aflame about the second “Generative Pretrained Transformer” (GPT-2) model from OpenAI. It was a masterpiece of natural language processing (NLP), a broad field that deals with the rules and structure of language in the attempt to understand, parse and now wholly generate it. GPT-2 was trained on a large chunk of text scraped from the web. With a whopping 1.5 billion tunable parameters, it was one of the biggest models ever constructed. Some dubbed it “too dangerous to release” as a trained model.

Now, roughly a year later, concerns about dangers have been allayed (or set to the side), and GPT-2 has been superseded by GPT-3, with 175 billion (!) parameters. It was trained largely on a dataset called the Common Crawl, around a trillion words scraped from all over the web, alongside other sources including Wikipedia. That dataset is about 100 times larger than the one used to train GPT-2. It is actually a perceptible fraction of all the text ever written! So what’s the big deal? Do we now have a true “thinking machine,” or a much more expensive reason to say “damn you autocorrect”?

Part I: A short history of natural language processing. 

Tl;dr: Researchers have been trying to get models to create language for a long time. Early models could offer the next best word in a sequence, but they fell short of producing true meaning.

With natural language processing, we are trying to coax a model to do what humans do, well, naturally. Say you’d like to look at part of a sentence (maybe the last few words typed) and decide what the next word is most likely to be. To do that, a model must encode the part of language we’d like to use as an input, make a numerical transformation from input to output, then finally decode the output back into language again. This problem in itself is incredibly complex, and researchers have spent a lot of time on it. The key thing to know is that in earlier NLP work, limitations in the encoding, processing, and decoding steps resulted in predictions that lacked contextual depth. It is a bit like waking up one morning to find your partner watching a movie you have never seen, and being asked to guess the plot from the first 5 seconds you see.

The most ubiquitous example of this is probably the autocomplete function on a smartphone. In the text autocomplete example, we might train a model on a data set of text conversations, and arrange them in a predictable format (e.g. first word, second word, third word, fourth word). Then we can train a neural net until it does the best job at taking any 3 words and guessing what the next one should be.

This allows me to type “the” into my phone and continually hit the “default” first button to get this sentence:

“The first time you have to go back and I think you are a lot more interested to see if you have the same time and it would have to go through the same….”

Hmm. Hmmmmmm. Here we see the most immediate (and some would say biggest) problem with a lot of autocomplete approaches: a lack of long-distance contextual coherence. Every sequence of 3-4 words above could have come from naturally written language. In that myopic context, it’s a success. But the whole thing doesn’t add up to anything of substance. The human mind easily picks this out as nonsense. In a sense, this model is doomed to have this limitation - if the inputs are only a few words arranged in a strict sequence, then the model is also limited in “memory.”
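To make the autocomplete setup concrete, here is a minimal Python sketch. It is an assumption-laden stand-in, not what a phone actually runs: instead of a neural net it simply counts which word most often follows each run of 3 words, and the tiny corpus and function names are made up for illustration. Repeatedly taking the top suggestion mimics hitting the “default” button over and over.

```python
from collections import Counter, defaultdict


def train_next_word_model(text, context_size=3):
    """Count which word follows each run of `context_size` words."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for i in range(len(words) - context_size):
        context = tuple(words[i:i + context_size])
        counts[context][words[i + context_size]] += 1
    return counts


def autocomplete(counts, seed_words, context_size=3, max_new_words=10):
    """Keep appending the most frequent next word (the "default" button)."""
    words = list(seed_words)
    for _ in range(max_new_words):
        context = tuple(words[-context_size:])
        if context not in counts:
            break  # this 3-word context never appeared in training
        next_word = counts[context].most_common(1)[0][0]
        words.append(next_word)
    return " ".join(words)


# Hypothetical toy corpus standing in for a data set of text conversations.
corpus = (
    "i think you are a lot more fun . "
    "i think you are a lot more interested in this . "
    "i think you are right ."
)
model = train_next_word_model(corpus)
print(autocomplete(model, ["i", "think", "you"], max_new_words=3))
```

Because the model only ever sees the last 3 words, nothing it wrote earlier can influence what comes next, which is exactly the “memory” limit described above.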

Part II: Attention is All You Need

Tl;dr: The critical breakthrough in NLP was a model structure that allowed for much more flexible consideration of word inputs, which edged us closer to what we might call “context”.

In 2017, researchers introduced the Transformer model structure in a paper cheekily titled “Attention Is All You Need.”

The transformer was revolutionary, and several big and highly successful models were quickly trained with it. History was made; records smashed. There was really only one hitch - the “big” part. Since NLP models and deep neural nets are designed to encode and process almost anything, they are extremely computationally intensive. Attention matrices (which are a component of transformer models) add to that cost. Their size scales intrinsically with the “number of words squared,” which means that deep attentional neural nets got bigger, and bigger, and BIGGER over the last few years.
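The “number of words squared” scaling is easier to see in code. The sketch below computes scaled dot-product attention weights in plain Python for four made-up 2-dimensional word vectors. It is a simplification under stated assumptions: a real transformer first passes each word through learned query, key and value projections and uses many attention heads, while here the same toy vectors play all three roles.

```python
import math


def softmax(row):
    """Turn a row of scores into positive weights that sum to 1."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_weights(queries, keys):
    """One row per query word, one column per key word: an n-by-n matrix."""
    scale = math.sqrt(len(keys[0]))
    return [softmax([dot(q, k) / scale for k in keys]) for q in queries]


def attend(queries, keys, values):
    """Mix the value vectors using the attention weights."""
    weights = attention_weights(queries, keys)
    width = len(values[0])
    return [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(width)]
        for row in weights
    ]


# Hypothetical toy vectors for a 4-word input.
word_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
weights = attention_weights(word_vectors, word_vectors)
mixed = attend(word_vectors, word_vectors, word_vectors)
print(len(weights), "x", len(weights[0]), "attention matrix")

# Every word gets a weight for every other word, so the count grows as n * n.
for n in (10, 100, 1000):
    print(n, "words ->", n * n, "attention weights")
```

Every word can look at every other word, which is what gives the model its flexible sense of context, and also what makes the matrix balloon as inputs get longer.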

Part III: GPT-3 is better, but bigger

Tl;dr: The combination of the transformer model structure and the mind-boggling size of the training dataset is what makes GPT notable.

This brings us to the present. GPT-3’s parameter count is 10 times larger than that of the next-largest model in the comparison below. It’s ginormous; running it uses large amounts of memory and compute resources. It reportedly cost an estimated $12 million just to train. Again, it has 175 billion tunable parameters and was trained on a dataset that includes a non-negligible fraction of the entire content of the internet. Also, note that the “text” that GPT is trained on is not just English, or even just written language. It is encoded with an incredibly broad notion of text, spanning music, images and, unfortunately, The Canterbury Tales. That means the model needn’t be retrained for new areas, as many previous models had to be to perform well. This complexity adds up to some spookily good results (and some hilarious misses), as discussed below.

Figure: GPT-3’s parameter count is 10 times larger than that of the next-largest model plotted. Source: Search Engine Journal.

So what do we truly have on our hands? GPT is a model that:

  • Allows next-phrase generation across almost any context, and can be “primed” with anything from a few words to a long passage
  • Guesses the next most likely phrases, but can “hold context” over a long distance and continue talking about the same topics as it started with
  • Is unwieldy in size and cost to use
  • Can be tried over and over, generating different results each time, but is a “black box” inside, with abstract internal reasoning that is difficult to query
  • Can be prompted in very creative ways (see below)
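Why does the same prompt give different results each time? Generative language models typically pick the next word by sampling from a probability distribution rather than always taking the top choice. Here is a minimal Python sketch of that idea using a “temperature” knob. The word scores are invented for illustration, and this is a general sampling sketch, not a description of GPT-3’s exact decoding procedure.

```python
import math
import random


def sample_next_word(scores, temperature=1.0, rng=random):
    """Sample one word with probability proportional to exp(score / temperature)."""
    words = list(scores)
    scaled = [scores[w] / temperature for w in words]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(words, weights=weights, k=1)[0]


# Hypothetical scores a model might assign to candidate next words.
scores = {"model": 2.0, "machine": 1.5, "parrot": 0.5}

rng = random.Random(0)  # seeded so a run can be repeated
print([sample_next_word(scores, temperature=1.0, rng=rng) for _ in range(5)])
print([sample_next_word(scores, temperature=0.01, rng=rng) for _ in range(5)])
```

A low temperature makes the top-scoring word win almost every time, while a higher one spreads the picks around. That spread is what lets you retry a prompt and fish for a better result.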

Part IV:  NLP model creates content. Hilarity ensues.

Tl;dr: GPT-3 has delivered incredible hits and, yes, some big misses.

Here are some of the high points of GPT-3 in the few weeks since its release.

  • LearnFromAnyone claims it can enable a chat between you and historical figures. How do they do it? Just enter the user’s questions followed by “<Desired teacher name>:” and let GPT-3 fill in the rest. The results are hit-or-miss - the important thing is that this is very easy to do with nothing but a GPT-3 interface. Our team, for example, could probably have produced this in a day.
  • The model can sometimes answer medical questions, even complex ones, given nothing more than a simple formatting of the question. It can also return some very confidently wrong answers.
  • Code generation:  We have had some luck with this at PSL, and others have generated very interesting early results, including buttons, widgets and web objects of various sorts.  But, as you may imagine given the rest of the things in this list, not every result is a hit.
  • We can feed the algorithm a few lines of guitar tabs (as an example) and get a new song.
  • GPT is writing stories.  While a lot has been made of the great and readable results, it’s also worth noting that not all results are so great, especially with longer-form factual prompts.
  • Images.  No, seriously.  Technically you can just as easily train a GPT-2/3 structure on pixels, and the results are fascinating.
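The “historical teacher” trick from the first bullet is mostly string formatting. The Python sketch below builds such a prompt. The “Student:” label and the function name are assumptions for illustration (the only detail taken from the description above is that the prompt ends with the teacher’s name and a colon), and the step of actually sending the prompt to a model is left out.

```python
def build_teacher_prompt(teacher_name, question, history=()):
    """Format a chat so the text ends with "<teacher name>:" for the model to continue."""
    lines = []
    for past_question, past_answer in history:
        lines.append(f"Student: {past_question}")
        lines.append(f"{teacher_name}: {past_answer}")
    lines.append(f"Student: {question}")
    lines.append(f"{teacher_name}:")  # the model is left to fill in what follows
    return "\n".join(lines)


prompt = build_teacher_prompt("Ada Lovelace", "What is an algorithm?")
print(prompt)
```

Whatever the model writes after the final colon is treated as the teacher’s answer, and it can be appended to `history` for the next turn.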

Part V:  GPT-3 for startups.

Tl;dr: GPT-3 has numerous business applications.

By now, a pattern has emerged - a few (sometimes uncanny) gems, alongside sometimes hilarious misses.  My perspective on these things is that there’s still a lot of value in one "gem" of a response buried in a bunch of results that aren't quite good enough to use, so long as we know how to leverage that one result.  Here are some “seed ideas” that could get us closer to the best uses of GPT-3:

  • Text/Feed Summarization:  Clever summarization techniques have been tried so far, such as writing an extended passage and then prompting the generator with “TL;DR:”.  However, we also have access to algorithms used to vectorize content and apply simple similarity scores.  It is possible that by combining these technologies, we might produce a better tool for rapidly summarizing feed information.
  • Code Generation: As mentioned above, it may be possible to use GPT-3 as-is with creative prompts to generate usable code, but as with text writing, the results tend to be hit-or-miss. In addition to doing research to find the best prompt styles, we could also write tests for the automatically generated code, and keep generating an ensemble of candidate outputs until one of them passes those tests.
  • Accelerated writing: A lot of good text generation is possible, but still requires significant human interaction to be usable. I have been tinkering with a method I call “Reductive Iterative Human-In-the-Loop” (RIHIL), which is really just a fancy way of saying “let a human add lots of text using a generator but then only keep the stuff that seems relevant in context and rewrite/override continuously.” When combined with the above techniques (context and grammar checking, secondary relevance filters), we may achieve a creative writing tool that dramatically accelerates the process. By the same token, we could also produce summary paragraphs from shorter statistics (e.g. sports reporting) given the breadth of informational structure GPT-3 has accumulated.
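As a rough illustration of the “test the generated code” idea from the Code Generation bullet, here is a minimal Python sketch. The candidate snippets are hard-coded stand-ins for what a generator might return, and the function names are hypothetical. The filter runs each candidate against a few test cases and keeps the first one that passes.

```python
def passes_tests(source, function_name, test_cases):
    """Run candidate source in a scratch namespace and check it against test cases."""
    namespace = {}
    try:
        # NOTE: generated code is untrusted; a real tool should run it in a sandbox.
        exec(source, namespace)
        func = namespace[function_name]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        return False  # syntax errors, missing functions and crashes all count as misses


def first_passing_candidate(candidates, function_name, test_cases):
    """Return the first candidate that passes every test, or None if none do."""
    for source in candidates:
        if passes_tests(source, function_name, test_cases):
            return source
    return None


# Hypothetical generator outputs: two misses and one gem.
candidates = [
    "def add(a, b):\n    return a - b\n",  # runs, but the logic is wrong
    "def add(a, b)\n    return a + b\n",   # does not even parse
    "def add(a, b):\n    return a + b\n",  # correct
]
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

print(first_passing_candidate(candidates, "add", tests))
```

The design choice here is that misses are cheap: as long as the tests are trustworthy, it does not matter how many bad candidates the generator produces before the one gem shows up.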

We at PSL are excited to work on GPT-3 concepts, and explore possibilities for what some are calling “the next internet”. Are you thinking about a startup idea leveraging GPT-3? Get in touch at hello@psl.com.