Around a year ago, Twitter was aflame about the second “Generative Pretrained Transformer” (GPT-2) model from OpenAI. It was a masterpiece of natural language processing (NLP), a broad field that deals with the rules and structure of language in an attempt to understand, parse, and now wholly generate it. GPT-2 was trained on a large chunk of text scraped from the web. With a whopping 1.5 billion tunable parameters, it was one of the biggest models ever constructed at the time. Some dubbed it “too dangerous to release” as a trained model.
Now, roughly a year later, concerns about dangers have been allayed (or set to the side), and GPT-2 has been superseded by GPT-3, with 175 billion (!) parameters. It was trained on a dataset called the Common Crawl, around a trillion words scraped from all over the web (including Wikipedia). That dataset is about 100 times larger than the one used to train GPT-2. It is actually a perceptible fraction of all the text ever written! So what’s the big deal? Do we now have a true “thinking machine,” or a much more expensive reason to say “damn you autocorrect”?
Part I: A short history of natural language processing.
Tl;dr: Researchers have been trying to get models to create language for a long time. Early models could offer the next best word in a sequence, but they fell short of producing true meaning.
With natural language processing, we are trying to coax a model to do what humans do, well, naturally. Say you’d like to look at part of a sentence (maybe the last few words typed) and decide what the next word is most likely to be. To do that, a model must encode the part of language we’d like to use as an input, make a numerical transformation from input to output, then finally decode the output back into language again. This problem in itself is incredibly complex, and researchers have spent a lot of time on it. The key thing to know is that in earlier NLP work, limitations in the encoding, processing, and decoding steps resulted in predictions that lacked contextual depth. By analogy, imagine you woke up one morning to your partner watching a movie you had never seen, and they asked you to guess the plot based on the first 5 seconds you saw.
The most ubiquitous example of this is probably the autocomplete function on a smartphone. In the text autocomplete example, we might train a model on a data set of text conversations, and arrange them in a predictable format (e.g. first word, second word, third word, fourth word). Then we can train a neural net until it does the best job of taking any 3 words and guessing what the next one should be.
This allows me to type “the” into my phone and repeatedly hit the first (“default”) suggestion button to get this sentence:
“The first time you have to go back and I think you are a lot more interested to see if you have the same time and it would have to go through the same….”
Hmm. Hmmmmmm. Here we see the most immediate (and some would say biggest) problem with a lot of autocomplete approaches: a lack of long-distance contextual coherence. Every sequence of 3-4 words above could have come from naturally written language. In that myopic context, it’s a success. But the whole thing doesn’t add up to anything of substance. The human mind easily picks this out as nonsense. In a sense, this model is doomed to have this limitation: if the inputs are only a few words arranged in a strict sequence, then the model is also limited in “memory.”
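To make the "three words in, one word out" idea concrete, here is a minimal sketch in Python. It is not what any phone actually runs: it swaps the neural net for a plain lookup table of word counts, and the tiny corpus, the function names, and the CONTEXT_SIZE constant are all invented for illustration. The limitation it shows is the same, though. Each guess depends only on the last three words, so nothing earlier in the sentence can influence it.

```python
from collections import Counter, defaultdict

CONTEXT_SIZE = 3  # assumption: the model only ever "sees" the last three words


def train(text, context_size=CONTEXT_SIZE):
    """Count which word follows each run of `context_size` words."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for i in range(len(words) - context_size):
        context = tuple(words[i:i + context_size])
        counts[context][words[i + context_size]] += 1
    return counts


def predict_next(counts, context):
    """Return the most frequent follower of `context`, or None if unseen."""
    followers = counts.get(tuple(context))
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def autocomplete(counts, seed, max_words=20, context_size=CONTEXT_SIZE):
    """Keep pressing the 'default' button: always append the top guess."""
    words = list(seed)
    for _ in range(max_words):
        next_word = predict_next(counts, words[-context_size:])
        if next_word is None:  # context never seen in training: give up
            break
        words.append(next_word)
    return " ".join(words)


# Toy corpus, made up for this example.
corpus = (
    "i think you have to go back to the start "
    "you have to go through the same thing again "
    "i think you have the same time to go back"
)
model = train(corpus)
print(autocomplete(model, ["i", "think", "you"]))
```

Every three-word window in the output appeared somewhere in the training text, yet the sentence as a whole is free to wander, because the table has no way to remember anything older than three words.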
Part II: Attention is All You Need
Tl;dr: The critical breakthrough in NLP was a model structure that allowed for much more flexible consideration of word inputs, which edged us closer to what we might call “context”.
In 2017, researchers introduced the Transformer model structure in a paper cheekily titled “Attention Is All You Need.”
The transformer was revolutionary, and several big and highly successful models were quickly trained with it. History was made; records smashed. There was really only one hitch: the “big” part. Since NLP models and deep neural nets are designed to encode and process almost anything, they are extremely computationally intensive. Attention matrices (which are a component of transformer models) add even more to the load. Their size scales with the square of the number of input words, which means that deep attentional neural nets got bigger, and bigger, and BIGGER over the last few years.
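The "number of words squared" scaling is easy to see in a small sketch. The Python below computes scaled dot-product attention weights, the core operation from the paper, in a deliberately simplified form: it uses the word vectors directly as both queries and keys (a real transformer first passes them through learned projections), and the four 2-dimensional vectors are made-up values. The point is only the shape of the result: every word gets a weight for every other word.

```python
import math


def attention_weights(queries, keys):
    """Scaled dot-product attention weights: softmax(q . k / sqrt(d)) per row."""
    d = len(keys[0])
    weights = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Hypothetical 2-dimensional vectors standing in for a 4-word input.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
w = attention_weights(tokens, tokens)
print(len(w), "x", len(w[0]))  # one row and one column per word

# The matrix has n * n entries, so the cost grows quadratically with length.
for n in (10, 100, 1000):
    print(n, "words ->", n * n, "attention entries")
```

Each row of the matrix sums to 1 and says how much one word "attends" to every other word. Making the input ten times longer makes that matrix a hundred times larger.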
Part III: GPT-3 is better, but bigger
Tl;dr: The combination of the transformer model structure and the mind-boggling size of the training dataset is what makes GPT notable.
This brings us to the present. GPT-3’s parameter count is 10 times larger than the next-largest thing on this plot. It’s ginormous; running it uses large amounts of memory and compute resources. Training it alone cost an estimated $12 million. Again, it comprises 175 billion tunable parameters and a training dataset that includes a non-negligible fraction of the entire content of the internet. Also, note that the “text” that GPT is trained on is not just English, or even just written language. It is encoded with an incredibly broad notion of text, spanning music, images and, unfortunately, The Canterbury Tales. That means the model needn’t be retrained for new areas, as many previous models had to be in order to perform well. This complexity adds up to some spookily good results (and some hilarious misses) as discussed below.
So what do we truly have on our hands? GPT is a model that:
Part IV: NLP model creates content. Hilarity ensues.
Tl;dr: GPT-3 has delivered incredible hits and, yes, some big misses.
Here are some of the high points of GPT-3 from just the first few weeks since its release.
Part V: GPT-3 for startups.
Tl;dr: GPT-3 has numerous business applications.
By now, a pattern has emerged: a few (sometimes uncanny) gems, alongside sometimes hilarious misses. My view is that there’s still a lot of value in one "gem" of a response buried in a bunch of results that aren't quite good enough to use, so long as we know how to leverage that one result. Here are some “seed ideas” that could get us closer to the best uses of GPT-3:
We at PSL are excited to work on GPT-3 concepts, and explore possibilities for what some are calling “the next internet”. Are you thinking about a startup idea leveraging GPT-3? Get in touch at hello@psl.com.