OpenAI GPT-2: Language Models are Multitask Learners

This post was originally published by Rohan Jagtap at Medium.

Understanding Transformer-Based Self-Supervised Architectures

Natural Language Processing tasks were initially tackled with purely supervised approaches that map input sequences to target sequences. Generative modeling burgeoned with sequence-to-sequence models and attention-based mechanisms. Pre-training a model (in an unsupervised way) on a generic task, and then fine-tuning it (in a supervised way) for more specific tasks, has achieved strong results in recent times. These models are called Language Models, and these approaches remove the dependency on labeled data for pre-training as they are self-supervised. ELMo, BERT, and OpenAI GPT are some of the groundbreaking language models.

In this article, we’ll be discussing OpenAI GPT-2, the successor to OpenAI GPT. It essentially combines the two aforementioned approaches (unsupervised pre-training and supervised fine-tuning) in a multi-task manner. We’ll discuss this in detail in the coming sections.

OpenAI GPT-2 has a Transformer-based architecture (Vaswani et al.). Some background on the Transformer is recommended for understanding GPT-2; I have covered the Transformer in this article, and you can give it a read for a better understanding. And as mentioned earlier, GPT-2 is an extension of the OpenAI GPT model (Radford et al.). You can refer to my article on OpenAI GPT if you’re interested.

GPT-2, like any other autoregressive language model, learns to assign probabilities to sequences of text; for a single task this amounts to estimating p(output | input). However, there is a small tweak here. Since the paper proposes leveraging multi-task learning, the objective is to model:

p(output | input, task)
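
In the single-task setting, this reduces to the standard autoregressive factorization the paper starts from, written here in the same plain notation as above:

p(x) = p(s1) · p(s2 | s1) · … · p(sn | s1, …, sn−1)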

Most multi-task models implement task conditioning at an architectural or algorithmic level (e.g., dedicated parallel encoders and/or decoders for different tasks). Language, however, lets us express the multi-task objective at the data level. For example, a translation example can appear in the original document itself in the format “translate to french, english text, french text”, and a reading comprehension sample in the format “answer the question, document, question, answer”.

Note that this is NOT a fixed format the text must follow; it is just an example of how a training sample can itself provide supervision. We will see some detailed examples of this in the next sub-section.
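
As a toy illustration, training samples that carry their own supervision might look like the following plain strings (the exact formats here are invented for this sketch and are not the actual WebText formats):

```python
# Each "task" is serialized as ordinary text, so a single next-token
# prediction objective can learn all of them. Formats are illustrative.
translation_sample = (
    "translate to french: The cat sat on the mat. => "
    "Le chat s'est assis sur le tapis."
)

reading_comprehension_sample = (
    "answer the question. document: GPT-2 was trained on WebText. "
    "question: What dataset was GPT-2 trained on? answer: WebText"
)

training_corpus = [translation_sample, reading_comprehension_sample]
```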

So GPT-2 proposes that taking data in such formats and simply training the model with a text-completion objective (the core autoregressive language-modeling objective) should suffice to learn all the underlying tasks. Also, since the unsupervised and supervised objectives are the same, the global minimum of the unsupervised objective is also the global minimum of the supervised objective. So the ultimate problem boils down to whether or not we are able to achieve convergence on the unsupervised objective.
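
Concretely, the “text completion” objective is just next-token prediction with cross-entropy. A minimal sketch in generic PyTorch (this is not the authors’ training code):

```python
import torch
import torch.nn.functional as F

def lm_loss(logits, token_ids):
    """Autoregressive language-modeling loss.

    logits:    (batch, seq_len, vocab_size) -- model outputs
    token_ids: (batch, seq_len)             -- the input token ids
    """
    # Predict token t+1 from everything up to token t:
    # drop the last logit and the first target.
    shift_logits = logits[:, :-1, :]
    shift_labels = token_ids[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```

The same loss is applied whether a sample is plain prose, a translation-formatted string, or a question-answer-formatted string.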

Moreover, since the model isn’t at all trained specifically for any of the underlying tasks (translation, summarization, question-answering, etc.), it is said to be an example of zero-shot task transfer.

Our model is not trained on any of the data specific to any of these tasks and is only evaluated on them as a final test; this is known as the “zero-shot” setting

OpenAI Official Blog
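
For instance, a present-day way to try this zero-shot behaviour is to specify the task purely in the prompt, with no fine-tuning. The snippet below uses the Hugging Face transformers library and a made-up prompt format, both assumptions of this sketch rather than the paper’s setup (the paper conditions the model on example pairs of sentences):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# The task is expressed only in the text of the prompt -- no task-specific
# training or architecture changes.
prompt = "english: How are you? french:"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```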

Examples of Naturally Occurring Demonstrations of English-French and French-English Translation found throughout the WebText Training Set via OpenAI GPT-2 Paper.

Our approach motivates building as large and diverse a dataset as possible in order to collect natural language demonstrations of tasks in as varied of domains and contexts as possible.

OpenAI GPT-2 Paper

The authors created a new web scrape that emphasizes document quality to collect such data. The resulting dataset, called WebText, contains slightly over 8 million documents (about 40 GB of text).

Word-level input representations lead to very large vocabularies and still leave many out-of-vocabulary (OOV) words. If we opt for character-level representations to counter this issue, we end up with very long sequences, and the model struggles to attend over such long contexts.

Byte-Pair Encoding (BPE) is the middle ground between word-level and character-level encoding. GPT-2 uses a byte-level variant of BPE that operates on raw bytes rather than Unicode characters, so the unknown token “<UNK>” rarely occurs in the WebText tokenization.

You can refer to this blog for Byte-Pair Encoding.
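
If you want a feel for how the merges are learned, here is a toy, character-level sketch of the classic BPE merge loop (following the widely used Sennrich et al. algorithm; GPT-2’s actual tokenizer is a byte-level variant with a 50,257-token vocabulary and extra rules about which merges are allowed):

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs over a {space-joined-symbols: freq} vocab."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every standalone occurrence of the pair into a single symbol."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

# Words pre-split into characters, with corpus frequencies.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

for step in range(5):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)      # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(f"merge {step + 1}: {best} -> {''.join(best)}")
```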

As mentioned earlier, GPT-2 implements a Transformer-based architecture and extends the OpenAI GPT model with some improvements:

  • The Layer Normalization from the Transformer is moved to the input of each sub-block (pre-norm); see the sketch after this list:
Pre-Norm vs Post-Norm from Wang et al.
  • An additional Layer Normalization is added at the end of the final self-attention block.
  • The weights of the residual layers are scaled at initialization by a factor of 1/√N, where N is the number of residual layers.
  • The vocabulary is expanded to 50,257 tokens, and the maximum sequence length is increased from 512 to 1,024 tokens.
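
To make the first three changes concrete, here is a rough PyTorch sketch of a pre-norm block with residual weight scaling. The module choices, names, and leaving the causal mask to the caller are assumptions of this sketch, not the released GPT-2 code.

```python
import math
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """A rough sketch of a GPT-2-style pre-norm Transformer block
    (illustrative only, not the released implementation)."""

    def __init__(self, d_model, n_heads, n_layers):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)   # LayerNorm moved to the sub-block input
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)   # LayerNorm before the feed-forward MLP
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # Scale the residual-path projection weights at initialization by
        # 1/sqrt(N), where N is the number of residual layers (2 per block).
        scale = 1.0 / math.sqrt(2 * n_layers)
        with torch.no_grad():
            self.attn.out_proj.weight.mul_(scale)
            self.mlp[-1].weight.mul_(scale)

    def forward(self, x, attn_mask=None):
        # Pre-norm residual connection around self-attention
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + a
        # Pre-norm residual connection around the MLP
        x = x + self.mlp(self.ln2(x))
        return x

# A final nn.LayerNorm would be applied after the last block of the stack,
# matching the additional Layer Normalization mentioned above.
```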

Fun fact: the largest variant of GPT-2 was the largest model at the time it was released, with 1.5B parameters!

Four Variants of OpenAI GPT-2 from the Paper

GPT-2 is trained on the WebText dataset with a purely language-modeling objective. It has achieved state-of-the-art results on many downstream datasets without even training on them. A few results are mentioned below:

OpenAI GPT-2 on Natural Questions (“none of these questions appear in WebText”) from the Paper.
Conditional Generation of an Out-of-Distribution Context from the Paper.
English-French and French-English Translation from the Paper.

We have discussed the architecture of one of the most important models in language modeling. We saw the multi-task approach and how differently a multi-task learning problem can be formulated in the domain of NLP. We also discussed the architectural improvements made over the previous version of the model. We covered the WebText dataset, created by the authors of GPT-2 to facilitate multi-task learning. And finally, we glanced at the near-human results obtained by GPT-2.

OpenAI GPT-2 Paper: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
