This post was originally published by Divish Dayal at Towards Data Science
tl;dr: This is a primer on unsupervised techniques in NLP and their applications. It begins with the intuition behind word vectors, their uses, and their advancements. It then moves to the centerpiece, a detailed discussion of language models: an introduction, their active use in industry, and possible applications for different use cases.
In the fledgling, yet advanced, fields of Natural Language Processing (NLP) and Natural Language Understanding (NLU), unsupervised learning holds an elite place. That’s because it satisfies both criteria for a coveted field of science: it’s ubiquitous, yet quite complex to understand.
I will attempt to break down my experience and knowledge in this space into simple blocks, hoping you can understand more about the field and, even more so, build better and more structured intuitions for solving NLP problems.
My work has largely been with unsupervised datasets, i.e. datasets with no labels or target variables. In industry, you often receive a business problem first, and then brainstorm and ideate over possible solutions. Sometimes, you come up with innovative datasets and labels to solve your problems. Often, the labels don’t exist, and you are left either to work with MTurkers (for the uninitiated, MTurk is a crowdsourced data-annotation platform) or to solve the problem without any labels at all, i.e. with unsupervised techniques.
Let’s dive into two of the most essential, and quite ubiquitous, sub-domains: word vectors and language models. Along with introducing the basic concepts and theory, I will include notes from my personal experience about best practices, practical and industrial applications, and the pros and cons of associated libraries.
Why vectors for words?
Representing words as vectors, i.e. arrays of 50–300 float values, is one of the biggest leaps in NLP, and also one of the easiest to understand.
Before the now-ubiquitous word vectors, the words in a vocabulary were vectorized using traditional one-hot encodings, as shown below. The figure shows one-hot vectors for a vocabulary containing the 3 words of the sentence ‘I love NLP’. Sparse encodings of this kind still underpin classical text representations such as bag-of-words and TF-IDF, and one-hot encoding is also prevalent in digital circuits.
Image by Author: one-hot encodings: here the word “love” is represented by the vector [0, 1, 0]
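To make the figure concrete, here is a minimal sketch of one-hot encoding for the same three-word sentence. The function name `one_hot_encode` and the first-seen ordering of the vocabulary are my own choices for illustration, not part of any library.

```python
def one_hot_encode(sentence):
    """Map each distinct word of a sentence to a one-hot vector."""
    vocab = []
    for word in sentence.split():
        if word not in vocab:
            vocab.append(word)  # keep words in first-seen order

    vectors = {}
    for index, word in enumerate(vocab):
        vec = [0] * len(vocab)  # one slot per vocabulary word
        vec[index] = 1          # switch on the slot for this word
        vectors[word] = vec
    return vectors


encodings = one_hot_encode("I love NLP")
print(encodings["love"])  # [0, 1, 0]
```

Note that every vector is as long as the vocabulary, and no two vectors share any information: ‘love’ is exactly as far from ‘NLP’ as it is from ‘I’. That is the limitation word vectors address.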
In the early 2010s, word vector models arrived with publications such as word2vec and GloVe. They quickly transformed the NLP domain, which adopted them for virtually every task.
How are word vectors trained?
Word vector models come in two flavors: the skip-gram model and the continuous bag of words (CBOW) model, as shown in the figure below. Both are based on the same underlying principle: information about a word lies in the contexts it is used in. For example, the words ‘Man’ and ‘Woman’ are used in very similar contexts, like ‘Man can do something’ vs. ‘Woman can do something’. Over millions of sentences and billions of tokens, the model learns that ‘Man’ and ‘Woman’ are related in their usage, while ‘Man’ is associated with ‘he/him’ and ‘Woman’ with ‘she/her’. So, over large datasets, the word vectors start to make a lot of sense, based on the associations formed through their use in a variety of sentences.
The CBOW architecture tries to guess the current word based on its context, whereas skip-gram guesses the surrounding words given the current word. For example, in the sentence “climate change is affecting nature adversely.”, a CBOW model will try to predict the word ‘affecting’ given the context, i.e. the other words in the sentence. The following figure illustrates both of these methods.
Base Image from publication — Efficient Estimation of Word Representations in Vector Space : Modelling variants for training word-vectors.
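The difference between the two variants is easiest to see in the training examples each one is fed. The sketch below builds both kinds of examples from the sentence used above. The helper `training_pairs` and the window size of 2 are assumptions for illustration; real implementations add details such as subsampling and negative sampling that are left out here.

```python
def training_pairs(tokens, window=2):
    """Build CBOW and skip-gram training examples from a token list."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        # Words up to `window` positions to the left and right of the target
        context = tokens[lo:i] + tokens[i + 1:i + 1 + window]
        cbow.append((context, target))                  # context -> current word
        skipgram.extend((target, c) for c in context)   # current word -> each neighbour
    return cbow, skipgram


tokens = "climate change is affecting nature adversely".split()
cbow_pairs, skipgram_pairs = training_pairs(tokens, window=2)

print(cbow_pairs[3])
# (['change', 'is', 'nature', 'adversely'], 'affecting')
print([pair for pair in skipgram_pairs if pair[0] == "affecting"])
```

CBOW gets one example per position (many words in, one word out), while skip-gram gets one example per (word, neighbour) pair.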
When you train such a model over billions of tokens from a large dataset containing web articles and so forth, what you get is a very potent representation of each word in the vocabulary in the form of a vector. These vectors can be 300 dimensions long, i.e. each word is represented by 300 real numbers. The most famous example to explain these vectors is shown in the figure below.
Image by Author: An example illustrating word-vectors when visualized in 2 dimensions.
Based on the figure, the following vector equation supposedly holds true:
King_vec - Man_vec ~= Queen_vec - Woman_vec
In other words, the vector offsets for the two pairs are similar: the direction from ‘Man’ to ‘King’ roughly matches the direction from ‘Woman’ to ‘Queen’.
Image from publication — https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf: Word vectors exhibiting relationships of various countries with their capitals projected in 2 dimensions
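Here is a small sketch of how such an analogy can be queried in practice. The 2-dimensional vectors are hand-made toy values that I chose so the relationship holds; they are not taken from any trained model, and the helper names `cosine` and `analogy` are hypothetical.

```python
import numpy as np

# Hand-made toy vectors (assumption): real vectors have 50-300 dimensions
toy_vectors = {
    "man": np.array([1.0, 1.0]),
    "woman": np.array([1.0, 3.0]),
    "king": np.array([4.0, 1.2]),
    "queen": np.array([4.0, 3.1]),
    "apple": np.array([-2.0, 0.5]),
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def analogy(a, b, c, vectors):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b and c."""
    query = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))


# "man is to king as woman is to ...?"
print(analogy("man", "king", "woman", toy_vectors))  # queen
```

With real pre-trained vectors, the same arithmetic is applied over the whole vocabulary, and the match is approximate rather than exact.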
Word vectors can exhibit interesting characteristics useful in real-world applications. With word-vectors, machines are able to understand and process text in a more human-like way. This understanding of words, and text in general, extends to other forms of media such as speech, images, and videos which are often transformed first to text and then processed further. More on this later.
A paper published in 2017 (Bojanowski et al., listed in the references at the end) brought a large improvement in the quality of word-vector representations. This method uses so-called sub-word embeddings to construct the vectors. It focuses on the morphology of words by breaking each word down into sub-words, or character n-grams. Let’s take the word “where” as an example. With n=3, and with < and > marking the start and end of the word, it is broken down into the following n-grams:
where: <wh, whe, her, ere, re>
These sub-word vectors are then combined to construct the vector for a word. This helps in learning better associations among words in the language. Think of it as learning at a more granular scale: the model can pick up patterns inside the words themselves, such as shared stems and affixes. For example, the difference between ‘cat’ and ‘cats’ is similar to that of other such pairs, like ‘dog’ and ‘dogs’. Similarly, ‘boy’ and ‘boyfriend’ have the same relation as ‘girl’ and ‘girlfriend’. This methodology also helps in creating more meaningful representations for OOV (out-of-vocabulary) words that the model has not seen in the training set.
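The n-gram breakdown is simple to reproduce. This is a minimal sketch using a single n-gram length of 3, as in the example above; the function name `char_ngrams` is my own, and actual sub-word models typically use a range of n-gram lengths.

```python
def char_ngrams(word, n=3):
    """Split a word into character n-grams, with < and > marking word boundaries."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(char_ngrams("where"))  # ['<wh', 'whe', 'her', 'ere', 're>']

# Related word forms share n-grams, which is what ties their vectors together
shared = set(char_ngrams("cat")) & set(char_ngrams("cats"))
print(sorted(shared))  # ['<ca', 'cat']
```

Because an unseen word still decomposes into n-grams the model has seen, a vector can be assembled for it from those pieces.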
How are word-vectors useful?
I discuss the applications of word-vectors in greater detail in this other post by me. Briefly, word-vectors are useful for quick calculations, especially if computing resources are limited. Finding pre-trained word-vectors for a variety of corpora (news, web, social media like Twitter and Reddit, and so forth) is easy. You would want to use word-vectors trained on a dataset as close as possible to your application dataset. For example, word vectors trained on a Twitter dataset will be different from ones trained on news articles.
Word vectors can be used to construct vectors for phrases or sentences, which can then be used for similarity or clustering tasks. Even an easy task like plotting a word cloud is a powerful way to analyze a dataset. However, the real power of word-vectors is unleashed with language modeling.
Image by Author: Word Cloud generated from this blog post
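As a rough illustration of the similarity use case, the sketch below averages word vectors into a sentence vector and compares sentences by cosine similarity. The 3-dimensional vectors are made-up toy values standing in for pre-trained ones, and the helper names are hypothetical.

```python
import numpy as np

# Toy stand-ins for pre-trained word vectors (assumption, not real model output)
toy_vectors = {
    "cats": np.array([0.9, 0.1, 0.0]),
    "dogs": np.array([0.8, 0.2, 0.0]),
    "purr": np.array([0.7, 0.0, 0.1]),
    "bark": np.array([0.6, 0.1, 0.1]),
    "stocks": np.array([0.0, 0.1, 0.9]),
    "fell": np.array([0.1, 0.0, 0.8]),
}


def sentence_vector(sentence, vectors):
    """Average the vectors of the known words in a sentence."""
    known = [vectors[w] for w in sentence.lower().split() if w in vectors]
    if not known:
        raise ValueError("no known words in sentence")
    return np.mean(known, axis=0)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


s1 = sentence_vector("cats purr", toy_vectors)
s2 = sentence_vector("dogs bark", toy_vectors)
s3 = sentence_vector("stocks fell", toy_vectors)

# The two pet sentences end up closer to each other than to the finance one
print(cosine(s1, s2) > cosine(s1, s3))  # True
```

Averaging is cheap and often a reasonable baseline, but it throws away word order, which is one reason language models do better at this.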
What are Language models?
Language modeling is a primary tool for unsupervised NLP tasks in the arsenal of ML engineers. Wikipedia defines language models aptly:
A statistical language model is a probability distribution over sequences of words.
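So, in simpler words, a language model is used to capture and predict the relationships between words across a sentence or a document. Fundamentally, the language model predicts the conditional probability distribution for the next word in a sentence, given the words before it:

$$P(w_t \mid w_1, w_2, \ldots, w_{t-1})$$

Using this, the probability of the occurrence of a sentence of $n$ words is given by the product of these conditional probabilities:

$$P(w_1, w_2, \ldots, w_n) = \prod_{t=1}^{n} P(w_t \mid w_1, \ldots, w_{t-1})$$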
Long story short, a language model learns word associations in a dataset so that it can predict the next word in a sentence, or judge the validity of a sentence, i.e. how probable that sentence is under the distribution learned from the training dataset. So a model can tell that “Hi, how are you doing?” is a more probable English sentence than something wacky like “Hi, goodnight!”.
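The idea of scoring a sentence by multiplying next-word probabilities can be sketched with a count-based bigram model. This is a deliberately simplified stand-in, not how modern neural language models work: it conditions only on the single previous word, the three-sentence corpus is invented for illustration, and add-one smoothing is one of several possible choices.

```python
from collections import Counter

# Tiny invented training corpus (assumption)
corpus = [
    "hi how are you doing",
    "hi how are you",
    "how are you doing today",
]

START, END = "<s>", "</s>"
unigrams, bigrams = Counter(), Counter()
for line in corpus:
    tokens = [START] + line.split() + [END]
    unigrams.update(tokens[:-1])             # how often each word appears as context
    bigrams.update(zip(tokens, tokens[1:]))  # how often each word pair appears

vocab = {w for line in corpus for w in line.split()} | {END}


def next_word_prob(prev, word):
    """P(word | prev), with add-one smoothing so unseen pairs are not zero."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))


def sentence_prob(sentence):
    """Multiply the next-word probabilities along the sentence."""
    tokens = [START] + sentence.split() + [END]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= next_word_prob(prev, word)
    return prob


print(sentence_prob("hi how are you") > sentence_prob("you are how hi"))  # True
```

The same words in a scrambled order get a far lower score, because the model has never seen those word pairs follow each other.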
Image by Author: Language Models are used in phone keyboard applications to predict the next word.
As shown in the figure above, one of the most obvious uses of a language model is in keyboard applications, which suggest the next word while we type. Auto-correct builds on the same idea. Let’s see how this happens in more detail using the following figure.
Image by Author: Working Illustration of a Language Model
The words, or context, of the sentence are transformed into word vectors. These go as inputs into a language model, which is essentially a time-series-based neural network. Finally, we get probability values across all the words in the vocabulary, indicating how likely each word is to be the next word for the given input context. In practice, vocabularies are huge (~300k words or more), and these output probabilities are significant for only tens of words, whereas the rest have minuscule values (like 0.00001).
Architecturally, language models have two primary blocks: the encoder and the decoder. As the name suggests, an encoder encodes the inputs (word vectors) using time-series neural network models. By time-series, I mean that the model takes into account the positional order of words in a sentence, where word_2 comes after word_1. The decoder does the opposite: it uses the output of the encoder (a feature vector) to emit words one at a time, completing a sentence as it loops to the end. This is deep architectural detail that you probably don’t need unless you are coding it yourself; otherwise, it is just good-to-know terminology.
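The last step, turning the model’s raw scores into probabilities over the vocabulary, is usually done with a softmax. The sketch below uses a six-word vocabulary and made-up scores for the context “how are”; both are assumptions for illustration only.

```python
import numpy as np

vocab = ["you", "doing", "today", "banana", "the", "going"]
# Hypothetical raw scores (logits) a model might emit for the context "how are"
logits = np.array([6.0, 1.0, 0.5, -4.0, -1.0, 2.0])


def softmax(x):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = x - np.max(x)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


probs = softmax(logits)

# Keep only the few most likely candidates, as a keyboard app would
top_k = np.argsort(probs)[::-1][:3]
for i in top_k:
    print(f"{vocab[i]}: {probs[i]:.4f}")
```

Even in this tiny example most of the probability mass lands on one or two words, which mirrors what happens at the scale of a real vocabulary.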
Image by Author: Typical structure of Language Models
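For readers who do want to see the data flow, here is a bare-bones sketch of the encode-then-decode loop using a plain recurrent update in NumPy. Everything here is an assumption for illustration: the weights are random and untrained (so the generated words are meaningless), the encoder and decoder share one set of weights for brevity, and the names `step`, `encode`, and `decode` are my own.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<eos>", "hi", "how", "are", "you"]
dim, hidden = 4, 6

# Untrained random weights: only the flow of data matters in this sketch
embeddings = rng.normal(size=(len(vocab), dim))  # one word vector per word
W_in = rng.normal(size=(hidden, dim))
W_rec = rng.normal(size=(hidden, hidden))
W_out = rng.normal(size=(len(vocab), hidden))


def step(state, token_id):
    """One recurrent update: mix the previous state with the current word vector."""
    return np.tanh(W_in @ embeddings[token_id] + W_rec @ state)


def encode(token_ids):
    """Consume the words in order and return the final state (the feature vector)."""
    state = np.zeros(hidden)
    for token_id in token_ids:  # word_1 is consumed before word_2
        state = step(state, token_id)
    return state


def decode(state, max_len=5):
    """Emit one word at a time, feeding each choice back in, until <eos> or max_len."""
    output = []
    for _ in range(max_len):
        token_id = int(np.argmax(W_out @ state))  # greedy pick of the top-scoring word
        if vocab[token_id] == "<eos>":
            break
        output.append(vocab[token_id])
        state = step(state, token_id)
    return output


feature = encode([vocab.index(w) for w in ["hi", "how", "are"]])
print(feature.shape)  # (6,)
print(decode(feature))
```

Because each state depends on the previous one, feeding the same words in a different order produces a different feature vector, which is what “time-series” means here.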
Applications of Language Models
What we have discussed so far is a basic, out-of-the-box language model. Language models can be used in a variety of ways in the unsupervised context. Let’s look at some of the more popular ones:
- Vectorizing a sentence into a vector. This is a much better way of vectorizing sentences than any sort of averaging of word vectors.
- Sentence/document classification tasks in a supervised setting (where you have target labels). Operationally, you take a pre-trained unsupervised language model and train it further (fine-tune it) on a supervised classification task. In the absence of any labels, you can perform clustering to segment the data for analysis.
- Generation tasks, as we saw in the figure earlier. Sentence generation is a rising field, with recent headlines from the likes of GPT-3. Generation tasks range from generation from scratch for ads (Pencil, Persado), games (AI Dungeon), and news or document summarization (Agolo), to conversational models, chatbots, keyboard/Google autocompletion, question answering, and so forth. Any type of generation is possible if you have a suitable dataset, as someone did for Game of Thrones here. Some advanced applications use language models in multimedia settings like image captioning, a variety of speech-to-text and text-to-speech tasks (as in Alexa), handwriting recognition, and so forth.
- Machine Translation for translating across languages has come a long way. An application like Google Translate uses language models to convert speech to text and then translate to different languages.
- Information Retrieval tasks on large datasets like entity resolution, aspect-based sentiment analysis, document tagging. Some of the most powerful applications use language models in conjunction with knowledge bases.
These were some of the popular ones. Any data that is sequentially consumed or generated can be modeled with a language model. Music generation using AI is one such application. The number of applications out there is enormous, and it is expanding by the minute. With heavy research coming from academia and corporations, modeling power and ability are improving rapidly. The state of the art for related tasks is pushed further with every major AI conference, so much so that it’s hard to keep track of the entire domain; you have to focus on one or a few of the aforementioned sub-domains.
An interesting and nascent development in AI has been the concept of few-shot learning. It means that a trained model can learn a new task with only a few examples with supervised information by incorporating prior knowledge.
This is a huge step forward in the field of AI, as traditionally you need incredibly large amounts of data to learn even simple tasks. In the context of language models, pre-trained models like BERT and the GPT-x family are trained over billions of tokens (>100GB of raw text data), and even then, fine-tuning these models on specific tasks requires 1M+ data points. Compared to this, few-shot learners can learn new tasks using just a few points per label. The concept rises to a whole new level with zero-shot learning, where, instead of data points, only metadata-level information about the classes is used as input. It’s an active area of research still in its early days, and a very promising one once its accuracy reaches levels acceptable for use in industry.
The accuracy and usefulness of AI models correlate directly with dataset size and the number of model parameters.
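One simple way to get a feel for learning from a few points per label is a nearest-centroid classifier on top of fixed sentence embeddings. To be clear about the assumptions: this is not how GPT-3 performs few-shot learning (it is given the examples as part of its input text), the 2-dimensional embeddings are made-up values standing in for the output of a pre-trained model, and the labels are hypothetical.

```python
import numpy as np

# Three labelled examples per class; values are invented stand-ins for
# sentence embeddings produced by a pre-trained language model
support = {
    "sports": np.array([[0.9, 0.1], [0.8, 0.2], [0.85, 0.05]]),
    "finance": np.array([[0.1, 0.9], [0.2, 0.8], [0.15, 0.95]]),
}

# Summarise each class by the mean of its few examples
centroids = {label: examples.mean(axis=0) for label, examples in support.items()}


def classify(embedding):
    """Assign the label whose centroid is nearest to the embedding."""
    return min(centroids, key=lambda label: np.linalg.norm(embedding - centroids[label]))


print(classify(np.array([0.7, 0.3])))  # sports
```

The heavy lifting is done by the pre-trained embeddings; the few labelled points only need to locate each class in that space.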
Image by Author: Various language models plotted against their parameter sizes
This figure shows the leading language models in the period of 2017–2020. Parameter counts have increased by roughly an order of magnitude every year: BERT-Large (2018) has 340M parameters, GPT-2 (early 2019) has 1.5B, T5 (late 2019) extends further to 11B, and finally GPT-3 (mid-2020) has a whopping 175B.
With this growth, access to research on and training of these models has shifted to the biggest corporations, the likes of Google, OpenAI, Microsoft, and Facebook. GPT-3 requires 700GB of GPU memory to train, far beyond the 10–16GB of memory on normal consumer GPUs, and costs over $4.6M to train on cloud Tesla V100 GPUs in parallel. The rest of us mortals are left to use the smaller versions of these huge pre-trained models for our different tasks. That’s not necessarily a bad thing, and it is an inevitable consequence of the advancement of AI, but it is something to ponder and keep in the back of our minds when considering the future of this elemental technology.
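A quick back-of-the-envelope calculation shows why these models outgrow consumer hardware. The sketch assumes each parameter is stored as a 32-bit float (4 bytes) and counts only the weights; under that assumption 175B parameters come to 700GB, which matches the figure quoted above, though I can’t confirm that is how the figure was derived. Actual training needs additional memory for gradients, optimizer state, and activations.

```python
BYTES_PER_PARAM_FP32 = 4  # assumption: 32-bit floats
GPU_MEMORY_GB = 16        # upper end of the consumer range quoted above

# Parameter counts as given in the text
models = {"GPT-2": 1.5e9, "T5": 11e9, "GPT-3": 175e9}


def weights_gb(num_params, bytes_per_param=BYTES_PER_PARAM_FP32):
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


for name, params in models.items():
    gb = weights_gb(params)
    print(f"{name}: {gb:.0f} GB of weights, ~{gb / GPU_MEMORY_GB:.1f}x a {GPU_MEMORY_GB} GB GPU")
```

By this estimate, even GPT-2 fills a sizeable share of a single consumer GPU before any training overhead is counted.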
Looking forward to hearing from you on any critiques, discussions, or any other thoughts you would like to share with me.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532–1543).
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146.
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., … & Agarwal, S. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2019). Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.