FOUNDATIONS OF MACHINE LEARNING


Generating Text Using (Large) Language Models

In this chapter, you will learn about language models and how such models can be used to generate text. An example of a language model that has gained much attention recently, and which you’ve likely heard of, is ChatGPT (OpenAI, 2023). ChatGPT is an “AI” tool that generates text based on user-provided prompts, often in a very realistic and human-like manner. The model is trained on enormous amounts of data and has trillions of model parameters (or so the rumor goes). While you will not create a copy of ChatGPT in this chapter, we hope that you will get a basic understanding of how a model like ChatGPT generates sequences of text.

Remark: The course book does not cover the topic of (generative) language models. If you want to learn more, we instead recommend the book Speech and Language Processing (Jurafsky & Martin, 2023), available freely from the authors’ webpage. In particular, Chapter 7 (Neural Networks and Neural Language Models) is the most relevant for this section of the course. Chapters 9 and 10 also provide interesting reading if you want to dive deeper into more advanced neural language models.


Probabilistic Models for Generating Text

We will focus on one specific way of generating sequences of text, namely word by word. At the core of this type of text generation is a probabilistic model, which gives us probabilities for possible next words in a text sequence, based on the previous words in that sequence.

Let’s assume that we use such a probabilistic model to generate a text sequence, and that the sequence so far consists of $t-1$ words, $w_1$ up to $w_{t-1}$. The probability of the next word, $w_{t}$, conditioned on the previous words can then be written as

$$p(w_{t} \mid w_{1}, w_{2}, \ldots, w_{t-1}).$$

Given the previous words $w_{1}, w_{2}, \ldots, w_{t-1}$ as input, the model will assign such a probability to each word $w_{t}$ in a fixed vocabulary, a set of pre-determined words that the model recognizes. Any word that is not in the vocabulary will not be considered by the model. That is to say, the vocabulary determines what kind of text sequences the model can generate.

Using our probabilistic model, we can predict the next word either deterministically, by choosing the word with the highest probability, or stochastically, by sampling the next word based on the probabilities predicted over the full vocabulary. The probability of a full text sequence of $T$ words can be calculated using the chain rule of probability as

$$p(w_{1}, w_{2}, \ldots, w_{T}) = \prod_{t=1}^{T} p(w_{t} \mid w_{1}, \ldots, w_{t-1}),$$

where for $t=1$ the conditional probability is simply the marginal probability $p(w_{1})$.
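
As a small illustration of this factorization, the sketch below computes the probability of a sequence word by word. The helper next_word_probs is hypothetical (it is not part of the course code); it stands in for any model that returns the conditional distribution over the vocabulary given the words so far.

```python
# Minimal sketch of the chain rule for sequence probabilities.
# `next_word_probs` is a hypothetical stand-in for a probabilistic model:
# given the words so far, it returns a dict mapping each word in the
# vocabulary to its conditional probability.

def sequence_probability(sequence, next_word_probs):
    prob = 1.0
    for t, word in enumerate(sequence):
        context = sequence[:t]                   # w_1, ..., w_{t-1}
        prob *= next_word_probs(context)[word]   # p(w_t | w_1, ..., w_{t-1})
    return prob
```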


Question A: Stochastic Text Generation

You are using a probabilistic model to generate the next word in the text sequence here I. Besides the words here and I, the vocabulary contains the words am, is, sat and you. We here use the notation $p(w_{3} \mid \text{here}, \text{I})$ as a shorthand for $p(w_{3} \mid w_{1}, w_{2})$ with $w_{1} = \text{here}$ and $w_{2} = \text{I}.$ The (conditional) probabilities predicted by the model are given by the following table:

  $w_{3}$     $p(w_{3} \mid \text{here}, \text{I})$
  am          0.589
  here        0.011
  I           0.017
  is          0.002
  sat         0.277
  you         0.104

If the text sequence is generated stochastically, which of the following statements are true?


The n-Gram Model

That the next predicted word should depend on the previous words in the text sequence is intuitive. For example, as a human, you would probably assign a high probability to the word flower following the sequence He watered the, but a very low probability to the same word if the sequence said He is the. This, of course, will also depend on the context and on the purpose of the model. But in most realistic settings, the text sequence He is the flower would be relatively unexpected.

The question that remains is how far back in the text sequence the dependence should reach. In so-called n-gram models, it is assumed that the probability of the next word depends only on the $n-1$ previous words. The name of these models originates from the word n-gram, referring to a sequence of $n$ items. In an n-gram model, even if the text sequence so far is longer than $n-1$ words (i.e. $t > n$), only the $n-1$ most recent words are used to calculate the probability of the next word.

The simplest n-gram model is the unigram model (corresponding to $n=1$, also called a bag-of-words model). In this model, the next word is assumed to have no dependence on previous words, and the probability of the next word is simply the marginal probability $p(w_{t})$. In the bigram model (corresponding to $n=2$), the probability of the next word depends only on the most recent word, according to $p(w_{t} \mid w_{t-1})$. In the trigram model ($n=3$), the probability of the next word depends on the two most recent words, as $p(w_{t} \mid w_{t-2}, w_{t-1})$, and so on. For a general $n$, an n-gram model predicts $p(w_{t} \mid w_{t-(n-1)}, w_{t-(n-2)}, \ldots, w_{t-1})$. The probability of a full sequence of $T$ words for this model is

$$p(w_{1}, w_{2}, \ldots, w_{T}) = \prod_{t=1}^{T} p(w_{t} \mid w_{t-(n-1)}, \ldots, w_{t-1}).$$

For example, consider the sequence Hi how are. For this text sequence, the unigram model would use no information from the sequence to predict the next word, the bigram model would use only the word are, and the trigram model would use both how and are (in that order). The probabilities provided by the trigram model will be far more informative than those provided by the unigram model, and also more informative than the bigram model's. In general, the more information that is used to calculate the probability of the next word, the better we might expect the predictions to be. But there is a trade-off: the larger the $n$, the larger the number of n-grams that can be constructed from the vocabulary. The more possible n-grams, the more complex our model will need to be to accurately model the probability $p(w_{t} \mid w_{t-(n-1)}, \ldots, w_{t-1})$.
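
To make this concrete, the short sketch below (not part of the course code) extracts the context that a unigram, bigram and trigram model would condition on for the sequence Hi how are:

```python
# Sketch: which previous words an n-gram model conditions on.
sequence = ["hi", "how", "are"]

def ngram_context(sequence, n):
    """Return the n-1 most recent words, i.e. the context used by an n-gram model."""
    if n == 1:
        return []               # unigram: no context at all
    return sequence[-(n - 1):]  # the n-1 most recent words

print(ngram_context(sequence, 1))  # [] (unigram)
print(ngram_context(sequence, 2))  # ['are'] (bigram)
print(ngram_context(sequence, 3))  # ['how', 'are'] (trigram)
```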

Figure: Illustration of n-gram models predicting the next word in the text sequence 'Hi how are'.

The careful reader might notice that for $t < n$, the n-gram conditional probability above would involve word indices smaller than one. Indeed, an n-gram model has a problem with the first $n-1$ words in a sequence, as there are not $n-1$ previous words to condition on. As an example, consider modelling the probability of the first word in a sequence using a bigram model. Since there is no previous word, it is unclear what the distribution should be conditioned on. This problem is often handled by introducing artificial start tokens at the beginning of a sequence. We will return to this with concrete examples in the code exercises below.


Question B: About n-Gram Models

Which of the following statements are true?


Word Embeddings

The input to our n-gram model consists of the $n-1$ previous words in the current text sequence. However, machine learning models typically do not accept text strings (the way we usually represent words) as input, but require the input to be numerical. In order to translate the words in our vocabulary into something that our model understands, we can use so-called word embeddings. A word embedding is a numeric representation of a word, constructed in such a way that words of similar meaning have similar numeric representations. We can imagine placing each word in the vocabulary in a vector space, where words of similar meaning lie close together, while words of very different meaning lie far apart.

There are many different kinds of word embeddings, but we will focus on one type which relies on the use of machine learning models. The general idea is as follows: take a machine learning model, such as a deep neural network, and train it for some task, for example classification or even text generation. Then, use the learned representations of words in the model to construct the word embeddings.

One method that uses this general idea to create word embeddings is word2vec, which we will also use in the coming code example. Central to this method is a binary classifier that is trained to recognize which words appear in the same context and which do not. Take the word elephant as an example. When presented with the word elephant together with another word, say desert, the task of the classifier is to answer the question: Does the word desert appear in the same context as the word elephant? By training on many such pairs of similar and dissimilar words, in a so-called self-supervised fashion, we eventually obtain a model that encodes useful information about the words. The model itself makes use of word embeddings, which are learned during training such that similar (dissimilar) words get similar (dissimilar) embeddings. It is these word embeddings that we can then extract and use for other tasks, such as text generation.
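
The course provides pre-computed embeddings, so you will not need to train any yourself. Purely as an illustration, the sketch below shows how word2vec-style embeddings could be trained on a tiny toy corpus using the gensim library (gensim is not used elsewhere in this course, and the toy corpus is made up):

```python
# Sketch (assumes the gensim library is installed): training skip-gram
# word2vec embeddings on a tiny toy corpus, for illustration only.
from gensim.models import Word2Vec

toy_corpus = [
    ["the", "elephant", "walked", "through", "the", "savanna"],
    ["the", "camel", "walked", "through", "the", "desert"],
    ["stocks", "rose", "on", "the", "exchange"],
]

model = Word2Vec(
    sentences=toy_corpus,
    vector_size=16,   # dimension of the word embeddings
    window=3,         # context window size
    min_count=1,      # keep all words, even rare ones
    sg=1,             # use the skip-gram training objective
)

vector = model.wv["elephant"]              # the embedding vector for a word
similar = model.wv.most_similar("walked")  # words with similar embeddings
```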


Building Your Own Trigram Model

With knowledge of some of the key concepts underpinning language modelling, it is now time to actually build and train a language model. The rest of this section will cover an example of implementing a trigram language model, training it and using it to generate text. This will demonstrate how many of the concepts discussed above are used in practice.

In order to train such a model it is necessary to use a large text dataset, called a corpus. We will here use the Reuters corpus (Reuters-21578, Distribution 1.0, source), containing news documents that appeared on the Reuters newswire in 1987. One should note that any language model is completely dependent on the data it has been trained on. As such, we should expect our language model to produce text that fits a 1987 news context. We should always have a critical mindset towards what data machine learning models are trained on and be aware that the models will produce outputs based on whatever is in that data, for better or for worse.

When working with text data, there are usually many pre-processing steps necessary to clean up the dataset and turn it into a useful format. Most of this has already been done for you, but it is good to have an idea of what kind of pre-processing can be useful. Some of the processing steps applied to the original data include:

  • Remove numbers and make all words lower-case.
  • Count words and define a vocabulary containing only words occurring at least 200 times.
  • Add start tokens <s> at the beginning and stop tokens </s> at the end of each sentence.
  • Turn the entire corpus into trigrams.
  • Filter out trigrams containing words not in the vocabulary.
  • Assign an index to each word in the vocabulary, and change trigrams to consist of word indices instead of the actual strings.

The pre-processing described above is chosen to keep things simple. There are many other steps one can perform, for example more sophisticated approaches to handling out-of-vocabulary words. Careful pre-processing has historically been very important for constructing useful language models. Most recent large language models do however work with representations that are closer to the raw text data, making extensive pre-processing unnecessary.

One important step above is the introduction of start and stop tokens. The use of start tokens is a convenient way to handle the fact that our model always works with trigrams, but at the start of a sentence there are no previous words. With start tokens, the first trigram in a sentence will look like (<s>, <s>, $w_1$). This will be useful when we want a language model to generate the first word in a sentence, as the model input will then be (<s>, <s>) and we can sample words from $p(w_1 \mid \text{<s>}, \text{<s>})$. Stop tokens </s> instead solve the problem of knowing when a sentence ends. With stop tokens the last trigram in a sentence will look like ($w_{T-1}$, $w_{T}$, </s>). Whenever we generate a stop token from our language model this means that the sentence is over, so we can replace the token with a “.” and start generating the next sentence.
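
To make the listed pre-processing steps and the start/stop tokens more concrete, here is a rough sketch of how raw sentences could be turned into index-based trigrams. It is not the actual course pre-processing code, and it uses a much lower count threshold than 200 so that the toy example produces output.

```python
# Sketch of the pre-processing steps described above (simplified).
from collections import Counter

sentences = [
    "The stock rose 3 percent".split(),
    "The stock fell sharply".split(),
]

# 1. Remove numbers and lower-case all words.
cleaned = [[w.lower() for w in s if not w.isdigit()] for s in sentences]

# 2. Build a vocabulary of words occurring at least `min_count` times.
counts = Counter(w for s in cleaned for w in s)
min_count = 2  # the course uses 200; kept small for this toy example
vocab = sorted(w for w, c in counts.items() if c >= min_count) + ["<s>", "</s>"]
word_to_index = {w: i for i, w in enumerate(vocab)}

# 3. Pad each sentence with start/stop tokens, extract trigrams, and keep
#    only trigrams whose words are all in the vocabulary.
tri_grams = []
for s in cleaned:
    padded = ["<s>", "<s>"] + s + ["</s>"]
    for i in range(len(padded) - 2):
        tri = padded[i : i + 3]
        if all(w in word_to_index for w in tri):
            # 4. Store word indices rather than strings.
            tri_grams.append([word_to_index[w] for w in tri])
```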

Run the code cell below to load the pre-processed text data.

The code loads 4 variables:

  • vocab: A 1-dimensional numpy-array containing all the words in the vocabulary.
  • word_to_index: A dictionary that maps each word (as a string) to its index in the vocabulary.
  • embeddings: A numpy array of shape (n_vocab, embedding_dim) containing word embeddings for each word in the vocabulary.
  • tri_grams: A numpy-array of shape (n, 3) containing the trigrams that we will use to train the model. Remember that we have formatted these to contain word indices rather than strings.
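
The exact loading cell is provided in the course environment. As a rough sketch of what it might look like, assuming the pre-processed data were stored in a NumPy archive (the file name below is a hypothetical placeholder):

```python
# Sketch only: the actual loading cell is provided on the course page.
# The file name "reuters_preprocessed.npz" is a hypothetical placeholder.
import numpy as np

data = np.load("reuters_preprocessed.npz", allow_pickle=True)

vocab = data["vocab"]                          # (n_vocab,) array of words
word_to_index = data["word_to_index"].item()   # dict: word -> index
embeddings = data["embeddings"]                # (n_vocab, embedding_dim)
tri_grams = data["tri_grams"]                  # (n, 3) array of word indices
```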

To better understand the format of the data, we will go over some examples on how to work with these variables. We will start by inspecting a trigram from the loaded data:
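
A cell along these lines does the trick (row 0 is just an arbitrary example):

```python
# Inspect one trigram (a row of word indices) and translate it back to words.
example_trigram = tri_grams[0]     # three word indices for one trigram
print(example_trigram)
print(vocab[example_trigram])      # fancy indexing gives the three words as strings
```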

Note how we can use vocab to translate from word indices to the actual words as strings.

Using word_to_index we can get the word index for any word in our vocabulary:
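
For example (the word "market" is just an example here; any word in the vocabulary works):

```python
# Look up the vocabulary index of a word via the word_to_index dictionary.
idx = word_to_index["market"]
print(idx)
print(vocab[idx])   # indexing vocab with idx gives back the word itself
```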

We have also loaded the word embeddings for all words in our vocabulary. See the code below for how we can index embeddings and extract the embedding vector for a word:
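
A sketch of such a cell, extended with a cosine-similarity check between two words (both assumed to be in the vocabulary) to illustrate that related words get similar embeddings:

```python
import numpy as np

# Extract the embedding vector of a word by indexing `embeddings`
# with the word's vocabulary index.
def embedding_of(word):
    return embeddings[word_to_index[word]]

vec = embedding_of("stock")
print(vec.shape)   # (embedding_dim,)

# Cosine similarity: closer to 1 for words with similar embeddings.
def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(embedding_of("stock"), embedding_of("market")))
```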

The examples above should give you an overview of how to work with these variables. Now it is your turn to try it out in the exercises below.


Question C: Look Up a Word

In our vocabulary, what is the index of the word “stock”? Assign your answer to the variable stock_index.


Question D: Look Up a Trigram

What word comes before “stock” in the trigram at row 17101 in tri_grams? Assign your answer to the variable word_before.


Neural Language Models

Many different kinds of machine learning methods can be used to model language. The most popular choice today is to use neural language models, which are based on neural networks. Within the family of neural networks, language models can be constructed using different kinds of network architectures. A family of architectures called Recurrent Neural Networks (RNNs) has seen much use for language modelling in the past (Mikolov et al., 2010). More recently, most neural language models are instead based on the Transformer architecture (Vaswani et al., 2017) (more on this later).

To focus on the core concepts, we will base our language model on a simple standard neural network architecture (an MLP, of the same type as in section 5.2). The neural network will take as input the first two words of a trigram (the context words) and then try to predict the final word. Of course, the neural network cannot take words (in the form of strings) as input, but this is where the word embeddings come into play. Before being fed to the model, the context words are embedded as vectors and concatenated to create a single input. For the model to be capable of predicting any word in the vocabulary, it needs n_vocab outputs, as many as there are words. The final layer of the network applies the softmax function, meaning that its output can be interpreted as a probability distribution over the vocabulary. For our trigram case, this model output corresponds to the conditional distribution with probabilities $p(w_{t} \mid w_{t-2}, w_{t-1})$. A schematic of how the model produces its prediction is shown in the figure below:

Figure: Overview of our neural language model.

Training our neural language model using the trigrams then reduces to a supervised classification problem. The two context words are the inputs, and the final word (or specifically its word index) is the target class.


Training the Language Model

We are now ready to implement our neural language model. For this we will utilize the MLPClassifier class from scikit-learn. This class is similar to the MLPRegressor used in section 5.2, but is used for classification instead of regression problems. We will first need to format the training data to work with the methods of the MLPClassifier class. To do this, we will separate the context words and label in each trigram. The context words will then be embedded and concatenated, as described above.
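
A rough sketch of this formatting step (the variable names are our own; the actual course cell may differ in details):

```python
import numpy as np

# Split each trigram into two context words (input) and one target word.
context_indices = tri_grams[:, :2]   # shape (n, 2): indices of the context words
target_indices = tri_grams[:, 2]     # shape (n,): index of the word to predict

# Embed both context words and concatenate the two vectors row-wise,
# giving one input vector of length 2 * embedding_dim per trigram.
X = np.concatenate(
    [embeddings[context_indices[:, 0]], embeddings[context_indices[:, 1]]],
    axis=1,
)
y = target_indices   # class labels are simply the word indices
```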

Once we have formatted the data in this way, we can instantiate and train the model. Some of the arguments used in constructing the MLPClassifier are worth taking note of:

  • hidden_layer_sizes=(128,128): We will use an MLP with two hidden layers of size 128.
  • validation_fraction=0.1: 10% of the data is reserved as a validation set. The accuracy on this set is printed after every epoch.
  • early_stopping=True and n_iter_no_change=2: Early stopping is used, meaning that if the validation accuracy is not improved for n_iter_no_change=2 consecutive epochs the model training is stopped.

See also the MLPClassifier documentation for more information about these arguments as well as other arguments used in the MLPClassifier.

Run the code block below to train the language model. Note that we are working with a substantial amount of data, so training this model can take a few minutes. Feel free to keep reading below while you wait (but note that you will not be able to run any more code until the training is done).
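
The actual training cell is provided on the course page; a sketch consistent with the arguments listed above could look like this (verbose=True is an assumption on our part, included so that the validation accuracy gets printed after each epoch):

```python
from sklearn.neural_network import MLPClassifier

# Sketch of the training cell, using the arguments described above.
model = MLPClassifier(
    hidden_layer_sizes=(128, 128),  # two hidden layers of size 128
    validation_fraction=0.1,        # hold out 10% of the data for validation
    early_stopping=True,            # stop when validation accuracy stalls...
    n_iter_no_change=2,             # ...for 2 consecutive epochs
    verbose=True,                   # print progress after every epoch
)
model.fit(X, y)
```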


Generating Text Using the Model

We will now use our trained language model to generate text. The generation is done stochastically, where each new word is sampled from the probability distribution output by the model. The main advantage of stochastic text generation over deterministic generation is that we achieve a greater variety in the output. With stochastic generation we can keep sampling sentences from only start tokens and possibly get different output every time.

In the code below, the function generate_sentence has been implemented for you. This function uses a trained language model to generate a sentence. At a high level, the generation procedure can be described as follows:

  1. Start from a sentence only containing two start tokens (<s>, <s>).
  2. Feed the last two words of the current sentence to the model as context.
  3. Sample the next word from the probability distribution output by the model.
  4. Append the sampled word to the sentence.
  5. If the sampled word was a stop token </s>, or the sentence has reached the maximum length, then return, otherwise repeat from step 2.

Note that a maximum sentence length is enforced in the generation. This is necessary because there is no guarantee that the model will generate a stop token within a reasonable number of words, possibly resulting in extremely long sentences. The generation function also features an option to feed a text prompt as input, which the model will then continue generating from.

Read through the code below to understand what this text generation scheme looks like when implemented. Note how the different parts of the code correspond to the steps described above. Then run the code to define the function.
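
For readers viewing this section outside the course environment, a rough reconstruction of such a function, following steps 1–5 above and using the variables and trained model from earlier, might look like the sketch below. Argument names and details are our own assumptions, not the course implementation.

```python
import numpy as np

# Sketch of a generate_sentence function following steps 1-5 above.
def generate_sentence(model, prompt=None, max_length=20):
    # 1. Start from two start tokens, optionally followed by a prompt.
    sentence = ["<s>", "<s>"] + (prompt or [])
    while len(sentence) < max_length + 2:
        # 2. Embed and concatenate the last two words as model input.
        context = [word_to_index[w] for w in sentence[-2:]]
        x = np.concatenate([embeddings[context[0]], embeddings[context[1]]])
        # 3. Sample the next word index from the model's output distribution.
        probs = model.predict_proba(x.reshape(1, -1))[0]
        next_index = np.random.choice(model.classes_, p=probs)
        next_word = vocab[next_index]
        # 4. Append the sampled word to the sentence.
        sentence.append(next_word)
        # 5. Stop if a stop token was generated.
        if next_word == "</s>":
            break
    # Strip the start tokens, drop any stop token and end with a period.
    words = [w for w in sentence[2:] if w != "</s>"]
    return " ".join(words) + "."
```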

Run the code block below to stochastically generate a sentence. You can run the code as many times as you want to generate different sentences.

You might immediately note that most of the generated sentences are nonsensical. Still, the text often flows similarly to correct English, and pairs of consecutive words are often grammatically correct. Even though there might be a sense of local coherence, whole sentences are generally not coherent and often jump between completely unrelated topics. This can to some extent be attributed to the MLP language model not being very powerful, but primarily to the fact that we are using a trigram model. As the model always takes just the last two words as input, there is no way to retain a coherent thread throughout the text.


Generating Text Based on Prompt

The generation function can also take as input a prompt (the start of a sentence) to generate from. Generating text from a prompt is often more useful than just outputting an arbitrary sentence. When we start from a prompt, we can direct the model to generate text about a specific topic. By running the code block below, the model will generate sentences starting with “The stock”. As before, we are sampling stochastically, so we can get different sentences just by re-running the code. Feel free to play around and change the prompt. Note, however, that we require all words in the prompt to be part of the vocabulary. This is because the prompt (or at least its last two words) will be used as context for the model, and we need to have its word embeddings available.
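
Using the sketch of generate_sentence above, prompting would simply amount to passing the (lower-cased, in-vocabulary) prompt words as a list:

```python
# Generate a few sentences starting from the prompt "the stock".
# Assumes the generate_sentence sketch and the trained model from above.
for _ in range(3):
    print(generate_sentence(model, prompt=["the", "stock"]))
```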

It should be noted that this type of prompting, where the start of the sentence is provided to the model, is a simple setup that works with our relatively simple model. In more recent and complex language models, there exist much more involved notions of prompting (Liu et al., 2023). These include the possibility to provide descriptions of the style of writing to generate text in, or pre-defined answer templates where the model only fills in a few words.


Question E: Predicted Probabilities

In order to be able to generate interesting text it is crucial that our model output represents a probability distribution over the vocabulary. In this exercise you will explore this probability distribution, for a specific prediction of the next word.

Given that a sentence begins with the two words “the stock”, find out:

  • What is the most probable next word, according to the model?
  • What is the probability of the next word being “fell”, according to the model?
  • What is the probability of the next word being “rose”, according to the model?

Assign your answer to each question to the corresponding variable in the code block below. Remember that you can use the variables vocab and word_to_index that were loaded previously.


Large Language Models

You’ve now seen how we can train and use a relatively simple language model. Our model has 152 922 parameters and is trained on 128 375 trigrams. Modern language models, usually referred to as large language models, are not only vastly larger than the one implemented here (with billions, or even trillions, of parameters) and trained on much larger datasets, but also come with additional features to boost model performance. For one, large language models are commonly based on transformers (Vaswani et al., 2017). For instance, the abbreviation GPT stands for Generative Pre-trained Transformer. Transformers are a type of deep learning model that uses a special attention mechanism to determine which parts of an input are important for the task at hand. In the context of large language models, the attention mechanism can be used to determine which parts of the current text sequence are important for predicting the next word in the sequence. In other words, we do not need to decide beforehand which words in the text sequence should be used in the prediction, as we need to in the n-gram model.

In more technical terms, we can view the attention mechanism as a dynamic filter that is learned during training along with the model weights. The filter is dynamic in the sense that it adapts to the current input (text sequence in the case of a language model) and it is flexible enough to focus on different parts of a text sequence depending on the content of that specific sequence. A comprehensive and accessible overview of transformers is given by Alammar (2018).

Apart from being based on transformers, large language models are usually trained on vast amounts of data, and some, like ChatGPT, are also trained with the help of human feedback.


Outro

The goal of this section was for you to get some insight into the workings of (large) language models. To this end, we have covered some important aspects of such models, like how they use (conditional) probabilities to predict the next word in a text sequence, and how to process text data before it is fed to the model. The language model implemented in this section has many similarities with large language models, but it is much simpler. It is evident that the text sequences that the model generates are often unrealistic and grammatically incorrect. In contrast, advanced language models like ChatGPT are capable of generating relatively long sequences of grammatically correct, realistic, and informative text. However, it is good to be aware that, although seemingly without error, language models (ChatGPT among them) do not by construction have any sense of whether the text they generate is actually true or not. Whether such a notion can emerge during the training process is a harder question to answer, and something that is widely discussed. In any case, it is good to keep an eye out for faulty information when using a language model like ChatGPT as a tool.

We encourage you to put in some extra time, and use the provided code blocks to play around with the implemented language model. For example, try out some more prompts for generating text sequences. Or why not retrain the model using other hyperparameters, to see how the choice of hyperparameters affects model performance. Perhaps you have something else that you would like to try out? Don’t hesitate!

OpenAI. (2023). ChatGPT: GPT-3.5 Based AI Assistant. https://chat.openai.com/

Jurafsky, D., & Martin, J. H. (2023). Speech and Language Processing (3rd ed. draft). https://web.stanford.edu/~jurafsky/slp3/

Mikolov, T., Karafiát, M., Burget, L., Cernocký, J., & Khudanpur, S. (2010). Recurrent neural network based language model. Interspeech 2010.

Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., & Neubig, G. (2023). Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9).

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS 2017.

Alammar, J. (2018). The Illustrated Transformer. https://jalammar.github.io/illustrated-transformer/
