# FOUNDATIONS OF MACHINE LEARNING

# Course info

# Course materials

# FAQ

Course materials

/

Section

# Classifying Digits with Gaussian Mixture Models

The MNIST dataset (Lecun et al., 1998), which is widely known in the machine learning community, consists of 28$\times$28 pixels grayscale images of handwritten digits. The dataset consists of 60 000 training and 10 000 test observations in total. While the digits in the MNIST dataset range from 0 to 9, you will consider only digits 6, 7 and 8 in this problem for brevity. You can see some examples in the figure below.

Assume that you have part of the MNIST dataset available on your computer, in total 1180 images with labels 6, 7 and 8. You are curious to see if you can create a QDA (Quadratic Discriminant Analysis) classifier that recognizes images based on which number they contain.

You split your data into two sets and keep 973 of the handwritten digits for training and save the rest for the purpose of testing your model. The images in your dataset have been “flattened” by stacking all pixel values in long vectors, such that all data points are vectors with length 784 (=28$\times$28). This is not necessary (for instance, recall the CNN model from section 5), but we will use the vector representation in this example for simplicity. To simplify things even further, and for the purpose of visualization, you decide that you should reduce the number of input variables from 784 to 2 with the help of Principal Component Analysis (PCA), before applying the QDA classifier.

Which of the following statements about QDA are **true**?

## Principal Component Analysis (PCA)

The idea behind dimensionality reduction is that we can compress the data into a lower-dimensional space, simplifying the problem at hand, and still (hopefully) have enough information to make inferences from it. In other words, the assumption is that there is some redundancy in the original data space and that the *effective data dimension* is smaller than the original dimension $p$. To illustrate the idea, consider first that a certain feature $x_j$, $j \in \{1,\dots,p\}$ is constant for all data points. Then, clearly, this particular feature carries no information about the general patterns in the data and could be dropped, effectively reducing the dimension of the input from $p$ to $p-1$. Similarly, even if the feature is not exactly constant across the data points, but if the variation is small, then we could drop this feature without any significant loss of information.

This is essentially what is done by PCA, but instead of simply looking for the individual features with small variation, the method looks for certain directions (i.e., linear combinations of features) in the input data space in which the data has low variance. Specifically, we reduce the dimension of the data array from $p$ to a new dimension $q < p$, by finding $q$ orthogonal directions in the original data space on which the data points are projected. The projection is done in the directions along which the data varies the most in order to, intuitively speaking, preserve as much information as possible. The hypothesis is that if there is little variation (i.e. low variance) across the data points in a certain direction, then there is no point keeping track of how the data varies in this particular direction. Similarly, if there is a lot of variation (i.e., high variance) in a certain direction, then it is important to keep track of.

As an illustration we consider figure 10.11 from the course book.

In this figure we have a dataset with $p=2$ dimensions (left plot). The PCA method will find two orthogonal directions, shown by arrows in the right plot. The first principal axis (red arrow) is the direction in which the data varies the most. The second principal axis (green arrow) is the next direction in which the data varies the most, but with the constraint that it has to be orthogonal to the first. Since the original data space is only 2-dimensional in this example (for illustrative purposes) there are only two principal axes, but the same idea can be applied to arbitrary $p$ in which case the method finds $p$ orthogonal principal axes of decreasing relevance.

It is possible to reduce the dimension of the data by projecting the data points on a few principal axes. In the right plot we have projected on the first principal axis, resulting in the red points. These can thus be viewed as a 1-dimensional representation of the original 2-dimensional data.

Fill in the missing lines in the code below to reduce the dimension of your dataset using PCA from 784 to 2. Use the `PCA`

class from scikit-learn. Notice that centering is performed as part of the dimensionality reduction, so you do not have to mean-center the data beforehand. The `PCA`

class has two functions that you should use:

`fit_transform`

which takes a data matrix, uses this to learn the principal directions, and then transforms the data matrix by projecting on the first`n_components`

directions.`transform`

which takes a data matrix and transforms this by projecting on the first`n_components`

directions that have already been learned.

*Hint:* As always, you should not use the test data to learn the model!

After you have run the code, you should obtain a visualization of your dimensionality-reduced training data. The code also prints out how much of the variance in the data is explained by each principal component.

*Remark*: For larger data arrays, the PCA class uses a randomized method to find the reduced data dimension, so keep the `random_state`

parameter at 0. This ensures that you get the same results every time you run the code.

It is time to fit the QDA model to the data, following method 10.1 of the course book. In this example we will have a closer look at the inner workings of the method than in the previous sections and you will get to implement your own training algorithm. This is done in order to shed additional light on the key principles behind the method and how it can be adapted to the semi-supervised setting later on. However, we note that scikit-learn comes with off-the-shelf implementations both for the fully supervised QDA classifier and the unsupervised Gaussian Mixture Model model, and there are also other options available.

Fill in the equations for estimating the prior class probabilities and the cluster means of the QDA classifier in the `fit_qda`

function below. Run the code to fit a QDA classifier to your data. At the end, the code will visualize the clusters predicted by your classifier. Each cluster is represented by an ellipse with radii corresponding to two standard deviations. The ellipses should theoretically cover approximately 95% of samples from the corresponding Gaussian distributions.

Now, let’s see if we can use the QDA classifier to make predictions on the test dataset. Finish the `predict_proba`

function such that it takes model parameters and a data array as input and predicts the probabilities $p(y=m\mid \mathbf{x})$ over all classes $m$ and for each input $\mathbf{x}$. Run the code to make predictions and to use these predictions to calculate the accuracy of your classifier on the test set.

*Hint:* to find $p(y\mid \mathbf{x})$, remember Bayes’ rule:

Use the *multivariate_normal* class from the `scipy.stats`

module to calculate $p(\mathbf{x} \mid y)$.

You locate even more images of handwritten digits on your computer that could potentially be used to boost the performance of your classifier. However, these images do not have accompanying labels and labeling 4027 images is a tedious task. Luckily, you know how to expand your training algorithm to also include unlabeled data points.

You will implement a function similar to the `fit_qda`

function from Question C, but that also uses unlabeled data. As explained in the course book, this algorithm will work in an iterative fashion, alternating between the following two main steps:

- Make predictions for the unlabeled data (E-step)
- Estimate model parameters (M-step)

We will implement functions for performing each of these two tasks separately. Once that is done, you will use these functions as building blocks in the final training algorithm. You can find a summary of the learning procedure in method 10.2 of the course book.

The `e_step`

function in the code block below is responsible for the E-step of the semi-supervised training algorithm. This function takes a data array as input and uses the previously implemented `predict_proba`

function to make predictions over the array. Study the code and run it before moving on to the next step.

The second step of the semi-supervised training algorithm is the M-step, in which model parameters are updated. This is done using both the labeled and the unlabeled data points, where the previously predicted class probabilities (E-step) over the unlabeled data points can be viewed as a type of “pseudo-labels”.
The `m_step`

function in the code block below has been created for this purpose. The input of this function is a data array (`X_train`

) as well as a matrix `w`

. This matrix will be calculated according to equation 10.10a in the course book. That is, if the data point at entry $i$ in the data array is unlabeled, then the element `w[i, m]`

is equal to the predicted probability of $\mathbf{x}_i$ belonging to class $m$. If data point $i$ is labeled, then `w[i, m] = 1`

if the observation belongs to class $m$ and `w[i, m] = 0`

, otherwise. On a closer look, you will notice that the `m_step`

function has some missing entries. Your task is to remedy this.

*Remark:* To help you in debugging your `m_step`

function, the code block also contains a small function test. This test will be run once you run the code and will print an error message if your function output does not correspond to the expected one. If your function passes the test, the code will print the message “The function seems to be correct!“. Note that once you submit your code, another more elaborate test will be run to check once again that your code is correct.

*Hint:* Equations 10.10b-d in the course book will be of use.

It is time to put the full semi-supervised training algorithm together. The `fit_semi_supervised`

function takes the labeled and unlabeled data as input and should return the model parameters of a GMM. As mentioned previously, the semi-supervised training algorithm works in an iterative fashion. To determine how many iterations that are enough for learning the model parameters, the function computes the data log-likelihood at each iteration and compares it with the previous value (the `compute_data_likelihood`

and `check_convergence`

functions). If the change in the data likelihood is small enough, the training stops.
To ensure that the algorithm does not run indefinitely in case convergence does not occur, the function has a `max_iter`

attribute which puts an upper limit on the number of iterations that will be run.

The `fit_semi_supervised`

function has two empty lines and your task is to add the`e_step`

and `m_step`

functions (with appropriate inputs) from the previous code blocks in the correct places. As in Question E, the code block below contains a function test, testing the final function on a small subset of the training data. The purpose of this test is to help you to discover and correct possible errors in your function before submitting your solution (which takes a bit longer than simply running the code). If correct, the test run should converge after 5 iterations.

*Remark*: Notice that the function initializes the model parameters from a supervised QDA classifier.

The code block below loads the full training data, including both labeled and unlabeled observations, trains a model using the training algorithm from the previous question and evaluates the final model on test data. The data has been reduced to two dimensions already using PCA, so no need to do that here.

Run the code. It will visualise the full training data as well as the final clusters predicted by the GMM. It will also print the data log-likelihood at every iteration of the training algorithm, the final model parameters and test accuracy. Do you observe an improvement in test accuracy compared to the previous model?

Hopefully, you were able to boost the performance of the model with the help of the unlabeled data. Labeling of a dataset can be a time-consuming task and semi-supervised methods can be used to increase model performance without extra time spent on adding labels to the data.

In this example, you used PCA to significantly reduce the dimensionality of the original dataset before fitting a model to the data. It turns out that the task of separating black and white images of a smaller set of digits is feasible even in this low-dimensional space. Under other circumstances, we might not want to reduce the dimensions of the data as much. For instance, one drawback of dimensionality reduction is that we lose the semantic meaning of the input features. For the MNIST images, the original features are grayscale values, describing the brightness of individual pixels. In the reduced space, we can not interpret the input features in the same way, which in turn reduces the interpretability of our model.

It is worth noting that semi-supervised learning can be used also together with more complex models, such as convolutional neural networks, which mitigates some of the issues mentioned above. However, such methods are beyond the scope of this course.

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. *Proceedings of the IEEE*, *86*(11), 2278-2324.

This webpage contains the course materials for the course ETE370 Foundations of Machine Learning.

The content is licensed under Creative Commons Attribution 4.0 International.

Copyright © 2021, Joel Oskarsson, Amanda Olmin & Fredrik Lindsten