FOUNDATIONS OF MACHINE LEARNING


Section Introduction

Neural networks have become a staple of modern machine learning, and the growing interest in the area can to a large extent be attributed to the many success stories of this type of model. In one way or another, neural networks underlie great accomplishments of modern computer science such as:

Neural networks are clearly powerful models and as we can see from these examples they are applicable to a multitude of areas. Practically all areas of computer science have been impacted by machine learning through neural network-based models. Neural networks even come in specific “flavors”, often tailored to work with structured data such as images, text, sounds or videos.

Even though these models have been extensively developed and have achieved great success, they do not come without issues. To begin with, training neural networks generally requires very large amounts of data. Gathering large sets of high-quality data can be hard or expensive in many real-world scenarios. Even when there is enough data to train well-performing models, a problematic property of these models is that they lack interpretability. A trained neural network can be very good at performing the task it was trained for, but it is difficult to figure out how it actually achieves this. This means that it can be very hard to find the model’s failure cases or detect possible built-in biases, something that has led to severe problems in practice:

You will learn more about these types of issues in section 7 of this course. For now we will look at the details of the neural network model. We hope that this will give you an understanding of how this type of model can be so effective, but also a hunch about why some of these issues might occur.

Start by reading chapter 6.1 in the course book, which describes the general form of a neural network. As you will see, neural networks are another example of parametric models. The training procedure for these networks is based on the gradient descent optimization method. Below we provide a short introduction to gradient descent, which should be enough for you to grasp how neural networks are trained. If you are interested in more details you can read chapters 5.2 and 5.3 in the course book, which also cover other optimization methods for parametric models.

Gradient descent

Learning the parameters $\boldsymbol{\theta}$ of a parametric model is typically done by solving an optimization problem of the form

$$\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$$

for some objective function $J$. In particular, this is the case when training neural networks. One way to solve this type of problem is by the gradient descent method. Gradient descent can be applied to optimization problems of this type as long as it is possible to compute gradients of $J$ with respect to the parameters $\boldsymbol{\theta}$. Recall that the gradient, denoted $\nabla$, is the analog of the derivative for functions of multiple variables (for example the function $J$ of the vector $\boldsymbol{\theta}$). The gradient is a vector where each element is a partial derivative. For an objective function $J$, the gradient with respect to a $d$-dimensional parameter vector $\boldsymbol{\theta}$ is:

$$\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \begin{bmatrix} \frac{\partial J(\boldsymbol{\theta})}{\partial \theta_1} \\ \vdots \\ \frac{\partial J(\boldsymbol{\theta})}{\partial \theta_d} \end{bmatrix}$$

Just as the derivative of a function $f$ describes how $f(x)$ varies as $x$ changes, the gradient describes how a function $J(\boldsymbol{\theta})$ varies as each element in $\boldsymbol{\theta}$ changes.

A particularly useful property is that the gradient describes the direction in which $J$ increases most rapidly. Note that this is a direction in the $d$-dimensional parameter space. In the same way the negative gradient, $-\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$, describes the direction in which $J$ decreases the fastest.

Consider for example the simple function $J(\boldsymbol{\theta}) = J(\theta_1, \theta_2) = \theta_1 + \theta_2$. The parameter vector $\boldsymbol{\theta}$ contains only $\theta_1$ and $\theta_2$, so the parameter space is 2-dimensional in this case. The figure below shows the value of $J(\boldsymbol{\theta})$ for different values of $\theta_1$ and $\theta_2$. The gradient $\nabla J(\boldsymbol{\theta})$ and its negation $- \nabla J(\boldsymbol{\theta})$ are also shown as arrows, evaluated at the point $\boldsymbol{\theta} = [0, 0]^\top$.

Gradient example

From the figure we see clearly that if we want to decrease the value of $J(\boldsymbol{\theta})$ as much as possible, we should change $\boldsymbol{\theta}$ in the direction of $- \nabla J(\boldsymbol{\theta})$. In gradient descent we will use this to find values of $\boldsymbol{\theta}$ that minimize the function $J$.
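As a concrete check of this property, here is a small Python sketch that approximates the gradient of this $J$ at $\boldsymbol{\theta} = [0, 0]^\top$ with finite differences. The helper `numerical_gradient` is written just for this page, not taken from any library:

```python
import numpy as np

def J(theta):
    # Objective from the figure: J(theta) = theta_1 + theta_2
    return theta[0] + theta[1]

def numerical_gradient(f, theta, eps=1e-6):
    # Central-difference approximation of each partial derivative
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

theta = np.array([0.0, 0.0])
print(numerical_gradient(J, theta))  # approximately [1. 1.]
```

The result is approximately $[1, 1]^\top$, matching the gradient arrow in the figure, and its negation $[-1, -1]^\top$ is the direction of steepest descent.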

The high-level idea behind gradient descent is to iteratively update $\boldsymbol{\theta}$ in such a way that $J(\boldsymbol{\theta})$ decreases. If we think of $\boldsymbol{\theta}$ as a point in some $d$-dimensional space, this can be done by taking a small step in the direction of the negative gradient. If we keep taking such steps, the value of $J(\boldsymbol{\theta})$ should decrease until we reach a minimum. This idea is formalized in the gradient descent update rule:

$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \gamma \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}^{(t)})$$

The constant $\gamma$, often referred to as the learning rate, is the length of each small step in the parameter space. Many methods exist for choosing $\gamma$ in more or less principled ways, but often it is left as a hyperparameter to be tuned by the machine learning practitioner.
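Written as code, the basic algorithm is only a few lines. The following is a minimal Python sketch, assuming the caller supplies a function `grad_J` that computes $\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$; all names here are illustrative:

```python
import numpy as np

def gradient_descent(grad_J, theta0, gamma=0.1, num_steps=100):
    # Repeatedly take a small step in the direction of the negative gradient.
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_steps):
        theta = theta - gamma * grad_J(theta)  # the update rule above
    return theta
```

A practical implementation would typically also include a stopping criterion, for example terminating once the norm of the gradient falls below some small threshold.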

Let us look at an example of gradient descent optimization in the very simple case of a one-dimensional $\theta$. In this case the gradient of $J$ is simply the ordinary derivative. For this example we want to minimize

$$J(\theta) = 2(\theta - 1)^2 + 1,$$

which has the gradient

$$\nabla_{\theta} J(\theta) = 4(\theta - 1).$$

The objective function $J$ is also plotted below:

Objective function

We initialize $\theta^{(0)} = 2$ and run the gradient descent process with learning rate $\gamma=0.1$. The initial value of the objective function is

$$J(\theta^{(0)}) = J(2) = 2(2 - 1)^2 + 1 = 3$$

and the gradient at this point is

$$\nabla_{\theta} J(\theta^{(0)}) = 4(2 - 1) = 4.$$

We take one gradient descent step by computing $\theta^{(1)}$ using the gradient descent update rule:

$$\theta^{(1)} = \theta^{(0)} - \gamma \nabla_{\theta} J(\theta^{(0)}) = 2 - 0.1 \cdot 4 = 1.6$$

This first gradient descent step is illustrated in the figure below:

Gradient descent step 1

We can compute the value of the objective function at this new $\theta^{(1)}$: $J(\theta^{(1)}) = J(1.6) = 1.72$. As expected, the value has decreased after the gradient descent step. If we take another step in the same way we reach $\theta^{(2)} = 1.36$ and an even lower value of $J$, as illustrated in the following figure. If you want more practice with the algorithm, you can perform the computations for a few more gradient descent steps manually.

Gradient descent step 2

As this process is iterated, the value of $\theta$ approaches the minimum at $\theta = 1$. Note that even though we always use the same learning rate $\gamma$, the lengths of the steps vary. This is because the actual step length is also affected by the magnitude of the gradient, which differs for different values of $\theta$. Already after 5 steps the value of $\theta$ is fairly close to the minimum:

Gradient descent step 5
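If you prefer to let the computer do the arithmetic, the iterates above can be reproduced with a short Python loop, using the objective and gradient from this example:

```python
theta = 2.0   # theta^(0)
gamma = 0.1   # learning rate

for t in range(1, 6):
    grad = 4 * (theta - 1)        # gradient of J(theta) = 2(theta - 1)^2 + 1
    theta = theta - gamma * grad  # gradient descent update rule
    J = 2 * (theta - 1) ** 2 + 1
    print(f"theta^({t}) = {theta:.4f},  J = {J:.4f}")
```

The first two printed lines match the values computed by hand, $\theta^{(1)} = 1.6$ and $\theta^{(2)} = 1.36$.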

In this simple example we could have computed the optimal value of $\theta$ analytically, making gradient descent optimization unnecessary. It does, however, serve as a good illustration of the iterative gradient descent process. For most parametric machine learning models $J$ is far more complicated than in this example, and $\theta$ is commonly high-dimensional. In these cases gradient-based optimization techniques like gradient descent are usually the best option for finding a minimum of the objective function.

One question worth asking is whether the gradient descent process described above will always find the global minimum $\hat{\boldsymbol{\theta}}$. It turns out that this is not generally the case: only if the objective function $J$ is convex and $\gamma$ is chosen to fulfill some specific criteria can gradient descent be guaranteed to find the global minimum. Many optimization problems encountered in machine learning are non-convex, and this is true in particular when training neural networks. Should we then give up on gradient descent as an optimization method? No, that would be a premature conclusion. As it turns out, in many practical optimization problems it is enough to find a $\boldsymbol{\theta}$ that is a local minimum, or even just close to a local minimum. In particular this is the case when training neural networks. This means that gradient descent is indeed a very useful method in practice, even though we have no guarantee of finding the optimal solution.

The description of gradient descent given here is the most basic form of the algorithm. There are a number of practical aspects to using gradient descent that are worth knowing about. When working with very large amounts of data it might not be computationally feasible (nor necessary) to compute the exact gradient at each iteration. It is then common to approximate the gradient using only a subset of the data, referred to as a mini-batch. This modified version of the algorithm is called Stochastic Gradient Descent (SGD). Mini-batching is more or less always used when training neural networks. SGD can be extended further by incorporating clever schemes for automatically adjusting the learning rate $\gamma$ throughout the optimization process. These extensions are often referred to as optimizers. Some common optimizers used for training neural networks are Adam and RMSprop. You can read more about these in chapter 5.5 of the course book.
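To make the mini-batch idea concrete, here is a Python sketch of SGD. The interface `grad_J_batch` (a function returning the gradient estimated from a single mini-batch) and the other names are assumptions made for this illustration, not part of any particular library:

```python
import numpy as np

def sgd(grad_J_batch, theta0, data, gamma=0.01, batch_size=32, num_epochs=10):
    # Stochastic gradient descent: each update uses a gradient estimate
    # computed on a small random subset (mini-batch) of the data.
    # `data` is assumed to be a NumPy array of training examples.
    theta = np.asarray(theta0, dtype=float)
    n = len(data)
    for _ in range(num_epochs):
        perm = np.random.permutation(n)  # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            batch = data[perm[start:start + batch_size]]
            theta = theta - gamma * grad_J_batch(theta, batch)
    return theta
```

Optimizers such as Adam and RMSprop keep this mini-batch structure but replace the fixed step length $\gamma$ with adaptive, per-parameter step sizes.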

Now that you have some understanding of the gradient descent method, we can move on to see how it is applied to training neural networks. You can read about the training procedure in chapter 6.2 of the course book. Finally, read chapter 6.3 about convolutional neural networks, a special type of neural network designed specifically for image data.

With some understanding of what neural networks are and how they can be trained, we will now apply them to two different problems. In the first task you will be solving a regression problem related to the strength of different concrete mixtures. We will then move on to the quite challenging task of classifying traffic signs, a central problem in autonomous driving. To solve this task we will make use of convolutional neural networks and more advanced software libraries for deep learning.
