# FOUNDATIONS OF MACHINE LEARNING

# Predicting Median House Values in California

Every 10 years, the United States Census is conducted. The US census counts the number of inhabitants in different areas of the country. This count is used to allocate the seats in the House of Representatives among the different states. The US Census Bureau additionally combines the census data with other statistical indicators, in particular ones related to economic and social conditions in different areas.

The California Housing dataset is based on the census data for the state of California, collected during the 1990 census (Pace & Barry, 1997).
Of course the Census Bureau does not publish information about single individuals.
Instead the information for a block of houses is aggregated into *block groups*, typically containing 600-3000 people.
This means that a block group both describes a group of buildings at a geographical position and the group of people living in those buildings.
Each data point in the dataset describes one such block group.
In total there are 20 640 data points.
The inputs, or features, in the dataset describe the housing situation and general economic conditions of the block group.
The 8 available features for each block group are, in order, as follows:

- Median income
- Median house age
- Average number of rooms
- Average number of bedrooms
- Population
- Average house occupancy
- Latitude (block group position)
- Longitude (block group position)

Some of these features have gone through some pre-processing, so their values might not correspond exactly to the units we would expect for the feature quantities. The target variable for this task is the median house value of a block group, also available in the dataset. The house value is measured in a unit of 100 000 dollars.

We will now go through the steps of building a linear regression model based on the data. In this example we will use 80% of the data for training, corresponding to 16 512 block groups. The remaining 4128 data points will be used as a test set.

To get started, let us first get acquainted with the variables involved in our problem. Using the same notation as the course book, for the California Housing data, what are the shapes of the following matrices and vectors: $\mathbf{X}$, $\mathbf{y}$, $\mathbf{\theta}$ and $\mathbf{X}^{\top}\mathbf{X}$? Assume that we have already extended each feature vector with a constant 1 in the first position, as described in chapter 3.1 of the book.
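As a quick sanity check, these shapes can be inspected in NumPy. The sketch below uses placeholder arrays with the stated training dimensions rather than the actual housing data:

```python
import numpy as np

n, d = 16_512, 8  # training points and number of features

# Placeholder arrays with the dimensions of the training data
# (in the exercise these would hold the actual California Housing data).
X = np.ones((n, d + 1))  # each row is [1, x_1, ..., x_8]
y = np.zeros(n)          # target vector of median house values
theta = np.zeros(d + 1)  # one parameter per column of X

print(X.shape)           # (16512, 9)
print(y.shape)           # (16512,)
print(theta.shape)       # (9,)
print((X.T @ X).shape)   # (9, 9)
```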

We will now make use of these variables to formulate and solve our learning problem.

The goal is to find model parameters $\hat{\boldsymbol{\theta}}$ that best fit our training data.
We will here do this by minimizing the *squared error* cost function

$$J(\boldsymbol{\theta}) = \frac{1}{n} \left\lVert \mathbf{X} \boldsymbol{\theta} - \mathbf{y} \right\rVert_2^2.$$

The optimal parameters $\hat{\boldsymbol{\theta}}$ can then be found according to

$$\hat{\boldsymbol{\theta}} = \operatorname*{arg\,min}_{\boldsymbol{\theta}} J(\boldsymbol{\theta}).$$

From a *maximum likelihood* viewpoint, what assumption have we made when using the squared error loss?

In the book we have seen that the solution for $\hat{\boldsymbol{\theta}}$ when using the squared error cost function is given by the *normal equations*.
The optimal parameters $\hat{\boldsymbol{\theta}}$ can be computed in closed form as

$$\hat{\boldsymbol{\theta}} = \left( \mathbf{X}^\top \mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{y}.$$

We will now implement this formula in Python.

Fill in the code below to compute the optimal parameter values $\hat{\boldsymbol{\theta}}$.
You will likely find the NumPy functions `transpose` and `linalg.inv` useful.
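A minimal sketch of such an implementation, here run on synthetic data standing in for the housing features (all variable names are illustrative, not taken from the exercise code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the training data: n points, 8 features,
# extended with a constant 1 in the first position.
n = 100
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 8))])
true_theta = rng.normal(size=9)
y = X @ true_theta + 0.01 * rng.normal(size=n)

# Normal equations: theta_hat = (X^T X)^{-1} X^T y
theta_hat = np.linalg.inv(np.transpose(X) @ X) @ np.transpose(X) @ y

print(theta_hat.shape)  # (9,)
```

With little noise in the synthetic data, `theta_hat` recovers `true_theta` closely, which is a useful check that the formula was implemented correctly.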

## Solving Systems of Linear Equations

Equations of the form $\mathbf{X} = \mathbf{A}^{-1} \mathbf{B}$ for vectors $\mathbf{X}$, $\mathbf{B}$ and a matrix $\mathbf{A}$ often show up in machine learning methods. The solution to the normal equations in linear regression is one example of this. In that case $\mathbf{X} = \hat{\boldsymbol{\theta}}$, $\mathbf{A} = \mathbf{X}^\top \mathbf{X}$ and $\mathbf{B} = \mathbf{X}^\top \mathbf{y}$.

While the expression is written $\mathbf{X} = \mathbf{A}^{-1} \mathbf{B}$, it is actually possible to find the value of $\mathbf{X}$ without explicitly computing the matrix inverse $\mathbf{A}^{-1}$. There are multiple reasons why this can be beneficial, the main one being that faster and numerically more robust methods for finding $\mathbf{X}$ exist. For Question C we glossed over this fact, partly because in this example $\mathbf{X}^\top \mathbf{X}$ is a small enough matrix that the inverse can be computed quickly and accurately.

The expression $\mathbf{X} = \mathbf{A}^{-1} \mathbf{B}$ really means that "$\mathbf{X}$ is the solution to the linear system of equations $\mathbf{A}\mathbf{X} = \mathbf{B}$". For example, when $\mathbf{X}$ and $\mathbf{B}$ are two-dimensional:

$$\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}.$$

Such a linear equation system can be solved in multiple different ways that do not require inverting $\mathbf{A}$.
You might for example be familiar with the method of Gaussian elimination, where a number of row-operations are applied to transform the system into a simpler, equivalent form.
In practice we can simply rely on methods implemented in libraries like NumPy.
In NumPy the function `linalg.solve` can be used to find $\mathbf{X}$ by solving a linear system.
This is both faster than computing $\mathbf{A}^{-1}$ and has other practical benefits.
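A small comparison of the two approaches on a random system (illustrative example, not part of the exercise code):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
B = rng.normal(size=4)

x_inv = np.linalg.inv(A) @ B     # explicit inverse (avoid in practice)
x_solve = np.linalg.solve(A, B)  # solves A x = B directly

# Both approaches give the same solution (up to numerical precision).
print(np.allclose(x_inv, x_solve))  # True
```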

It is also quite common that we need to solve systems $\mathbf{A}\mathbf{X} = \mathbf{B}$ many different times for the same $\mathbf{A}$, but with different values of $\mathbf{B}$.
As an example we might want to create multiple linear regression models using the same features $\mathbf{X}$, but trained to predict different target variables $\mathbf{y}$.
In such cases we can speed up the computation even more by techniques tailored for this exact situation.
One can make use of the *LU-factorization* that rewrites the matrix $\mathbf{A}$ in a form that allows for quickly solving linear systems.
More information about the practical aspects of solving linear systems can be found in most books on numerical methods.
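Assuming SciPy is available, this pattern can be sketched with `scipy.linalg.lu_factor` and `lu_solve`: the factorization of $\mathbf{A}$ is computed once and then reused for several right-hand sides.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))

# Factorize A once (LU-factorization with partial pivoting)...
lu_piv = lu_factor(A)

# ...then solve A x = b cheaply for many different right-hand sides b.
for _ in range(3):
    b = rng.normal(size=5)
    x = lu_solve(lu_piv, b)
    assert np.allclose(A @ x, b)  # x solves the system
```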

Now that we have computed $\hat{\boldsymbol{\theta}}$ we can use these parameters to make predictions. Run the code below to make predictions and compute the mean squared error (MSE) for the training and test sets.
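The code cell itself is not reproduced here; a sketch of what such prediction and evaluation code might look like, using synthetic data in place of the housing features (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the 80/20 train/test split of the data.
n = 100
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 8))])
y = X @ rng.normal(size=9) + 0.1 * rng.normal(size=n)
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

# Fit via the normal equations (solving the system instead of inverting).
theta_hat = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)

def mse(y_true, y_pred):
    """Mean squared error."""
    return np.mean((y_true - y_pred) ** 2)

# Predictions are matrix-vector products with the learnt parameters.
print("Train MSE:", mse(y_train, X_train @ theta_hat))
print("Test MSE:", mse(y_test, X_test @ theta_hat))
```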

Now consider the data point described by

(including the constant 1 at the first position). This is a block group from Bakersfield in California, present in the training set of our data. The true median house value for the block group is 1.029 ($\times$ 100 000 dollars).

Perform prediction for $\mathbf{x}'$ using the linear regression model with parameters $\hat{\boldsymbol{\theta}}$. What median house value does our model predict for the Bakersfield block group?

According to our linear regression model, what would the median house price of the block group (from question D) be if it was located in central Los Angeles? Let the position $(\text{Latitude}, \text{Longitude}) = (34.06, -118.25)$ represent central Los Angeles. Leave all other features of $\mathbf{x}'$ but the position unmodified.
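In code, changing only the position amounts to overwriting two entries of the feature vector before predicting again. The sketch below uses hypothetical values for $\mathbf{x}'$ and a zero placeholder for $\hat{\boldsymbol{\theta}}$, since the actual numbers are given in the exercise; the key point is that, with the constant 1 in entry 0, latitude and longitude are entries 7 and 8.

```python
import numpy as np

# Hypothetical feature vector, for illustration only -- the actual
# Bakersfield values are given in the exercise. Order (after the
# constant 1): income, house age, rooms, bedrooms, population,
# occupancy, latitude, longitude.
x_prime = np.array([1.0, 2.6, 29.0, 5.1, 1.0, 1340.0, 3.2, 35.4, -119.0])
theta_hat = np.zeros(9)  # placeholder for the learnt parameters

# Prediction is an inner product with the parameters.
y_pred = x_prime @ theta_hat

# Move the block group to central Los Angeles: only entries 7 and 8
# (latitude, longitude) change, all other features stay the same.
x_la = x_prime.copy()
x_la[7], x_la[8] = 34.06, -118.25
y_pred_la = x_la @ theta_hat
```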

## Regression Analysis

Fitting a regression model to observational data as in the example above can serve different purposes.
Firstly, the learnt model allows us to *predict* (i.e., estimate/guess) the median selling price for a
block group which is not present in the data. Secondly, the estimated parameters $\hat{\boldsymbol{\theta}}$ can tell us
something about how different features impact the target variable. In this example, they would describe
if and how different properties of a block group impact housing prices in California.
Such an investigative process is called *regression analysis*. It is sometimes argued that the key difference
between machine learning and statistics is that prediction is the primary objective in machine learning, whereas
statisticians are primarily interested in understanding the process that has generated the data (e.g., through
regression analysis). However, this is not a clear-cut distinction. Indeed, prediction is a central concept in classical statistics
as well and, likewise, much effort is spent by the machine learning community on developing interpretable models
that can be used to better understand the data.

As we saw in question D, the linear regression model is not always that accurate for this dataset.
The basic linear regression model assumes that the relationship between $\mathbf{X}$ and $\mathbf{y}$ is linear, an assumption that is not always true in practice.
Many relationships that we want to model, both in nature and human-made contexts, are indeed highly *non-linear*.
This is likely also the case for the housing values in California.

As we hypothesize that there are non-linear relationships between $\mathbf{X}$ and $\mathbf{y}$, we want to use a regression model that is flexible enough to capture this. To this end we consider a second order polynomial regression model.

The course book describes polynomial regression for one-dimensional $x$. In that case the second order polynomial regression model is given by

$$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \epsilon.$$

In the California housing dataset $\mathbf{X}$ is 8-dimensional (we have 8 different features). We can extend the second order polynomial model to this data by building a second order polynomial in each feature $x_j$:

$$y = \theta_0 + \sum_{j=1}^{8} \theta_j x_j + \sum_{j=1}^{8} \theta_{8+j} x_j^2 + \epsilon.$$

This model has a total of 17 parameters, or equivalently $\boldsymbol{\theta}$ is a 17-dimensional vector. To get an even more flexible model we can also consider second order terms containing two different elements of $\mathbf{X}$, commonly referred to as *interaction terms*, for example $x_1 x_2$ and $x_3 x_7$. Such a model can be specified as

$$y = \theta_0 + \sum_{j=1}^{8} \theta_j x_j + \sum_{j=1}^{8} \sum_{k=j}^{8} \theta_{jk} x_j x_k + \epsilon.$$

This model has 45 parameters. We will here work with this more flexible model.

We can think of a polynomial regression model in two ways. Either as a specific type of regression model, specified by the equations above. Alternatively, we can view it as the combination of a non-linear feature transformation and a normal linear regression model. For the more flexible model we can define a 45-dimensional vector

$$\tilde{\mathbf{x}} = \begin{pmatrix} 1 & x_1 & \cdots & x_8 & x_1^2 & x_1 x_2 & \cdots & x_8^2 \end{pmatrix}^\top$$

containing all of our polynomial features. After transforming $\mathbf{X}$ to $\tilde{\mathbf{X}}$, the model itself is just a linear regression model:

$$y = \tilde{\mathbf{x}}^\top \boldsymbol{\theta} + \epsilon.$$

This viewpoint of combining a feature transformation with a linear model can come in handy when implementing this in practice.
We will here again make use of scikit-learn to construct both our polynomial features and model.
The `PolynomialFeatures` class can be used to expand a feature vector $\mathbf{X}$ to polynomial features in the same way that $\tilde{\mathbf{X}}$ was constructed.
A linear regression implementation is also available as the `LinearRegression` class.
Putting these together we can create a non-linear regression model for the California housing dataset.

The code below uses `PolynomialFeatures` to construct second order polynomial features for the data.
A linear regression model is then fitted to these new features.
Again, remember that this model will be linear in the features $\tilde{\mathbf{X}}$, but not in our original features $\mathbf{X}$ because of the polynomial transformation.
Finally the mean squared error is computed for both the training and test datasets.
Run the code and compare the MSE on the test set to that of the linear model.
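The code cell is not reproduced here; a sketch of the pipeline on synthetic 8-feature data (standing in for the housing features, with illustrative variable names) might look as follows:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic stand-in for the 8-feature housing data.
X = rng.normal(size=(200, 8))
y = X[:, 0] ** 2 + X[:, 1] * X[:, 2] + 0.1 * rng.normal(size=200)
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

# Second order polynomial features, including the constant 1:
# 1 + 8 linear + 8 squared + 28 interaction terms = 45 features.
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
print(X_train_poly.shape[1])  # 45

# Linear regression on the transformed features.
model = LinearRegression()
model.fit(X_train_poly, y_train)

print("Train MSE:", mean_squared_error(y_train, model.predict(X_train_poly)))
print("Test MSE:", mean_squared_error(y_test, model.predict(X_test_poly)))
```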

We see that this non-linear model is able to achieve a lower test MSE for the dataset. The MSE for the training data is however even lower and there is a noticeable gap between the value for training and test data. This indicates that the model has started to overfit to the training data. As we made the model non-linear, introducing more parameters, we increased the flexibility of the model. This means that the model is able to pick up on more complicated relationships in the data. However, not all such relationships that the model might find in the training data are necessarily characteristic for the actual task. There can be relationships between $\mathbf{X}$ and $\mathbf{y}$ in our training set that are simply a result of "noise" in the data. When the model becomes more flexible, it can start to pick up on these non-useful relationships and thus overfit to the training data. It then loses its ability to generalize to new samples, which is our ultimate goal.

One way to counteract overfitting is through regularization. Regularization essentially limits the flexibility of the model in order to improve its ability to generalize. We will now consider one specific kind of explicit regularization, $L^2$-regularization. $L^2$-regularization limits the model flexibility by penalizing large values of $\boldsymbol{\theta}$. This is achieved by adding a penalty term to the linear regression optimization problem. That is, when training the model we minimize the modified cost function

$$\tilde{J}(\boldsymbol{\theta}) = \frac{1}{n} \left\lVert \mathbf{X} \boldsymbol{\theta} - \mathbf{y} \right\rVert_2^2 + \lambda \left\lVert \boldsymbol{\theta} \right\rVert_2^2.$$

The hyperparameter $\lambda$ determines how strong the regularization should be, or in other words how much the flexibility of the model should be limited. The new objective $\tilde{J}$ is then again minimized to find the new optimal parameters. Luckily this optimization problem also has an analytical solution that is straightforward to compute. See chapter 3.3 in the book for details.

To see if we can avoid some of the overfitting behavior, we add $L^2$-regularization to our polynomial regression model from before.
Linear regression (remember that the model is still linear in $\tilde{\mathbf{X}}$) with $L^2$-regularization is commonly referred to as *ridge regression*.
In scikit-learn this is even implemented in its own separate class called `Ridge`.
We choose $\lambda = 1$ ($\lambda$ is named `alpha` in the `Ridge` class) and fit this new model to the California housing training data.
Run the code below to fit and evaluate the complete polynomial regression model with $L^2$-regularization.
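The code cell is not reproduced here; a sketch of the ridge regression pipeline on the same kind of synthetic stand-in data as above might look like this:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic stand-in for the 8-feature housing data.
X = rng.normal(size=(200, 8))
y = X[:, 0] ** 2 + X[:, 1] * X[:, 2] + 0.1 * rng.normal(size=200)
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Ridge regression: lambda corresponds to the alpha argument.
model = Ridge(alpha=1.0)
model.fit(X_train_poly, y_train)

print("Train MSE:", mean_squared_error(y_train, model.predict(X_train_poly)))
print("Test MSE:", mean_squared_error(y_test, model.predict(X_test_poly)))
```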

Comparing these results to the ones above, we first note that the *training error* is slightly larger. This is
not surprising since we have made the model less flexible by adding the regularization penalty, so it will not be
able to fit the training data as well as the non-regularized model.
However, importantly, the *test error* is smaller for the ridge regression model!
Adding on the regularization seems to have had the desired effect.
When the model flexibility is limited it is not able to adapt to all small variations in the training data.
Instead it will focus on learning the general input-output relationship, which is good for generalization.
Note that the training and test errors are now closer to each other.

In the end, we are not interested in a model with low error on the training dataset. What we want is a model that generalizes to new, unseen samples. Regularization is an important tool for controlling the generalization of our models.

You have now completed one of the most fundamental tasks in machine learning: training and predicting with a linear regression model.
Hopefully this has also given you an idea of the general concept of parametric models.
Once we have found $\hat{\boldsymbol{\theta}}$ we do not actually need to keep the training data any more.
We could even send the values of $\hat{\boldsymbol{\theta}}$ to someone else and they would be able to make just as good predictions, without ever having had access to the training data itself.
*The model is described by the learnt parameter values.*
This is one useful property of parametric models.

This example has introduced the linear regression model (as well as the polynomial regression model, which can be viewed as linear regression applied to a problem with transformed inputs/features). Throughout this course you will learn about a number of other models as well, both linear and non-linear, and parametric and non-parametric. Indeed, you have already encountered k-Nearest Neighbors, which is an example of a non-parametric non-linear model.

Pace, R. K. & Barry, R. (1997). Sparse spatial autoregressions. *Statistics & Probability Letters*, *33*(3), 291-297.

This webpage contains the course materials for the course ETE370 Foundations of Machine Learning.

The content is licensed under Creative Commons Attribution 4.0 International.

Copyright © 2021, Joel Oskarsson, Amanda Olmin & Fredrik Lindsten