FOUNDATIONS OF MACHINE LEARNING

Predicting Diabetes Progression


Introduction

Several factors can determine whether a person develops a disease and how quickly it progresses. In the paper presenting the model selection algorithm “Least angle regression” (Efron et al., 2004), the authors use model selection to find variables that are relevant for the progression of diabetes. The goal was to identify these variables and to predict disease progression one year ahead of time. In this exercise you will use the diabetes dataset from the paper, but instead of identifying relevant variables, you will compare a k-NN model and a linear regression model to decide which is better suited for predicting the progression of diabetes.

The diabetes dataset can be loaded from the sklearn.datasets module of the scikit-learn library in Python using the load_diabetes() function. Each row in the dataset corresponds to one diabetes patient, for whom the following values have been recorded: age, sex, BMI, average blood pressure, and six different blood serum measurements. The numerical input features in the dataset have already been scaled and all lie within the range of -0.2 to 0.2. We will omit the categorical variable (sex). The target variable (output) is a numerical measure, ranging from 25 to 346, describing the disease progression one year after the original measurements. The complete dataset consists of 442 observations. From these, we have randomly sampled a set of 353 observations that will be used for cross-validation to choose between a linear regression model and a k-NN model. We have kept the remaining 89 data points in a separate test set.
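
A minimal sketch of how the data described above might be prepared. The split uses `train_test_split` with an arbitrary `random_state`, which is an assumption: the course's own random sample of 353 observations is not specified here, so this sketch will not reproduce it exactly.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Load the features and the target as NumPy arrays.
data = load_diabetes()
X, y = data.data, data.target

# Drop the categorical "sex" column, leaving nine numerical features.
sex_idx = list(data.feature_names).index("sex")
X = np.delete(X, sex_idx, axis=1)

# Assumed split (seed chosen arbitrarily): 353 training observations,
# with the remaining 89 kept aside as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=353, random_state=0
)

print(X.shape, X_train.shape, X_test.shape)
```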


Question A: Cross-validation

You will use 10-fold cross-validation to compare a linear regression model and a k-NN model with $k=17$ nearest neighbors (conveniently, someone has already determined that this number of neighbors is near optimal for the k-NN model). In 10-fold cross-validation, we split the training set into 10 batches and fit the model ten times, each time excluding one of the batches. The excluded batch is then used to evaluate the current model, resulting in an intermediate estimate $E_{\text{hold-out}}^{(\ell)}$ of the expected new data error. Finally, we obtain the k-fold cross-validation error according to

$$E_{\text{k-fold}} = \frac{1}{k} \sum_{\ell=1}^{k} E_{\text{hold-out}}^{(\ell)},$$

which can be used as an estimate of the expected new data error of the final model trained on the full training set. Here $k$ denotes the number of folds (10 in this exercise), not the number of neighbors in the k-NN model.
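
The procedure above, fitting on all batches but one, evaluating on the excluded batch, and averaging the hold-out errors, can be sketched in plain Python. The names `k_fold_error`, `fit_mean`, and `mse_of_constant` are hypothetical helpers introduced for illustration, and the toy "model" that always predicts the training mean is an assumption chosen only to keep the example short.

```python
import random

def k_fold_error(xs, ys, k, fit, error, seed=0):
    """Average the hold-out error over k train/validation splits."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    # Split the shuffled indices into k batches of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    hold_out_errors = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        # Fit on everything except the excluded batch ...
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        # ... and evaluate on the excluded batch only.
        hold_out_errors.append(
            error(model, [xs[i] for i in fold], [ys[i] for i in fold])
        )
    return sum(hold_out_errors) / k, hold_out_errors

# Toy model (illustration only): always predict the mean of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

def mse_of_constant(prediction, xs, ys):
    return sum((prediction - y) ** 2 for y in ys) / len(ys)

xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]
e_k_fold, per_fold = k_fold_error(xs, ys, k=10, fit=fit_mean, error=mse_of_constant)
print(len(per_fold), e_k_fold)
```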

What is true about $E_{\text{k-fold}}$?


Question B: Linear Regression Model

You do not have to implement the cross-validation from scratch but will use the cross_val_score function from the scikit-learn library for this purpose.

We will use the squared error function, or rather the mean squared error (MSE), to evaluate the models. The MSE over a dataset (or batch) $\mathcal{B} = \{(x_i, y_i)\}_{i=1}^{n_b}$ is calculated according to

$$\text{MSE} = \frac{1}{n_b} \sum_{i=1}^{n_b} (\hat{y}_i - y_i)^2,$$

where $\hat{y}_i$ is the model prediction for observation $i$.
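
As a small worked example of the MSE over a batch, the sketch below averages the squared differences between predictions and targets. The function name `mse` and the three target/prediction pairs are made up for illustration and are not taken from the diabetes data.

```python
def mse(y_true, y_pred):
    """Mean squared error: the average squared prediction error over a batch."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n_b = len(y_true)
    return sum((y_hat - y) ** 2 for y, y_hat in zip(y_true, y_pred)) / n_b

# Illustrative values only.
y_true = [100.0, 150.0, 200.0]
y_pred = [110.0, 140.0, 230.0]
print(mse(y_true, y_pred))  # (10**2 + 10**2 + 30**2) / 3
```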

For the cross-validation, we can specify this using the scoring parameter of the cross_val_score function. If you read the documentation for the scoring parameter of this function, you will find that the closest available option is the negative MSE, which is easily converted to the standard MSE by flipping its sign.

You will start with the linear regression model. Finalize the code below and run 10-fold cross-validation for the linear regression model to obtain a vector of the MSE for the ten training/validation splits.
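
One way such a run might look is sketched here. It is not the course's own code template: the data preparation and the `random_state` of the split are assumptions, so the resulting numbers will differ from those obtained with the course's training set.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Assumed data preparation: drop "sex", keep 353 observations for training.
data = load_diabetes()
X = np.delete(data.data, list(data.feature_names).index("sex"), axis=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, data.target, train_size=353, random_state=0
)

# scikit-learn scorers follow a "higher is better" convention,
# so the MSE is reported with a negative sign.
neg_mse = cross_val_score(
    LinearRegression(), X_train, y_train, cv=10, scoring="neg_mean_squared_error"
)
mse_lr = -neg_mse  # one MSE per training/validation split

print(mse_lr.shape)
```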

The number of folds in cross-validation

Selecting the value of $k$ in k-fold cross-validation is a trade-off between computational costs and bias in the estimate of the expected new data error. A larger number of folds means more iterations of repeated training, something that can be practically infeasible if the dataset is large or if the model complexity is high. At the same time, as $k$ is increased, each intermediate training dataset is more closely related to the full dataset, and we can expect the bias to decrease. Although we use $k=10$ in this example, cross-validation with a larger number of folds can be performed in a relatively short time for the simple models considered.
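
To make the trade-off concrete, the sketch below counts how many model fits are needed and how many training points each fit sees for a few choices of $k$, using the 353 training observations of this exercise. The helper `fold_sizes` is a hypothetical name; it assumes folds of near-equal size, with any remainder spread over the first folds.

```python
def fold_sizes(n, k):
    """Sizes of k near-equal folds for n observations."""
    base, extra = divmod(n, k)
    return [base + 1 if i < extra else base for i in range(k)]

n = 353  # size of the training set used for cross-validation
for k in (2, 5, 10, n):
    sizes = fold_sizes(n, k)
    smallest_training_set = n - max(sizes)
    print(f"k={k:>3}: {k} model fits, "
          f"at least {smallest_training_set} training points per fit")
```

With $k=10$, each fit uses 317 or 318 of the 353 observations. With $k=353$ (leave-one-out), each fit uses 352 observations, but 353 fits are required.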


Question C: Expected New Data Error

Use the values obtained from the previous question to calculate the 10-fold cross-validation error as an estimate of the expected new data error of the linear regression model.


Question D: k-NN Model

Repeat the cross-validation process for the k-NN model with $k=17$ using the code below. Fill in the empty lines and run the code to estimate the expected new data error of this model.
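
A hedged sketch of the same process for the k-NN model, using scikit-learn's `KNeighborsRegressor`. As before, the data preparation and the split seed are assumptions, so the printed estimate will not match the one obtained with the course's training set.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Assumed data preparation: drop "sex", keep 353 observations for training.
data = load_diabetes()
X = np.delete(data.data, list(data.feature_names).index("sex"), axis=1)
X_train, _, y_train, _ = train_test_split(
    X, data.target, train_size=353, random_state=0
)

knn = KNeighborsRegressor(n_neighbors=17)
mse_knn = -cross_val_score(
    knn, X_train, y_train, cv=10, scoring="neg_mean_squared_error"
)

# The 10-fold cross-validation error is the mean of the ten hold-out errors.
e_k_fold_knn = mse_knn.mean()
print(e_k_fold_knn)
```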


Question E: Selecting the Best Model

Based on the cross-validation, which model do you think is more suitable for the task of predicting the progression of diabetes?


Question F: The Final Model

Let’s evaluate the chosen model on the test data. Fill in the missing lines in the code below to train the selected model on the full training dataset and calculate the MSE on the test data.
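
The final step might be sketched as follows. `LinearRegression` is used purely as a placeholder and is not a statement about which model cross-validation favours; substitute whichever model you selected. The data preparation and split seed are again assumptions.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Assumed data preparation: drop "sex", 353 training and 89 test observations.
data = load_diabetes()
X = np.delete(data.data, list(data.feature_names).index("sex"), axis=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, data.target, train_size=353, random_state=0
)

# Placeholder choice: replace with the model selected by cross-validation.
final_model = LinearRegression()
final_model.fit(X_train, y_train)      # train on the full training set
y_pred = final_model.predict(X_test)   # predict the held-out test points
test_mse = mean_squared_error(y_test, y_pred)
print(test_mse)
```

The test set is touched only once, after the model has been chosen, so that the test MSE remains an honest estimate of the error on new data.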


Conclusion

You have now successfully used cross-validation to select between a linear regression model and a k-NN model for predicting diabetes progression. It is important to note that just because one of the models was the best-performing for this particular case, it does not mean that it will always be the better choice. In all types of machine learning problems, we need to decide upon the best model on a case-by-case basis, depending on the data, the goal of the problem at hand, the available computational resources, and so on.

Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. The Annals of Statistics, 32(2), 407-499.

This webpage contains the course materials for the course ETE370 Foundations of Machine Learning.
The content is licensed under Creative Commons Attribution 4.0 International.
Copyright © 2021, Joel Oskarsson, Amanda Olmin & Fredrik Lindsten