FOUNDATIONS OF MACHINE LEARNING

Predicting the Quality of Wine


Introduction

Portugal is one of the top wine exporters in the world. Vinho verde (“green wine”) is a region in the northwest of Portugal that produces wines with a wide range of flavors. Cortez et al. (2009) published a study in which the quality of both red and white vinho verde wines was assessed, in order to determine which factors influence the quality of a wine. The wine quality was measured using a blind test, where each participant graded wines on a scale from 0 (very bad) to 10 (excellent).

In this example you will treat wine quality as a categorical variable and classify red vinho verde wines by quality, using the dataset from the article. Apart from the wine quality, the following variables are included in the dataset:

  • Fixed acidity [g/dm$^3$]
  • Volatile acidity [g/dm$^3$]
  • Citric acid [g/dm$^3$]
  • Residual sugar [g/dm$^3$]
  • Chlorides [g/dm$^3$]
  • Free sulfur dioxide [mg/dm$^3$]
  • Total sulfur dioxide [mg/dm$^3$]
  • Density [g/cm$^3$]
  • pH [0-14]
  • Sulphates [g/dm$^3$]
  • Alcohol [vol.%]

We retrieve the data from the UCI Machine Learning Repository (Dheeru & Graff, 2017).
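For reference, below is a minimal sketch of how the data could be loaded with pandas. The URL and column layout (semicolon-separated, with a quality column) follow the standard format of the UCI repository, but the exact loading code used in the course material may differ.

import numpy as np
import pandas as pd

# Assumed location and format of the red wine data in the UCI repository.
URL = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/"
    "wine-quality/winequality-red.csv"
)

wine = pd.read_csv(URL, sep=";")              # 1599 observations
X = wine.drop(columns="quality").to_numpy()   # the 11 input features listed above
y = wine["quality"].to_numpy()                # wine quality (3 to 8)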

Since the wine quality is on a scale from 0 to 10, there are a total of 11 classes. The table below shows the distribution over classes for the 1599 observations.

Quality 0 1 2 3 4 5 6 7 8 9 10
n$_{\text{class}}$ 0 0 0 10 53 681 638 199 18 0 0

Observing the table, we can see that no wine in the dataset has a grade in the intervals 0-2 or 9-10. Therefore, you will use only classes 3 to 8 in your model.


Question A: Training, Validation and Test Sets

You will start by examining and pre-processing the data. The first step will be to split the data into training, validation and test sets.

For this specific case, which one of the following is a reasonable data split?


Question B: Splitting the Data

Divide your data into training, validation and test sets with the following code. Use your answer from the previous question to assign values to the part_train and part_validation variables; both values should be given as decimal numbers in the interval [0, 1]. In addition, shuffle the data inside the split_data function to ensure that the data split is random. For this purpose, make use of the shuffle function from the scikit-learn library, and set the random_state parameter to 2 to make sure that the shuffling is the same over multiple runs.
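Below is one possible sketch of such a split_data function; the exact signature in the course code block may differ, and the 60/20/20 split in the example call is only an illustration, not the answer to Question A.

from sklearn.utils import shuffle

def split_data(X, y, part_train, part_validation, random_state=2):
    # Shuffle the data so that the split is random but reproducible.
    X, y = shuffle(X, y, random_state=random_state)

    n = X.shape[0]
    n_train = int(part_train * n)
    n_val = int(part_validation * n)

    # The first part becomes training data, the next part validation data,
    # and whatever remains becomes test data.
    X_train, y_train = X[:n_train], y[:n_train]
    X_val, y_val = X[n_train:n_train + n_val], y[n_train:n_train + n_val]
    X_test, y_test = X[n_train + n_val:], y[n_train + n_val:]
    return X_train, y_train, X_val, y_val, X_test, y_test

# Example call (illustrative proportions).
X_train, y_train, X_val, y_val, X_test, y_test = split_data(
    X, y, part_train=0.6, part_validation=0.2
)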


Missing Values

Next, you should run the code below to check whether there are any missing values in the input data. The check_missing_values function takes a data array as input and prints the positions of any missing values. More specifically, and based on how the data was loaded, the function looks for NaN entries in the data array.
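A sketch of what such a check could look like, assuming the data is stored as NumPy arrays (the course’s own check_missing_values may be implemented differently):

import numpy as np

def check_missing_values(X):
    # Find the (row, column) indices of all NaN entries.
    rows, cols = np.where(np.isnan(X))
    if rows.size == 0:
        print("No missing values found.")
    for i, j in zip(rows, cols):
        print(f"Missing value at row {i}, column {j}")

check_missing_values(X_train)
check_missing_values(X_val)
check_missing_values(X_test)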


Question C: Imputing Missing Values

If everything is correct so far, you should have found at least one missing value in the training dataset, but none in the validation and test sets. Next, you will impute (i.e. fill in) the missing values, in order to be able to use them as part of the training set when training the model.

For the imputation, it is reasonable to assume that wines of the same quality are more similar to each other than wines of different qualities. Therefore, impute each missing value using a class mean. That is to say, for each observation with a missing value, identify which class (wine quality) the wine belongs to. Then, impute the missing value with the mean of the corresponding feature, calculated over the data points in that class. Remember to use only the training data for the imputation. Handling missing values can be seen as part of the training, so using validation and test data for this purpose can result in misleading conclusions.

Remark: Imputing a missing value modifies the data array in place. To avoid potential errors that might arise from running the code block multiple times, we store the imputed data in an array X_train_imp, which is a copy of the training data, whereas X_train keeps the original data (with missing values).
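A minimal sketch of the class-mean imputation, assuming the NumPy arrays X_train and y_train from the earlier steps:

import numpy as np

# Work on a copy so that X_train keeps its original (missing) values.
X_train_imp = X_train.copy()

rows, cols = np.where(np.isnan(X_train_imp))
for i, j in zip(rows, cols):
    # All training observations with the same quality as observation i.
    same_class = (y_train == y_train[i])
    # Class mean of feature j; np.nanmean ignores the NaN entries themselves.
    X_train_imp[i, j] = np.nanmean(X_train_imp[same_class, j])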

How to best handle missing values depends on the context. In this case, you should have found one missing value, belonging to an observation from class 7. Since there was only one missing value and since class 7 is one of the larger classes, a viable alternative to imputing the value would have been to discard the observation entirely. However, had the missing value originated from one of the smaller classes, there would have been stronger motivation for imputing rather than discarding it. If there had been several missing values, or if values were missing in a systematic fashion, you might have wanted to investigate the issue further before deciding on a solution.


Question D: Outliers

You find an observation in your training dataset with a high wine quality but with feature values that are substantially different from those of other high-quality wines. You talk to an expert in the field, who believes it unlikely that a wine with those properties would receive such a high rating. What is the most reasonable option in this situation?


Question E: Should the Data be Normalized?

The table below shows the minimum and maximum values of each feature in the dataset.

Feature Min Max
Fixed acidity [g/dm$^3$] 4.60 15.9
Volatile acidity [g/dm$^3$] 0.120 1.58
Citric acid [g/dm$^3$] 0.00 1.00
Residual sugar [g/dm$^3$] 0.900 15.5
Chlorides [g/dm$^3$] 0.012 0.611
Free sulfur dioxide [mg/dm$^3$] 1.00 72.0
Total sulfur dioxide [mg/dm$^3$] 6.00 289
Density [g/cm$^3$] 0.990 1.00
pH [0-14] 2.74 4.01
Sulphates [g/dm$^3$] 0.33 2.0
Alcohol [vol.%] 8.4 14.9

Based on the values in the table, do you think that the data should be normalized?


Question F: Normalization

Since the input features differ considerably in range, it is probably a good idea to normalize the data. Use the MinMaxScaler from the scikit-learn library and the code block below to scale the training and validation input arrays. Remember to use the data array X_train_imp from above for the training data.
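A sketch of the scaling step, reusing the arrays from the previous steps; note that the scaler is fitted on the training data only and then applied to the validation data.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_norm = scaler.fit_transform(X_train_imp)  # fit on training data only
X_val_norm = scaler.transform(X_val)              # reuse the same scaling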


Question G: Evaluation Metric

With the aim to create a model that is good at classifying wines based on quality, select one of the following evaluation metrics:

  • Misclassification error
  • Mean squared error
  • Mean absolute error

Based on your choice, fill in the missing lines in the my_metric function such that it takes a target vector (y) and model predictions (y_hat) and returns the metric value. It is okay to assume that the function inputs will be one-dimensional.

Remark: Note that we are in a bit of a gray zone between classification and regression here. The wine quality takes only a few discrete values (3, 4, 5, 6, 7 or 8, to be precise), so we can treat this as a classification problem with six classes. Still, the wine quality is a numerical value (higher is better), so we could also view this as a regression problem. The bottom line is that all the suggested evaluation metrics make sense.
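If you go with the misclassification error, my_metric could look like the sketch below; mean squared error and mean absolute error would be implemented analogously, with np.mean((y - y_hat) ** 2) and np.mean(np.abs(y - y_hat)) respectively.

import numpy as np

def my_metric(y, y_hat):
    # Misclassification error: the fraction of predictions that differ
    # from the targets.
    y = np.asarray(y)
    y_hat = np.asarray(y_hat)
    return np.mean(y != y_hat)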


Question H: The Model

Train an MLP using the MLPClassifier from scikit-learn and evaluate it on the validation data, using the function implemented in the previous question. Change the parameters of your model and move on once you are happy with the model performance.

Remark: If you get a warning message when training your model about the algorithm not converging, try to increase the max_iter parameter from its default value.
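A sketch of the training and validation step is given below; the hidden layer sizes and other parameter values are only example choices, not a recommended configuration.

from sklearn.neural_network import MLPClassifier

model = MLPClassifier(
    hidden_layer_sizes=(50, 50),  # example architecture
    max_iter=1000,                # increased to help the solver converge
    random_state=2,
)
model.fit(X_train_norm, y_train)

# Evaluate on the (normalized) validation data with the metric from Question G.
y_hat_val = model.predict(X_val_norm)
print("Validation metric:", my_metric(y_val, y_hat_val))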


Question I: Expanding the Dataset

The article from which the red wine dataset originates also presented data on white wines from the vinho verde region. Assuming that white and red wine share some attributes, let us see if you can improve the performance of your model by adding the white wine data to the training set.

The white wine data consists of 4898 observations in total. However, five of the observations are from class 9. Since there are no class 9 wines in the red wine dataset, we discard these observations.

The code below loads the white wine dataset and merges it with your training data. Fill in the missing lines in the code to train the model using the extra data and evaluate it on the validation dataset.
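A sketch of how the extended training set could be used, assuming the course code has placed the white wine inputs and targets in arrays X_white and y_white (with the class 9 observations already removed); those array names are assumptions, not part of the original code.

import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler

# Concatenate the (imputed) red wine training data with the white wine data.
X_train_ext = np.concatenate([X_train_imp, X_white], axis=0)
y_train_ext = np.concatenate([y_train, y_white], axis=0)

# Re-fit the scaler on the extended training set and re-train the model.
scaler_ext = MinMaxScaler()
X_train_ext_norm = scaler_ext.fit_transform(X_train_ext)
X_val_norm_ext = scaler_ext.transform(X_val)

model_ext = MLPClassifier(hidden_layer_sizes=(50, 50), max_iter=1000,
                          random_state=2)
model_ext.fit(X_train_ext_norm, y_train_ext)
print("Validation metric:", my_metric(y_val, model_ext.predict(X_val_norm_ext)))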

Did the extra data improve the model’s performance? If you did not observe an improvement in performance, or if the improvement was slight, there are several possible explanations. The most obvious explanation is that white and red wines might not be that similar after all. On the other hand, if your model’s performance was not hurt by the additional training data, this indicates that the determinants of white wine quality cannot be completely different from those of red wine quality. Another possibility is that, in spite of being similar to the red wines, the white wine observations simply did not add much new information to the training data.

In this example, we simply concatenated the training datasets for red and white wines, without explicitly telling the model which data points came from which dataset. Can you think of a (perhaps better) way of combining the two datasets that would make the distinction between red and white wines explicit to the model?


Question J: Evaluation on Test Data

Depending on which model you chose (the one trained only on red wine data or the one trained also on white wine data), uncomment the appropriate lines of code below and run the code block to evaluate the final model on the test data.
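For reference, and reusing the names from the earlier sketches, the test evaluation could look as follows; keep only the lines corresponding to the model you chose.

# Model trained on red wine data only:
X_test_norm = scaler.transform(X_test)
print("Test metric:", my_metric(y_test, model.predict(X_test_norm)))

# ...or, if the model trained on the extended dataset was chosen:
# X_test_norm = scaler_ext.transform(X_test)
# print("Test metric:", my_metric(y_test, model_ext.predict(X_test_norm)))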


Question K: Baseline Model

To evaluate the performance of your classification model, you compare it with a baseline model where all wines are classified as belonging to the largest class in the training dataset, which happens to be class 6 (this is true regardless of whether you use the original or the expanded dataset). Based on the approximate class distribution of the test dataset, given in the table below, what would be the misclassification error of the baseline model? Give your answer as a decimal with three decimal places.

Quality 0 1 2 3 4 5 6 7 8 9 10
Proportion 0 0 0 0.00623 0.0498 0.474 0.324 0.134 0.0125 0 0


The misclassification error of the baseline model is:

If you used misclassification error as your evaluation metric, is your model better than the baseline model? Having a baseline model to compare with gives an idea of how well your model is performing. If your final model performs better than a model that always predicts the most common class, you at least have some indication that your model has learned something of practical use. If the model performs worse than the baseline, on the other hand, this is an indication that the model needs to be improved before it is useful in practice. A note of caution, however, as also touched upon in section 4: accuracy or misclassification error might not always tell the full story, for instance in the case of imbalanced classes.
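As a side note, such a baseline can also be constructed directly with scikit-learn’s DummyClassifier, which always predicts the most frequent class seen during training. The sketch below reuses names from the earlier code blocks.

from sklearn.dummy import DummyClassifier

# Baseline that always predicts the most common class in the training data.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_norm, y_train)

y_hat_base = baseline.predict(X_test_norm)
print("Baseline misclassification error:", my_metric(y_test, y_hat_base))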


Conclusion

That’s it, well done! You have gone through a full procedure of building a machine learning model, from inspecting and curating the data to training and evaluating your model. As mentioned previously in the course, the process of building a machine learning model is often an iterative one. In this case, the performance of the final model might not be satisfactory (even if it might be better than the simple baseline model). As a result, there might be reason to go back to the initial stages of pre-processing data or to reconsider the model selection.

Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4), 547-553.

Dheeru, D., & Graff, C. (2017). UCI Machine Learning Repository. http://archive.ics.uci.edu/ml.
