Predict Red Wine Quality with SVC, Decision Tree and Random Forest

A Machine Learning Project with Python Code

Red Wine

Table of Contents:

  1. Data Wrangling
  2. Data Exploration
  3. Guiding Question
  4. Prepare the Data for Classification Model
  5. Modeling: Baseline Classification, SVC, Decision Tree, and Random Forest
  6. Feature Importance
  7. Conclusion

Dataset:

df.info()

Output:

Dataset Info

Below are the first five rows of the dataset

df.head()

Output:

First Five Rows

Data Wrangling:

df.isnull().sum()

Output:

fixed acidity           0 
volatile acidity 0
citric acid 0
residual sugar 0
chlorides 0
free sulfur dioxide 0
total sulfur dioxide 0
density 0
pH 0
sulphates 0
alcohol 0
quality 0
dtype: int64

No missing Value! We are good to go!

Data Exploration:

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(30,20))
corr = df.corr()
sns.heatmap(corr, annot=True, cmap=sns.diverging_palette(200, 10, as_cmap=True))
plt.show()

Output:

Correlation Matrix
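
Each cell of the matrix is a Pearson correlation coefficient, which is what df.corr() computes by default. As a rough sketch of what that number measures, the function below computes it by hand. The function name and the sample values are made up for illustration and are not rows from the wine dataset.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Covariance and variances, all left unnormalized (the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up values for illustration only.
alcohol = [9.4, 9.8, 10.5, 11.2, 12.0, 12.8]
quality = [5, 5, 6, 6, 7, 8]
r = pearson(alcohol, quality)
print(round(r, 3))
```

A value close to 1 means the two columns rise together, a value close to -1 means one falls as the other rises, and a value near 0 means no linear relationship.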

From the graph, we can see that alcohol is most strongly correlated with quality, and the correlation is positive. Let’s dive deeper to see the variations in alcohol levels for wines of different qualities using a bar graph.

plt.bar(df['quality'], df['alcohol'])
plt.title('Relationship between alcohol and quality')
plt.xlabel('quality')
plt.ylabel('alcohol')
# plt.legend() is dropped: the bars carry no labels, so there is nothing to show in a legend
plt.show()

Output:

Bar Graph

Immediately, we can see that wines of lower quality tend to have lower alcohol levels. However, correlation does not imply causation. Thus, I want to investigate further and identify the top 3 properties that matter most for a good (high-quality) wine. This investigation calls for classification models: a model that learns to label wines as “good” or “regular” also shows which properties it relies on most.

Guiding Question: Build classification models that can predict wine quality and determine the top 3 important properties that can make a wine good.

Prepare the Data for Classification Model:

I first normalize the dataset. Normalization rescales every feature to a common range. This matters here because the features sit on very different scales: citric acid and volatile acidity, for example, have values mostly between 0 and 1, whereas total sulfur dioxide has some values over 100 and some below 10. This disparity can cause a problem because features with large ranges may dominate features with small ranges in scale-sensitive models such as SVC. To address this, I rescale every column to the range between 0 and 1.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
normal_df = scaler.fit_transform(df)
normal_df = pd.DataFrame(normal_df, columns=df.columns)
print(normal_df.head())

Output:

Normalized Dataset
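
To make the rescaling concrete, here is a minimal sketch of the min-max formula x' = (x - min) / (max - min) applied to one column by hand. The helper name and the sample values are assumptions for illustration; in the project itself MinMaxScaler does this for every column.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Made-up total sulfur dioxide readings, for illustration only.
total_sulfur_dioxide = [6.0, 34.0, 67.0, 150.0]
scaled = min_max_scale(total_sulfur_dioxide)
print(scaled)
```

The smallest value always maps to 0 and the largest to 1, so a column with values in the hundreds ends up on the same scale as one that already sits between 0 and 1.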

Next, to make the classification outcomes a bit more direct, I create a new column named “good wine” in the original dataset df. “Good wine” equals “yes” when the quality is 7 or above, and “no” when the quality is less than 7. The classification models will output “yes” or “no” as their prediction of wine quality.

df["good wine"] = ["yes" if i >= 7 else "no" for i in df['quality']]

Create features X and target variable y. X is all the features from the normalized dataset except “quality”. y is the newly created “good wine” variable from the original dataset df.

X = normal_df.drop(["quality"], axis = 1)
y = df["good wine"]

Finally, I want to make sure that enough “good wine” samples exist in y.

y.value_counts()

Output:

no     1382 
yes 217
Name: good wine, dtype: int64

Visualize the counts

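# pass y by keyword: recent seaborn versions no longer accept it as the first positional argument
sns.countplot(x=y)
plt.show()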

Output:

Count plot

The result is a bit imbalanced but acceptable: we have over 200 good wines. When we partition the data into a training set and a testing set, don’t forget to use stratify=y so that both sets have the same proportion of “yes” and “no” as the original dataset.
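
As a sketch of what stratification does, the function below splits each class separately so that both sets keep the original class ratio. This is a simplified stand-in written for illustration, not scikit-learn's implementation; the function name is hypothetical, and only the class counts (1382 and 217) come from the output above.

```python
import random
from collections import Counter

def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx), sampling each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

labels = ["no"] * 1382 + ["yes"] * 217
train_idx, test_idx = stratified_split(labels, test_size=0.3, seed=2020)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

With a 30% test size, roughly 30% of the “yes” wines and 30% of the “no” wines land in the test set, so the minority class cannot be left out of either set by chance.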

Modeling:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2020, stratify=y)

Baseline Classification:

I use DummyClassifier from sklearn and choose the strategy as “most frequent,” which means the model will always predict the most frequent label in the training set. In other words, this model will always output “no” as its prediction. The accuracy score is 0.86, which is just the proportion of the “no” label in the data.

from sklearn.dummy import DummyClassifier

dummy_classifier = DummyClassifier(strategy='most_frequent', random_state=2020)
dummy_classifier.fit(X_train, y_train)
acc_baseline = dummy_classifier.score(X_test, y_test)
print("Baseline Accuracy = ", acc_baseline)

Output:

Baseline Accuracy =  0.8645833333333334
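
The baseline number can be reproduced with simple arithmetic. The sketch below assumes a test set of 480 wines with 415 “no” labels; these counts are my assumption, chosen because they are consistent with the class counts and the 30% stratified split above, and they reproduce the reported 0.8646.

```python
from collections import Counter

def most_frequent_baseline(train_labels, test_labels):
    """Always predict the majority class of the training labels."""
    majority = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for label in test_labels if label == majority)
    return majority, correct / len(test_labels)

# Assumed split sizes (see the note above), not values printed by the article's code.
train_labels = ["no"] * 967 + ["yes"] * 152
test_labels = ["no"] * 415 + ["yes"] * 65
majority, baseline_acc = most_frequent_baseline(train_labels, test_labels)
print(majority, baseline_acc)
```

Any model worth keeping has to beat this number, because it can be reached without looking at a single feature.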

Model One: Support Vector Classifier

“SVM is a supervised machine learning algorithm that is powerful for classification problems. It relies on a technique named kernel to transform the data, and based on the transformation, it finds an optimal way to separate the data according to the labels.”

1. Fit, Predict, and Accuracy Score:

from sklearn.svm import SVC

svc = SVC(random_state=2020)
svc.fit(X_train, y_train)

Next, predict the outcomes for the test set and print its accuracy score.

from sklearn import metrics
from sklearn.metrics import accuracy_score

y_pred = svc.predict(X_test)
print("SVM Accuracy = ", metrics.accuracy_score(y_test, y_pred))

Output:

SVM Accuracy =  0.8854166666666666

The accuracy score of the SVM model (0.885) is higher than the baseline accuracy (0.86).

2. Overfitting:

Since the wine dataset is relatively small (only 1599 entries), I use cross validation to check for overfitting instead of setting aside a large part of the data for validation and testing.

Cross Validation (CV) estimates the generalized performance using the same data as is used to train the model. The idea behind cross-validation is to split the dataset up into a certain number of subsets, and then use each of these subsets as held out test sets in turn while using the rest of the data to train the model. (Source: Stack Exchange)
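
The splitting idea can be sketched in a few lines. The function below is a plain, unshuffled k-fold written for illustration (the name is hypothetical). Note that for a classifier, cross_val_score with an integer cv uses a stratified variant, so the real folds also preserve the class ratio.

```python
def k_fold_indices(n_samples, k):
    """Yield k (train_idx, test_idx) pairs; every sample is tested exactly once."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train_idx, test_idx))
        start += size
    return folds

folds = k_fold_indices(1599, 5)
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

The model is trained and scored once per fold, and the mean of the five scores is the cross validation score reported in this article.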

Use cross_val_score function to estimate the expected accuracy of the model on out-of-training data.

from sklearn.model_selection import cross_val_score

scores = cross_val_score(svc, X, y, cv=5)
print("Cross Validation Score: ", scores.mean())

Output:

Cross Validation Score:  0.8642927115987462

Accuracy score on training data

y_pred_train = svc.predict(X_train)
print("Training Accuracy: ", metrics.accuracy_score(y_train, y_pred_train))

Output:

Training Accuracy:  0.8927613941018767

Since the training accuracy (0.89) and the cross validation score (0.86) are close, there is no sign of serious overfitting.

3. Tune the Model’s Parameters:

  • C: regularization parameter
  • kernel: ‘linear,’ ‘poly,’ ‘rbf.’

First, use RandomizedSearchCV to try out a wide range of values [0.001,0.01,0.1,1,10,100,1000] for C. RandomizedSearchCV allows us to narrow down the range for C.

from sklearn.model_selection import RandomizedSearchCV

random_grid = {"C": [0.001,0.01,0.1,1,10,100,1000]}
# the grid has only 7 values, fewer than the default n_iter=10, so every value is tried
svc_random = RandomizedSearchCV(svc, random_grid, cv=5, random_state=2020)
svc_random.fit(X_train, y_train)
print(svc_random.best_params_)

Output:

{'C': 1}

Now we have determined that C should be a value around 1. Next, use GridSearchCV to do an exhaustive search on C within a range around 1, specifically from 0.8 to 1.4 inclusive, together with three kernel choices.

from sklearn.model_selection import GridSearchCV

param_dist = {'C': [0.8,0.9,1,1.1,1.2,1.3,1.4],
              'kernel':['linear', 'rbf','poly']}
svc_cv = GridSearchCV(svc, param_dist, cv=10)
svc_cv.fit(X_train, y_train)
print(svc_cv.best_params_)

Output:

{'C': 1.3, 'kernel': 'rbf'}

GridSearchCV helps us find the best parameters: C = 1.3, kernel = rbf. Let’s use the best parameters given to train a new SVM model.

svc_new = SVC(C = 1.3, kernel = "rbf", random_state = 2020)
svc_new.fit(X_train, y_train)
y_pred_new = svc_new.predict(X_test)
print("New SVM accuracy = ", metrics.accuracy_score(y_test, y_pred_new))

Output:

New SVM accuracy =  0.89375

By tuning the hyperparameters, the test accuracy of the SVM model increases from 0.885 to 0.894.
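
To get a feel for the cost of the exhaustive search used above, the sketch below enumerates every parameter combination in the same grid and counts the model fits for 10-fold cross validation. It only illustrates the bookkeeping and does not train anything; the variable names are my own.

```python
from itertools import product

param_grid = {
    "C": [0.8, 0.9, 1, 1.1, 1.2, 1.3, 1.4],
    "kernel": ["linear", "rbf", "poly"],
}
names = list(param_grid)
# One candidate per element of the Cartesian product of all value lists.
candidates = [dict(zip(names, combo)) for combo in product(*(param_grid[n] for n in names))]
n_folds = 10
n_fits = len(candidates) * n_folds
print(len(candidates), "candidates,", n_fits, "fits")
```

The count grows multiplicatively with every added hyperparameter, which is why a coarse randomized search is a reasonable first step before a fine grid.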

Model Two: Decision Tree

“Decision tree is a classification model in the form of a tree structure. It is built through a process known as binary recursive partitioning. A decision tree splits the data into partitions, and then splits it up further on each of the branches.”
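
As a rough sketch of how a single split is chosen, the code below scores every candidate threshold on one feature by the weighted Gini impurity of the two resulting partitions and keeps the best one. It is a simplification for illustration: the function names and the six sample wines are made up, and real implementations differ in details such as placing thresholds midway between observed values.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, larger for a mixed one."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best split 'value <= threshold'."""
    best = (None, float("inf"))
    for threshold in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Made-up alcohol levels and labels, for illustration only.
alcohol = [9.4, 9.8, 10.0, 11.2, 12.0, 12.8]
labels = ["no", "no", "no", "yes", "yes", "yes"]
threshold, impurity = best_threshold(alcohol, labels)
print(threshold, impurity)
```

A full tree repeats this search over all features and then recurses into each partition, which is the “binary recursive” part of the description.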

1. Fit, Predict, and Accuracy Score:

from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=2020)
dt.fit(X_train, y_train)

Next, predict the outcomes for the test set, plot the confusion matrix, and print the accuracy score.

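from sklearn.metrics import ConfusionMatrixDisplay

y_pred = dt.predict(X_test)
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
ConfusionMatrixDisplay.from_estimator(dt, X_test, y_test)
plt.show()
print("Decision Tree Accuracy = ", metrics.accuracy_score(y_test, y_pred))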

Output:

Confusion Matrix
Decision Tree Accuracy =  0.88125

The decision tree model (0.881) performs worse than the tuned SVM model (0.894) but better than the baseline classification (0.86).

2. Overfitting:

from sklearn import tree

plt.figure(figsize=(40,20))
fn = list(X.columns)      # feature names as a plain list
cn = list(dt.classes_)    # class names in the order the fitted tree uses
tree.plot_tree(dt, feature_names=fn, class_names=cn, filled=True)
plt.show()

Output:

Decision Tree

From the graph, we can see that the decision tree is overfitting, since it branches exhaustively on the training set. We can confirm this by comparing the accuracy score on the training set with the score on out-of-training data, using the same method as for the SVM model.

scores = cross_val_score(dt, X, y, cv=5)
print("Cross Validation Score: ", scores.mean())

Output:

Cross Validation Score:  0.8054917711598746

Accuracy score on training data

y_pred_train = dt.predict(X_train)
print("Training Accuracy:", metrics.accuracy_score(y_train, y_pred_train))

Output:

Training Accuracy: 1.0

The cross validation score is only 0.81, which is even lower than the baseline accuracy (0.86), while the training accuracy is 1.0, which means the model predicts every training sample perfectly. The decision tree model is clearly overfitting.

To address overfitting, I decide to tune hyperparameters such as max_depth, max_features, and criterion by using GridSearchCV.

3. Tune the Model’s Parameters:

param_dist = {"max_depth": range(1,6),
              "max_features": range(1,10),
              "criterion": ["gini", "entropy"]}
dt_cv = GridSearchCV(dt, param_dist, cv=5)
dt_cv.fit(X_train, y_train)
print(dt_cv.best_params_)

Output:

{'criterion': 'gini', 'max_depth': 2, 'max_features': 8}

Fit a new decision tree model using the best parameters given above. The tree is now limited to a max depth of 2 and at most 8 features per split.

dt_new = DecisionTreeClassifier(criterion = "gini",
                                max_depth = 2,
                                max_features = 8,
                                random_state = 2020)
dt_new.fit(X_train, y_train)
y_pred_new = dt_new.predict(X_test)
print("New Decision Tree Accuracy: ", metrics.accuracy_score(y_test, y_pred_new))
scores = cross_val_score(dt_new, X, y, cv=5)
print("New Cross Validation Score: ", scores.mean())

Output:

New Decision Tree Accuracy:  0.8854166666666666
New Cross Validation Score:  0.8786794670846394

After tuning, the test accuracy of the decision tree model increases from 0.881 to 0.885, and the cross validation score increases from 0.81 to 0.88. More importantly, the overfitting issue is addressed by limiting two hyperparameters (max depth and max features). Let’s visualize the new decision tree.

plt.figure(figsize=(40,20))
tree.plot_tree(dt_new, feature_names=fn, class_names=cn, filled=True)
plt.show()

Output:

New Decision Tree

From the graph above, we can see that the three features that appear are alcohol, volatile acidity, and sulphates. This hints at the top three important properties that can make a wine good.

Model Three: Random Forest

“Random Forest is an ensemble method that constructs a number of decision trees at training time and outputs the class that is the mode of the classes predicted by the individual trees.”
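
The “mode of the classes” idea can be sketched as a majority vote over the predictions of the individual trees. The function name and the three toy trees below are made up for illustration. Note also that scikit-learn's RandomForestClassifier combines trees by averaging their predicted class probabilities rather than by a hard vote, so this is the textbook version, not the library's exact rule.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Combine per-tree predictions into one prediction per sample."""
    n_samples = len(tree_predictions[0])
    combined = []
    for i in range(n_samples):
        votes = Counter(tree[i] for tree in tree_predictions)
        combined.append(votes.most_common(1)[0][0])
    return combined

# Each inner list holds one tree's predictions for the same three wines (toy values).
tree_predictions = [
    ["no", "yes", "no"],
    ["no", "yes", "yes"],
    ["yes", "yes", "no"],
]
forest_prediction = majority_vote(tree_predictions)
print(forest_prediction)
```

Because each tree sees a different bootstrap sample and a different subset of features, their individual mistakes tend to differ, and the vote averages some of them out.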

1. Fit, Predict, and Accuracy Score:

from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(random_state = 2020)
rf_model.fit(X_train, y_train)

Next, predict the outcomes for the test set and print its accuracy score.

y_pred_rf = rf_model.predict(X_test)
acc_rf = accuracy_score(y_test, y_pred_rf)
print('Accuracy = ', acc_rf)

Output:

Accuracy =  0.9166666666666666

The random forest model (0.92) performs best of the three models, and it is well above the baseline (0.86).

2. Overfitting:

Use cross_val_score function to estimate the expected accuracy of the model on out-of-training data.

scores = cross_val_score(rf_model, X, y, cv=5)
print("Cross Validation Score: ", scores.mean())

Output:

Cross Validation Score:  0.8680466300940439

Accuracy score on training data:

y_pred_train = rf_model.predict(X_train)
print("Training Accuracy =", metrics.accuracy_score(y_train, y_pred_train))

Output:

Training Accuracy = 1.0

Even though random forest corrects for the decision tree’s habit of overfitting (to some extent), the disparity between the cross validation score and the training accuracy here indicates that our random forest model is still overfitting a bit. As with the decision tree, we can tune hyperparameters such as max_depth and n_estimators by using GridSearchCV to address overfitting.
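
To compare the three untuned models side by side, the sketch below computes the gap between training accuracy and cross validation score from the numbers reported earlier in this article. A larger gap suggests more overfitting; the gap is a rule of thumb of mine, not a formal test.

```python
# (training accuracy, cross validation score) as printed in the outputs above
reported = {
    "SVC": (0.8927613941018767, 0.8642927115987462),
    "Decision Tree": (1.0, 0.8054917711598746),
    "Random Forest": (1.0, 0.8680466300940439),
}
gaps = {name: train - cv for name, (train, cv) in reported.items()}
# Print the models from largest to smallest gap.
for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: gap = {gap:.3f}")
worst = max(gaps, key=gaps.get)
```

The untuned decision tree has the largest gap, the random forest sits in between, and the SVC shows almost none.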

3. Tune the Model’s Parameters:

print(rf_model.get_params())

Output:

{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': 2020, 'verbose': 0, 'warm_start': False}

Try to tune the following hyperparameters:

  • n_estimators: the number of trees in the forest.
  • max_depth: the maximum depth of each tree.

First, use RandomizedSearchCV to try out a wide range of values for n_estimators and max_depth. I restrict max_depth to between 1 and 15 (inclusive) and n_estimators to between 100 and 600 (inclusive) to simplify the model and reduce the overfitting problem.

random_grid = {'max_depth': [1, 5, 10, 15],
               'n_estimators': [100,200,300,400,500,600]}
# only 4 x 6 = 24 combinations exist, so n_iter = 50 ends up trying all of them
rf_random = RandomizedSearchCV(rf_model, random_grid, n_iter = 50, cv = 5, random_state = 2020)
rf_random.fit(X_train, y_train)
print(rf_random.best_params_)

Output:

{'n_estimators': 300, 'max_depth': 10}

RandomizedSearchCV allows us to narrow down the range by telling us that n_estimators should be a value around 300 and max_depth should be a value around 10. Continue to use GridSearchCV to do an exhaustive search on n_estimators within a more specific range [280,480] and on max_depth within a more specific range [7,15].

param_dist = {"max_depth":[7,8,9,10,11,12,13,14,15],
              "n_estimators":[280,300,320,350,380,400,420,450,480]}
rf_cv = GridSearchCV(rf_model, param_dist, cv=5)
rf_cv.fit(X_train, y_train)
print(rf_cv.best_params_)

Output:

{'max_depth': 14, 'n_estimators': 450}

Fit a new random forest model using the best parameters given above.

rf_new = RandomForestClassifier(n_estimators = 450, max_depth = 14, random_state = 2020)
rf_new.fit(X_train, y_train)
y_pred_rf = rf_new.predict(X_test)
acc_rf = accuracy_score(y_test, y_pred_rf)
print('New Random Forest Accuracy =', acc_rf)
scores = cross_val_score(rf_new, X, y, cv=5)
print("New Cross Validation Score = ", scores.mean())

Output:

New Random Forest Accuracy = 0.9166666666666666
New Cross Validation Score =  0.868669670846395

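After tuning the hyperparameters n_estimators and max_depth, the performance of the random forest model remains almost unchanged: the test accuracy stays at 0.917 and the cross validation score moves only from 0.868 to 0.869. However, by increasing n_estimators to 450 and capping max_depth at 14 (previously unlimited), we constrain the trees, which should relieve the overfitting problem somewhat.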

Final Model Decision:

  • The final accuracy on testing dataset of the SVM model is 0.894.
  • The final accuracy on testing dataset of the decision tree model is 0.885.
  • The final accuracy on testing dataset of the random forest model is 0.917.

Since the random forest model has the highest accuracy, I choose the random forest model as our final model to use.

Feature Importance:

imp_rf = pd.DataFrame(zip(X_train.columns, rf_model.feature_importances_), columns = ["feature", "importance"])
imp_rf.set_index("feature", inplace=True)
imp_rf.sort_values(by = "importance", ascending = False, inplace = True)
imp_rf.head()

Output:

Feature Importance

As expected, the top 3 important properties are alcohol, volatile acidity, and sulphates. Let’s further create a horizontal bar graph to visualize the feature importances.

imp_rf.plot.barh(figsize=(10,10))
plt.show()

Output:

Feature Importance Graph

The graph shows more intuitively that alcohol, volatile acidity, and sulphates weigh more than the other features in predicting wine quality.

Conclusion:

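Compare the average alcohol level of good wine and regular wine

import numpy as np

print("Good Wine = ", np.average(df[df["good wine"] == "yes"].alcohol))
print("Regular Wine = ", np.average(df[df["good wine"] == "no"].alcohol))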

Output:

Good Wine =  11.518049155145931
Regular Wine =  10.251037144235408

Thus, wines of good quality have higher levels of alcohol on average.

Compare the average sulphates level of good wine and regular wine

print("Good Wine = ", np.average(df[df["good wine"] == "yes"].sulphates))
print("Regular Wine = ", np.average(df[df["good wine"] == "no"].sulphates))

Output:

Good Wine =  0.7434562211981566
Regular Wine =  0.6447539797395079

Thus, wines of good quality have higher levels of sulphates on average.

Compare the average volatile acidity of good wine and regular wine

df_good = df[df["good wine"] == "yes"]
df_bad = df[df["good wine"] == "no"]
print("Good Wine =", np.average(df_good["volatile acidity"]))
print("Regular Wine =", np.average(df_bad["volatile acidity"]))

Output:

Good Wine = 0.4055299539170507
Regular Wine = 0.5470224312590448

Thus, wines of good quality have lower levels of volatile acidity on average.

In conclusion, alcohol, volatile acidity, and sulphates are the three most important properties that can make a wine good. On average, good-quality wines have higher levels of alcohol, lower levels of volatile acidity, and higher levels of sulphates.

Thanks for reading! 💗

By Jingyi Fang

USC Sophomore Majoring in CSBA