
ESE417 Final Project Report

1 Introduction

1.1 Background

In ESE 417 this semester, we have learned the theory and practice of machine learning and pattern classification. Topics of this course cover several important supervised and unsupervised machine learning models and methods, including linear models for regression and classification, the Perceptron, logistic regression, Bayesian learning methods, and more. In this final project, our group will use Python to apply the learned models to solve a pattern classification problem. We will design, implement, and test classification algorithms to achieve the best classification performance on the given data set.

1.2 Dataset description

1.2.1 Dataset Intro

This dataset is related to two variants of the Portuguese "Vinho Verde" wine (red and white). Due to privacy and logistic issues, only physicochemical (input) and sensory (output) variables are available (e.g., there is no data about grape types, wine brand, wine selling price, etc.). The data set used here is the red wine quality data set from the UCI Machine Learning Repository.

Table 1.1

Name                    Type     Example
Fixed acidity           Float    7
Volatile acidity        Float    0.27
Citric acid             Float    0.36
Residual sugar          Float    20.7
Chlorides               Float    0.045
Free sulfur dioxide     Float    45
Total sulfur dioxide    Float    170
Density                 Float    1.001
pH                      Float    3
Sulphates               Float    0.45
Alcohol                 Float    8.8
Quality (0-10)          Int      6

1.2.2 Dataset Overview

Fortunately, in the data cleaning step we found no N/A (not available) values in the data set. Besides, all the numeric data are dense and fall within reasonable and acceptable intervals. See Table 1.2.

1.3 Goals

We intend to classify the quality of each wine from its decisive metrics. During model preparation, our group will prioritize all the determinants, which can be grouped into three categories. Our group will perform taxonomic analysis in the following sections to help find internal patterns and relationships.

1.4 Method

In the following modeling, we would like to employ three methods: Random Forest, Naive Bayes, and Artificial Neural Network.

Random forest is an ensemble learning method for classification, regression, and other tasks that operates by building a large number of decision trees at training time. For classification tasks, the output of the random forest is the class chosen by the majority of the trees. For regression tasks, the output is the mean prediction of the individual trees. Random decision forests correct decision trees' habit of overfitting to their training set [1].

Naive Bayes is a classification algorithm for binary (two-class) and multiclass classification problems. It is called Naive Bayes or idiot Bayes because the calculations of the probabilities for each class are simplified to make their calculations tractable. Rather than attempting to calculate the probabilities of each attribute value, they are assumed to be conditionally independent given the class value [2].

An Artificial Neural Network (ANN) is an information processing paradigm inspired by the biological brain. ANNs, like people, learn by example. An ANN is configured for a specific application, such as pattern recognition or data classification, through a learning process. Learning largely involves adjustments to the synaptic connections that exist between the neurons [3].

2 Methods

2.1 Random Forest Algorithm

2.1.1 Review on random forest algorithm

Random Forest further reduces the variance by adding independence to the committee of decision trees. There are three levels of randomness (see the sketch at the end of this subsection):

Sampling of the training data with replacement

Selecting split points at random

Selecting features at random

# Advantages

# random forest (RF) works for models with high variance but low bias

# Better for nonlinear estimators

# RF works for very high-dimensional data, and there is no need to do feature selection since RF provides the feature importances

# Disadvantages

# Prone to overfitting when the samples are large and very noisy

# Slower computation compared to a single tree
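As a rough sketch, these sources of randomness map onto RandomForestClassifier arguments in scikit-learn; the parameter values below are illustrative, not the settings used later in this report.

    from sklearn.ensemble import RandomForestClassifier

    # Illustrative mapping of the randomness levels to scikit-learn arguments.
    rf = RandomForestClassifier(
        n_estimators=100,     # size of the tree committee
        bootstrap=True,       # sample the training data with replacement
        max_features="sqrt",  # consider a random subset of features at each split
        random_state=0,
    )
    # Note: RandomForestClassifier still searches for the best threshold among the
    # randomly chosen features; fully random split points correspond to
    # ExtraTreesClassifier instead.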

2.2 Naive Bayes Algorithms

2.2.1 Naive Bayes Algorithm

The Naive Bayes algorithm is a conditional probability model: it calculates the posterior probability from the prior, the likelihood, and the evidence. The prior is the probability of the class variable, the likelihood is the probability of the features given the class label, and the evidence is the probability of the features.

One important condition of the Naive Bayes algorithm is that all features must be mutually independent. Therefore, the posterior probability of a class variable C_k given features x_1, ..., x_n is proportional to P(C_k) · P(x_1 | C_k) · P(x_2 | C_k) · ... · P(x_n | C_k).

Overall, the Naive Bayes algorithm assigns class labels with the maximum posterior [4].

2.2.2 Implementation

After splitting the dataset into a training set and a test set, we fit the Naive Bayes models in the scikit-learn package to the training set. A total of four Naive Bayes methods were used: "GaussianNB", "MultinomialNB", "ComplementNB", and "BernoulliNB". We implemented each method in the same way and tested each method on the same test set.
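A minimal sketch of this step is shown below; the variable names X (the eleven physicochemical features) and y (the quality labels) are ours and are assumed to be loaded already, and the 0.2 test fraction matches the 80/20 split described in the results.

    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB, MultinomialNB, ComplementNB, BernoulliNB

    # Assumed: X holds the eleven features, y holds the quality labels.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    models = {
        "GaussianNB": GaussianNB(),
        "MultinomialNB": MultinomialNB(),
        "ComplementNB": ComplementNB(),
        "BernoulliNB": BernoulliNB(),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)               # fit on the same training set
        print(name, model.score(X_test, y_test))  # evaluate on the same test set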

2.3 Artificial Neural Network

2.3.1 ANN Algorithm

An Artificial Neural Network uses multi-layer perceptrons to weight each feature value and activation functions to classify the class labels.

Figure 2.1. Example structure of ANN model

The input layer transmits the input feature values to the hidden layer through weights. The number of nodes in the input layer should be the same as the feature dimension. The hidden layer uses activation functions to scale the values. There can be multiple hidden layers, and the number of layers is chosen by comparing prediction results for different depths. The last hidden layer transmits its results to the output layer, and the number of output nodes should be equal to the number of class labels. The ANN model is trained with backpropagation.

2.3.2 Implementation

The multi-layer perceptron method in the scikit-learn package has parameters for the activation function, the solver, the hidden layer size, alpha, max_iter, and the learning rate. To obtain the optimal result, we tuned some of these parameters. There are four activation functions: 'identity', 'logistic', 'tanh', 'relu'. The solvers include 'lbfgs', 'sgd', and 'adam'. Hidden layer sizes range from 1 to 100, and max_iter is tuned with values of 100 and 200. For each parameter, we set up a loop so that it is combined with every value of the other parameters; testing the model with all of the combinations reveals the best result.
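A simplified sketch of this tuning loop is given below, assuming the same train/test split as in the Naive Bayes implementation; the fixed max_iter value and random_state are our assumptions, and the full search over every parameter value would take considerably longer.

    from sklearn.neural_network import MLPClassifier

    # Assumed: X_train, X_test, y_train, y_test come from the same 80/20 split.
    best_params, best_score = None, 0.0
    for activation in ["identity", "logistic", "tanh", "relu"]:
        for solver in ["lbfgs", "sgd", "adam"]:
            for size in range(1, 101):                  # hidden layer size from 1 to 100
                clf = MLPClassifier(hidden_layer_sizes=(size,), activation=activation,
                                    solver=solver, max_iter=200, random_state=0)
                clf.fit(X_train, y_train)
                score = clf.score(X_test, y_test)
                if score > best_score:
                    best_params, best_score = (activation, solver, size), score
    print(best_params, best_score)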

3 Results and Analysis

3.1 Random Forest Algorithm

Random Forest is a classifier that uses multiple trees to train on and predict samples. This classifier was first proposed by Leo Breiman and Adele Cutler. Our group chose this algorithm as one of the methods for this group project.

3.1.2 Single Decision Tree and Cross Validation

Firstly, we applied a simple decision tree to the wine quality dataset. As we learned in this class, if a model is trained and tested on the same samples, the model will score high but perform poorly on new data, a condition known as overfitting. To avoid overfitting, a common approach when performing supervised machine learning "experiments" is to keep a portion of the existing dataset as a test set.

When applying cross-validation, it is no longer necessary to divide out a separate validation set, and the test set is always used for the final evaluation of the model. The most basic cross-validation method is K-fold cross-validation, which divides the training set into k smaller subsets. The easiest way is to apply cross_val_score, a useful function in the scikit-learn package, to the dataset.
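A short sketch of this step is shown below, assuming the features and labels are stored in X and y; the choice of 5 folds is our assumption.

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    tree = DecisionTreeClassifier(random_state=0)
    # 5-fold cross-validation on the full dataset; only accuracy is shown here.
    scores = cross_val_score(tree, X, y, cv=5, scoring="accuracy")
    print("Accuracy: %.4f (+/- %.4f)" % (scores.mean(), scores.std()))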

The single decision tree provides the following results on the wine quality dataset.

# Accuracy: 0.4684 (+/- 0.0711)

# Precision: 0.4716 (+/- 0.0560)

# F1 score: 0.4828 (+/- 0.0561)

As shown, the classification result of the decision tree is not ideal. In this case, modifying parameters and simple optimization would not classify the dataset much better. A more complicated classification model may be more suitable for this dataset, so our group will proceed to implement the classification with the random forest algorithm.

3.1.3 Default Random Forest

Now, we can run the default random forest algorithm to test the baseline accuracy. The default method in scikit-learn has three main parameters.

# n_estimators: That is, the maximum number of weak learners. The default is 100.

# oob_score: that is, whether to use out-of-bag samples to evaluate the quality of the model. Default is False.

# criterion: that is, the evaluation criterion used when the CART trees are split. The loss functions of classification models and regression models are different. As can be seen from the above, the important framework parameters of RF are relatively few, and the main one to pay attention to is n_estimators.
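A minimal sketch of evaluating the default random forest with cross-validation follows; X and y are as before, and the 5-fold setting is our assumption.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Defaults: n_estimators=100, criterion='gini', oob_score=False.
    rf = RandomForestClassifier(random_state=0)
    scores = cross_val_score(rf, X, y, cv=5, scoring="accuracy")
    print("Accuracy: %.4f (+/- %.4f)" % (scores.mean(), scores.std()))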

The default random forest algorithm provided a relatively better result compared to the single decision tree. Our group will make some modifications in the following steps to improve the model performance.

#  Accuracy: 0.5710 (+/- 0.0452)

Important Features

The improvement in the split criterion at each split is used as the feature importance; it is accumulated over all the trees for each variable. See Table 3.1 in the appendix.

Table 3.1
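A sketch of extracting these accumulated importances from a fitted forest is shown below; feature_names is a placeholder for the eleven column names and is not defined in the report.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    rf = RandomForestClassifier(random_state=0).fit(X, y)
    # feature_importances_ accumulates the split-criterion improvement over all trees,
    # normalized so the importances sum to 1.
    importances = pd.Series(rf.feature_importances_, index=feature_names)
    print(importances.sort_values(ascending=False))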

3.1.4 Parameter Tuning with Grid Search

In machine learning, the parameters that need to be chosen manually are called hyperparameters. For example, the number of decision trees in a random forest needs to be specified in advance. If the hyperparameters are not selected properly, underfitting or overfitting will occur.

One way of tuning hyperparameters is to modify them manually until a good combination is found, which can be very lengthy, so you can use Scikit-Learn's GridSearchCV to do this search. GridSearchCV guarantees to find the most accurate parameters within the specified parameter ranges, which requires traversing all possible parameter combinations. That is very time-consuming when dealing with large datasets and multiple parameters. The best-performing combination, found by iterating through all the candidate parameter choices and trying every possibility, is the final result. The principle is like finding the maximum value in an array, so grid search is suitable for a set of three or four (or fewer) hyperparameters [6].

#  n_estimators: choose from 10 to 100 with step size=1

#  Criterion: 'gini' or 'entropy'

#  oob_score (out-of-bag score): True or False

Note: Not all samples are used in the generation of each tree; the unused samples are called out-of-bag samples. The oob_score parameter indicates whether to use out-of-bag samples to evaluate the goodness of the model.
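A sketch of this grid search with GridSearchCV is shown below; the 5-fold cross-validation and accuracy scoring are our assumptions.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    param_grid = {
        "n_estimators": list(range(10, 101)),   # 10 to 100 with step size 1
        "criterion": ["gini", "entropy"],
        "oob_score": [True, False],
    }
    search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                          cv=5, scoring="accuracy", n_jobs=-1)
    search.fit(X, y)
    print(search.best_params_)   # best criterion, n_estimators, oob_score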

#  Result

criterion: 'gini'

n_estimators: 97

oob_score: False

Figure 3.2

3.1.5 Best parameter model

As in the previous procedures, we split the dataset into a test set and a training set (2:8). The optimized random forest model was then fitted to the training set. The test results were as follows.
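A sketch of this evaluation with the tuned parameters is given below; the random_state value is illustrative.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    best_rf = RandomForestClassifier(n_estimators=97, criterion="gini", oob_score=False,
                                     random_state=0)
    best_rf.fit(X_train, y_train)
    print(best_rf.score(X_test, y_test))                           # overall accuracy
    print(classification_report(y_test, best_rf.predict(X_test)))  # per-label results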

# Classification result for each label

Table 3.2

The table demonstrates that the optimized random forest model gives a better classification result. The accuracy score was calculated on the test set. The training set is about 80 percent of the data, randomly selected from the dataset, and the test set is the remaining 20 percent. The model has 65% accuracy. The result is relatively acceptable. However, our group could also apply other classification methods or a boolean conversion to this dataset.

3.2 Naive Bayes

Naive Bayes is a supervised machine learning method to classify labels based on different features. Since the quality of red wine is based on eleven different features, Naive Bayes can be trained to calculate the posterior probability to predict labels [5].

3.2.1 Test Results

Table 3.3. Prediction accuracy of the four Naive Bayes methods

Type        GaussianNB    MultinomialNB    ComplementNB    BernoulliNB
Accuracy    0.529         0.454            0.496           0.444

Table 3.3 shows the test results from the four types of Naive Bayes algorithms in the scikit-learn package. The results come from model prediction on the test set, after training each model on the training set. The training set is about 80 percent of the data, randomly selected from the dataset, and the test set is the remaining 20 percent. The best result comes from Gaussian Naive Bayes, with 52.9% accuracy. All the other algorithms are below 50%. The results are not as good as desired. Therefore, no matter which distribution is used, the algorithm cannot be trained properly for this dataset.

3.2.2 Discussion

The test results show that Naive Bayes cannot give a highly accurate prediction. No matter which Naive Bayes method is used, the highest test result was only 0.529. One guess for this undesirable result is that the features in the dataset are not totally independent of each other. Another guess is that the values of each feature are spread over a large range, and few of them have exactly the same value.

For the first guess, each feature in a dataset used with Naive Bayes must contribute independently to the probability of the result. However, some of the features may be related. For example, "residual sugar" may be related to "alcohol": because the alcohol in wine originates from sugar, the higher the density of sugar in the raw material, the higher the alcohol concentration tends to be, so a lower "residual sugar" may correspond to a higher "alcohol". In addition, "fixed acidity", "volatile acidity", and "citric acid" are all related to the pH value, which measures how acidic the liquid is; the more acid in the wine, the lower the pH value will be. Therefore, the potential relationships between features may strongly affect the Naive Bayes algorithm.

For the second guess, Naive Bayes calculates the likelihood based on the appearance frequency of feature values. However, the values of each feature are scattered over a large range, and many different values may correspond to the same wine quality. For example, for quality 6, "fixed acidity" can be 11.2, 7.9, 8.9, 7.8, 6.9, etc. Even a single value such as 7.9 can map to several wine qualities. Given a feature value, the model cannot use the posterior probability to predict a precise result. This is the most important reason why this algorithm performs poorly on this dataset.

3.3 Artificial Neural Network

3.3.1 Test Result

As shown in Table 3.4, the performance of the ANN was higher than the Naive Bayes algorithm and about the same as the Random Forest algorithm. The best prediction result was 62.7% accuracy, using the "tanh" activation function and the "adam" solver. The result occurred with a hidden layer size of 77.

Table 3.4. Max prediction score of the artificial neural network (L: hidden layer size at the maximum)

           identity        logistic        tanh            relu
lbfgs      0.610 (L:83)    0.617 (L:57)    0.614 (L:16)    0.613 (L:25)
sgd        0.519 (L:16)    0.571 (L:61)    0.560 (L:79)    0.569 (L:93)
adam       0.617 (L:86)    0.617 (L:92)    0.627 (L:77)    0.615 (L:11)

Figure 3.3. Training process using "tanh" and "adam" with hidden layer sizes from 1 to 100.

3.3.2 Discussion

One of the problems shown in the figure is that the test result is not linearly related to the hidden layer size. The figure shows a trend in which the training set score increases as the hidden layer size grows; however, the test set result varies more and more as the size increases. Therefore, increasing the hidden layer size cannot efficiently improve the model performance.

3.4 Boolean Label Conversion

3.4.1 Process

Originally, the target value ranged from 3 to 8 and took only integer values. An interesting alternative to regression modeling is to set an arbitrary cutoff on the target variable (wine quality), e.g., a quality of 7 or higher is classified as 'good/1' and the remainder as 'not good/0'.
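A sketch of this conversion with pandas is shown below; the file path and column names are assumptions based on the UCI red wine dataset.

    import pandas as pd

    wine = pd.read_csv("winequality-red.csv", sep=";")
    # Quality of 7 or higher -> 'good'/1, the remainder -> 'not good'/0.
    wine["good"] = (wine["quality"] >= 7).astype(int)
    X_bool = wine.drop(columns=["quality", "good"])
    y_bool = wine["good"]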

3.4.2 Random Forest for converted dataset

After converting the dataset to boolean labels, we ran the same procedure as the previous random forest experiments. Firstly, we used the default random forest algorithm to test whether this conversion of the dataset is effective.

Accuracy: 0.8768 (+/- 0.0406)

As shown, the boolean-labeled dataset is more amenable to classification. The grid search process was also used to choose the best hyperparameters. The optimized random forest algorithm for the boolean dataset has the following parameters.

Parameters chosen:
○ criterion: 'entropy'
○ n_estimators: 38
○ oob_score: True

The optimized random forest model provides a huge improvement on the boolean dataset. The model has 92.8% accuracy, which is relatively high compared to the single decision tree.

Table 3.5

label        precision    recall    f1-score    support
0            0.92         0.99      0.96        273
1            0.88         0.60      0.71        47
macro avg