Machine Learning

Programming Assignment II-a

The following assignments will test your understanding of topics covered in the first three weeks of the course. These assignments will count towards your grade and should be submitted through Canvas by 03-12-2023 at 23:59 (CET). You must submit these assignments in groups (as registered on Canvas).  You can get at most 5 points for these assignments, which is 5% of your final grade.

• Alongside the code for your experiments, you are also required to present a report summarizing the observations and results of each experiment. For these reports, you can use text and graphs/plots (matplotlib). Place these report blocks within the Jupyter Notebook in separate text cells. Plots can be placed near the text that explains them. Your final submission should be a single Jupyter Notebook with code and report blocks.

• While it is perfectly acceptable to brainstorm and discuss solutions with other colleagues, please do not copy code. We will check all submissions for code similarity with each other and with openly available web solutions.

• Please ensure that all code blocks are functional before you finalize your submission. Points will NOT be awarded for exercises where code blocks are non-functional.

Submission

You can submit your solutions within a Jupyter Notebook (*.ipynb).  To test the code, we will use Anaconda Python (3.8).  Please state the names and student ids of the authors at the top of the submitted file.

1    Part One

1.1    Classification on Wine Dataset

In assignment one, you used several regression models to predict the quality of wine. However, the label quality is not strictly continuous. Since quality is an integer in [0, 10], it can also be seen as a discrete, categorical label. Consequently, this problem can also be treated as a multi-class classification task. In this exercise, we ask you to predict wine quality in a multi-class classification setup using a LogisticRegression classifier. In fewer than 50 words, present the potential advantages and drawbacks of viewing this problem as a multi-class classification task.
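The multi-class view described above can be sketched as follows. This is only an illustration under stated assumptions: the wine data from assignment one is not reproduced here, so the features and the integer quality labels are synthetic stand-ins, and names such as `quality` and `X_val` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine data (assumption): three numeric features
# and an integer "quality" label derived from a noisy linear score.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
score = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=600)
quality = np.digitize(score, bins=[-1.0, 0.0, 1.0]) + 4  # integer labels 4..7

X_train, X_val, y_train, y_val = train_test_split(
    X, quality, test_size=0.25, random_state=0, stratify=quality
)

# Each distinct integer is treated as an unordered class.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

print("classes:", clf.classes_)
print("validation accuracy:", clf.score(X_val, y_val))
print("class probabilities, first sample:", clf.predict_proba(X_val[:1]).round(3))
```

Note that the classifier ignores the ordering of the labels: predicting 4 for a true 7 is penalized exactly like predicting 6. This is one point worth weighing in the requested discussion.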

2    Part Two

2.1    Implementation Details

In this assignment, you will work on feature scaling and classification tasks. Since these steps have a fixed sequence of execution, you are required to use the sklearn Pipeline functionality to encapsulate your preprocessing transformations and classification models into a single estimator. In the following assignments, you should perform preprocessing, model fitting, and prediction operations only with a Pipeline estimator.

Any grid search should also be performed on the Pipeline, not on standalone estimators or transforms.
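A minimal sketch of a grid search run on a Pipeline rather than on a standalone estimator is shown below. The data comes from make_classification and is purely a placeholder; the step names "scaler" and "clf" and the candidate values for C are illustrative assumptions, not prescribed by the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Placeholder data (assumption): stands in for the real training set.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Scaling and classification wrapped into a single estimator.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LinearSVC(max_iter=5000)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", search.best_score_)
```

Because the scaler sits inside the pipeline, it is refit on the training folds of each cross-validation split, so no statistics from the validation fold leak into the preprocessing.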

2.2   Data

With this assignment, you will receive the following additional files:

• Data files titled train.csv, test.csv and test_label.csv.

The dataset relates to a Portuguese banking institution's direct marketing campaigns (phone calls). The data contains 17 features that encode various parameters such as age, job, marital status, and education. The classification goal is to predict whether the client will subscribe to a term deposit (the boolean label "y"). You should train and validate on train.csv, and test on test.csv and test_label.csv.

2.3   Data Preprocessing

Similar to the previous assignment, pandas can help with loading and preprocessing the raw data. For the preprocessing stage of this assignment, you will need to perform the following tasks:

1.  Load the data (CSV) file.

2.  Inspect individual features to ensure they are in the right datatype. Pandas will try to intelligently infer the correct datatype but you still need to inspect the results yourself.

3.  Features that contain categorical data should be converted to a one-hot encoding. You will find pd.get_dummies() or sklearn.preprocessing.OneHotEncoder helpful for this task. Please remember that these operations must be performed on the data before the train/test split.
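The three preprocessing steps above can be sketched as follows. The inline CSV is a made-up sample, not the real train.csv; its column names, values, comma separator, and the "yes"/"no" label encoding are all assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical sample rows (assumption); replace with pd.read_csv("train.csv").
csv_text = """age,job,marital,balance,y
58,management,married,2143,no
44,technician,single,29,no
33,entrepreneur,married,2,yes
47,blue-collar,married,1506,no
"""

# 1. Load the data.
df = pd.read_csv(io.StringIO(csv_text))

# 2. Inspect the inferred datatypes before relying on them.
print(df.dtypes)

# Separate the label and map it to integers (assumed "yes"/"no" encoding).
y = df["y"].map({"no": 0, "yes": 1})
X = df.drop(columns="y")

# 3. One-hot encode every non-numeric feature column.
categorical_cols = [
    col for col in X.columns if not pd.api.types.is_numeric_dtype(X[col])
]
X_encoded = pd.get_dummies(X, columns=categorical_cols)
print(X_encoded.columns.tolist())
```

pd.get_dummies only creates columns for categories present in the frame it is given, which is one reason to encode before splitting. sklearn.preprocessing.OneHotEncoder is an alternative that can be placed inside a pipeline.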

2.4   Models

In this exercise, you will build pipelines with 2 components:

1.  Feature Scaling:  The range of raw values can vary widely in a dataset.  To bring this variation within the same scale, feature scaling is helpful. For this task, you are asked to experiment with the StandardScaler or MinMaxScaler provided within sklearn to scale your data.

2.  Classification: The second component of your pipeline is a classifier. In this homework, you are asked to use the LinearSVC, LogisticRegression and KNeighborsClassifier classifiers. For these classifiers, you must perform the following experiments:

(a)  For a LinearSVC classifier:

i.  Use GridSearchCV to find an optimal value for the regularization parameter C.

ii.  Fit your classifier on scaled as well as unscaled versions of the data and report the estimator scores. Report your observations on the effects of scaling on model performance.

In not more than 50 words, present your observations on the effects of C and feature scaling on model performance. This explanation should summarize your observations from the experiments above. You can use a text box (i.e. Markdown Cell) in Jupyter to write down your analysis. Feel free to experiment with other hyperparameters as well.

(b)  For a LogisticRegression classifier:

i.  Use GridSearchCV to find an optimal value for the “inverse of regularization strength” C.

ii.  Fit your classifier on scaled as well as unscaled versions of the data and report the estimator scores. Report your observations on the effects of scaling on model performance.

In not more than 50 words, present your observations on the effects of C and feature scaling on model performance. This explanation should summarize your observations from the experiments above. You can use a text box (i.e. Markdown Cell) in Jupyter to write down your analysis. Feel free to experiment with other hyperparameters as well.

(c)  For a KNeighborsClassifier classifier:

i.  Use GridSearchCV to find an optimal value for the “number of neighbors” n_neighbors.

ii.  Fit your classifier on scaled as well as unscaled versions of the data and report the estimator scores. Report your observations on the effects of scaling on model performance.

In not more than 50 words, present your observations on the effect of n_neighbors and feature scaling on model performance. This explanation should summarize your observations from the experiments above. You can use a text box (i.e. Markdown Cell) in Jupyter to write down your analysis. Feel free to experiment with other hyperparameters as well.
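The scaled-versus-unscaled experiment can be organized as in the sketch below, shown here for KNeighborsClassifier. The data is synthetic with deliberately mismatched feature ranges, and the candidate n_neighbors values and the `results` dictionary are illustrative assumptions. The same structure would apply to LinearSVC and LogisticRegression with a grid over "clf__C".

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption) with feature ranges exaggerated on purpose.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=4, random_state=0
)
X = X * np.array([1, 10, 100, 1000, 1, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Two pipelines that differ only in the presence of the scaling step.
pipelines = {
    "unscaled": Pipeline([("clf", KNeighborsClassifier())]),
    "scaled": Pipeline([
        ("scaler", StandardScaler()),
        ("clf", KNeighborsClassifier()),
    ]),
}
param_grid = {"clf__n_neighbors": [1, 3, 5, 11, 21]}

results = {}
for name, pipe in pipelines.items():
    search = GridSearchCV(pipe, param_grid, cv=5)
    search.fit(X_train, y_train)
    results[name] = {
        "best_n_neighbors": search.best_params_["clf__n_neighbors"],
        "cv_score": search.best_score_,
        "test_score": search.score(X_test, y_test),
    }
    print(name, results[name])
```

Swapping StandardScaler for MinMaxScaler in the "scaled" pipeline gives the second scaling variant without changing anything else in the loop.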

3    Evaluation Metrics

For each of the pipelines, you must report the following classification metrics:

1.  Accuracy

2.  Macro and Micro-Averaged Precision and Recall

3.  F1 Score

Additionally, present your observations on what these scores mean for the models under consideration. These metrics will be discussed at the beginning of Week 5.
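The metrics listed above are all available in sklearn.metrics. The sketch below computes them on a tiny hand-made pair of label vectors (an assumption for illustration, not output from any of the assignment's models).

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)

# Hand-made labels (assumption): 6 negatives and 4 positives.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    # Macro: unweighted mean of the per-class scores.
    "precision_macro": precision_score(y_true, y_pred, average="macro"),
    "recall_macro": recall_score(y_true, y_pred, average="macro"),
    # Micro: computed from counts pooled over all classes.
    "precision_micro": precision_score(y_true, y_pred, average="micro"),
    "recall_micro": recall_score(y_true, y_pred, average="micro"),
    # Default F1 is for the positive class (label 1) only.
    "f1_binary": f1_score(y_true, y_pred),
    "f1_macro": f1_score(y_true, y_pred, average="macro"),
}

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In a single-label problem such as this one, micro-averaged precision and recall both equal the accuracy, while the macro averages give each class equal weight regardless of how many samples it has. This difference matters when the classes are imbalanced.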

4   Grading

Component                                               Points
Classification on Wine dataset                          1
Feature Scaling                                         1
Classifiers                                             1
Hyperparameter Optimization                             1
Experiments, Observations, Analysis and Code Quality    1