Machine Learning Programming Assignment II-a
The following assignments will test your understanding of topics covered in the first three weeks of the course. These assignments will count towards your grade and should be submitted through Canvas by 03-12-2023 at 23:59 (CET). You must submit these assignments in groups (as registered on Canvas). You can get at most 5 points for these assignments, which is 5% of your final grade.
• Alongside the code for your experiments, you are also required to present a report summarizing the observations and results of each of the experiments. For these reports, you can use text and graphs/plots (matplotlib). You should place these report blocks within the Jupyter Notebook in separate text cells. Plots can be appropriately placed near the text explanation. Your final submission should be a single Jupyter Notebook with code and report blocks.
• While it is perfectly acceptable to brainstorm and discuss solutions with other colleagues, please do not copy code. We will check all submissions for code similarity with each other and openly available web solutions.
• Please ensure that all code blocks are functional before you finalize your submission. Points will NOT be awarded for exercises where code blocks are non-functional.
Submission
Submit your solutions as a Jupyter Notebook (*.ipynb). To test the code, we will use Anaconda Python (3.8). Please state the names and student IDs of the authors at the top of the submitted file.
1 Part One
1.1 Classification on Wine Dataset
In assignment one, you used several regression models to predict the quality of wine. However, the label quality is not strictly continuous. Since quality is an integer in [0, 10], it can also be treated as a discrete, categorical label, which turns the problem into a multi-class classification task. In this exercise, we ask you to predict wine quality in a multi-class classification setup using a LogisticRegression classifier. In fewer than 50 words, present the potential advantages and drawbacks of viewing this problem as a multi-class classification task.
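As a starting point, the multi-class setup might be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic (generated with make_classification), the quality labels 4 to 7 are made up, and the scaling step is an optional choice rather than a requirement of this exercise. Replace the synthetic data with the wine data from assignment one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine data: 11 numeric features, 4 classes.
X, y = make_classification(
    n_samples=600, n_features=11, n_informative=6, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
y = y + 4  # relabel the classes as quality scores 4..7 (illustrative only)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# LogisticRegression accepts integer class labels directly, so the
# quality scores can be used as categorical targets without extra encoding.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
predicted = clf.predict(X_test)
print(f"accuracy: {accuracy:.3f}")
print("predicted classes:", np.unique(predicted))
```

Note that a classifier treats the quality scores as unordered categories, so predicting 4 for a true 7 is penalized the same as predicting 6. This is one point you may want to consider in your written answer.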
2 Part Two
2.1 Implementation Details
In this assignment, you will work on feature scaling and classification tasks. Since these steps have a fixed order of execution, you are required to use the sklearn Pipeline functionality to encapsulate your preprocessing transformations and classification models into a single estimator. In the following assignments, you should perform preprocessing, model fitting, and prediction operations only with a Pipeline estimator.
Any grid search should also be performed on the Pipeline, not on standalone estimators or transforms.
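A minimal sketch of a grid search over a Pipeline is given below. The data is synthetic, the step names "scaler" and "clf" are arbitrary labels chosen for this example, and the candidate values of C are placeholders rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic binary data, used only so the sketch runs on its own.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# The scaler and the classifier are wrapped into a single estimator.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LinearSVC(max_iter=10000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

# The search is run on the whole pipeline, so the scaler is re-fitted
# on the training part of every cross-validation fold.
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print("best params:", search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")
```

Searching over the Pipeline rather than a standalone classifier keeps the scaling statistics from being computed on the validation folds.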
2.2 Data
With this assignment, you will receive three data files: train.csv, test.csv, and test_label.csv.
The dataset relates to a Portuguese banking institution’s direct marketing campaigns (phone calls). The data contains 17 features that encode various parameters such as age, job, marital status, and education. The classification goal is to predict whether the client will subscribe to a term deposit (the boolean label "y"). You should train and validate on train.csv, and test on test.csv and test_label.csv.
2.3 Data Preprocessing
Similar to the previous assignment, pandas can help with loading and preprocessing the raw data. For the preprocessing stage of this assignment, you will need to perform the following tasks:
1. Load the data (CSV) file.
2. Inspect individual features to ensure they are in the right datatype. Pandas will try to intelligently infer the correct datatype but you still need to inspect the results yourself.
3. Features that contain categorical data should be converted to a one-hot encoding. You will find pd.get_dummies() or sklearn.preprocessing.OneHotEncoder helpful for this task. Please remember that these operations must be performed on the data before the train/test split.
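The three steps above could be sketched as follows. The inline CSV is a tiny made-up stand-in for train.csv, and its column names and values are assumptions for illustration. Check the real file for the actual columns and delimiter.

```python
import io

import pandas as pd

# Tiny inline stand-in for train.csv; column names and rows are made up.
raw = io.StringIO(
    "age,job,marital,balance,y\n"
    "34,admin.,married,1200,no\n"
    "51,technician,single,-50,yes\n"
    "29,admin.,single,300,no\n"
    "45,services,divorced,2500,yes\n"
)

# Step 1: load the data. With the real file: pd.read_csv("train.csv")
df = pd.read_csv(raw)

# Step 2: inspect the inferred dtypes before trusting them.
print(df.dtypes)

# Encode the label as 0/1 and separate it from the features.
y = (df["y"] == "yes").astype(int)
features = df.drop(columns="y")

# Step 3: one-hot encode every non-numeric column.
categorical_cols = features.select_dtypes(exclude="number").columns.tolist()
X = pd.get_dummies(features, columns=categorical_cols)

print(X.columns.tolist())
```

If separate files are encoded independently with pd.get_dummies(), a category missing from one file produces mismatched columns. OneHotEncoder(handle_unknown="ignore") fitted on the training data is one way to avoid this.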
2.4 Models
In this exercise, you will build pipelines with 2 components:
1. Feature Scaling: The range of raw values can vary widely in a dataset. To bring this variation within the same scale, feature scaling is helpful. For this task, you are asked to experiment with the StandardScaler or MinMaxScaler provided within sklearn to scale your data.
2. Classification: The second component of your pipeline is a classifier. In this homework, you are asked to use the LinearSVC, LogisticRegression and KNeighborsClassifier classifiers. For these classifiers, you must perform the following experiments:
(a) For a LinearSVC classifier.
i. Use GridSearchCV to find an optimal value for the regularization parameter C.
ii. Fit your classifier on scaled as well as unscaled versions of the data and report the estimator scores. Report your observations on the effects of scaling on model performance.
In not more than 50 words, present your observations on the effects of C and feature scaling on model performance. This explanation should summarize your observations from the experiments above. You can use a text box (i.e. Markdown Cell) in Jupyter to write down your analysis. Feel free to experiment with other hyperparameters as well.
(b) For a LogisticRegression classifier.
i. Use GridSearchCV to find an optimal value for the “inverse of regularization strength” C.
ii. Fit your classifier on scaled as well as unscaled versions of the data and report the estimator scores. Report your observations on the effects of scaling on model performance.
In not more than 50 words, present your observations on the effects of C and feature scaling on model performance. This explanation should summarize your observations from the experiments above. You can use a text box (i.e. Markdown Cell) in Jupyter to write down your analysis. Feel free to experiment with other hyperparameters as well.
(c) For a KNeighborsClassifier classifier.
i. Use GridSearchCV to find an optimal value for the “number of neighbors” n_neighbors.
ii. Fit your classifier on scaled as well as unscaled versions of the data and report the estimator scores. Report your observations on the effects of scaling on model performance.
In not more than 50 words, present your observations on the effect of n_neighbors and feature scaling on model performance. This explanation should summarize your observations from the experiments above. You can use a text box (i.e. Markdown Cell) in Jupyter to write down your analysis. Feel free to experiment with other hyperparameters as well.
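One possible way to organize the scaled versus unscaled experiments for all three classifiers is sketched below. The data is synthetic, with one feature deliberately stretched to a much larger range, and the hyperparameter candidates are arbitrary choices for this example. The unscaled variant uses the string "passthrough" in place of the scaler, so both variants remain Pipeline estimators. Fits on the unscaled data may emit convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic binary data; one feature is multiplied by 1000 so that the
# features live on very different ranges. This is not the bank data.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X[:, 0] *= 1000.0
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Each classifier is paired with a grid over the hyperparameter named in
# the assignment. The candidate values are placeholders.
experiments = {
    "LinearSVC": (LinearSVC(max_iter=5000), {"clf__C": [0.01, 0.1, 1, 10]}),
    "LogisticRegression": (
        LogisticRegression(max_iter=5000),
        {"clf__C": [0.01, 0.1, 1, 10]},
    ),
    "KNeighborsClassifier": (
        KNeighborsClassifier(),
        {"clf__n_neighbors": [3, 5, 11, 21]},
    ),
}

results = {}
for name, (clf, grid) in experiments.items():
    variants = [("scaled", StandardScaler()), ("unscaled", "passthrough")]
    for label, scaler in variants:
        pipe = Pipeline([("scaler", scaler), ("clf", clf)])
        search = GridSearchCV(pipe, grid, cv=5)
        search.fit(X_train, y_train)
        # score() evaluates the refitted best pipeline on held-out data.
        test_score = search.score(X_test, y_test)
        results[(name, label)] = (search.best_params_, test_score)

for (name, label), (params, score) in results.items():
    print(f"{name:22s} {label:9s} {params} test accuracy={score:.3f}")
```

Swapping StandardScaler for MinMaxScaler in the variants list gives the corresponding comparison for the second scaler.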
3 Evaluation Metrics
For each of the pipelines, you must report the following classification metrics:
1. Accuracy
2. Macro and Micro-Averaged Precision and Recall
3. F1 Score
Additionally, present your observations on what these scores mean for the models under consideration. These metrics will be discussed at the beginning of Week 5.
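The metrics listed above are all available in sklearn.metrics. The sketch below computes them on a small hand-made set of labels and predictions, which are purely illustrative. In your notebook, y_pred would come from the predict method of your fitted pipeline. Since the list does not state an averaging mode for the F1 score, both macro and micro averages are shown here as an assumption.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)

# Hand-made labels and predictions, purely for illustration.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

report = {"accuracy": accuracy_score(y_true, y_pred)}
for average in ("macro", "micro"):
    # "macro" averages the per-class scores with equal weight per class;
    # "micro" pools the true/false positives and negatives of all classes.
    report[f"precision_{average}"] = precision_score(
        y_true, y_pred, average=average
    )
    report[f"recall_{average}"] = recall_score(y_true, y_pred, average=average)
    report[f"f1_{average}"] = f1_score(y_true, y_pred, average=average)

for metric, value in report.items():
    print(f"{metric:16s} {value:.3f}")
```

For single-label classification, micro-averaged precision, recall, and F1 all equal the accuracy, so the macro averages are usually the more informative numbers when the classes are imbalanced.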
4 Grading
Component | Points
Classification on Wine dataset | 1
Feature Scaling | 1
Classifiers | 1
Hyperparam Optimization | 1
Experiments, Observations, Analysis and Code Quality | 1
2023-12-04