
CSCI433/CSCI933: Machine Learning - Algorithms and Applications Assignment Problem Set #1

Published: 2023-04-26



Due date: Friday April 21, 11:55 p.m.

https://scikit-learn.org/stable/auto_examples/linear_model/plot_sparse_logistic_regression_mnist.html#sphx-glr-auto-examples-linear-model-plot-sparse-logistic-regression-mnist-py

Introduction

In this assignment, you will experiment with building regression models and write a report on your work. Unlike the ordinary regression model, which predicts the value of a dependent variable from a collection of independent variables, you will study logistic regression, which can be used for classification. The Python library scikit-learn exposes an API that can accomplish this task easily. You will study the API (see the documentation at https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) to gain insight into its various parameters. A set of book chapters (from Géron (2019)) has been put together to help you modify the code in the scikit-learn documentation to complete the task in this assignment. The chapter “End-to-End Machine Learning Project” is particularly helpful in preparing data (Géron, 2019, Ch. 2). Chapter 4, “Training Models”, is useful in building various regression models (Géron, 2019, Ch. 4).

The data to be used in this assignment is the publicly available “Wine quality” dataset published by Cortez, Cerdeira, Almeida, Matos, and Reis (2009) for research purposes. This dataset and its description are provided along with the specifications of this assignment (i.e. this document). You are encouraged to search for the original paper of Cortez et al. (2009) and peruse it. The key task that can be performed with this dataset is classification into the various wine quality scores (they range from 0 to 10). There is an added twist to this dataset in that the scores are ordered: 10 is the best and 0 is the worst. This type of problem is referred to as ordinal in nature. It is similar to the type of data you collect when you ask people to indicate their preference in a survey and you provide the options “satisfied”, “not satisfied” and “don't care” as choices. These are ordered because there is the notion of one choice being better than another (e.g. perhaps satisfied > don't care > not satisfied).
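The survey example above can be made concrete with a short sketch. It assumes pandas is available and uses made-up survey answers (not the wine data); the variable names are illustrative only. An ordered categorical records which choice ranks above which, and its integer codes are one possible encoding of an ordinal target.

```python
import pandas as pd

# Levels listed from worst to best; this ordering is the "ordinal" information.
levels = ["not satisfied", "don't care", "satisfied"]

# Hypothetical survey answers, for illustration only.
answers = pd.Series(["satisfied", "don't care", "not satisfied", "satisfied"])

# ordered=True tells pandas that the categories have a rank.
ordered = pd.Categorical(answers, categories=levels, ordered=True)

# Integer codes follow the order of `levels` (0 = worst, 2 = best).
codes = ordered.codes
print("codes:", list(codes))

# Rank comparisons are only allowed because the categorical is ordered.
above_neutral = list(ordered > "don't care")
print("better than don't care:", above_neutral)
print("worst answer:", ordered.min(), "| best answer:", ordered.max())
```

The wine quality scores are already integers, so their numeric order plays the same role as the category order here.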

With this short background information, we can deduce that the problem at hand is an ordinal logistic regression problem. You will approach this problem in two ways. First, you will use the scikit-learn library to build a multinomial logistic regression model. This model does not take the order of the categories into consideration. Unfortunately, scikit-learn does not implement ordinal logistic regression. So, you will use another library (statsmodels - https://www.statsmodels.org/dev/install.html) to build the ordinal logistic regression model. You have been asked to use both scikit-learn and statsmodels so that you gain some experience with the two libraries.

What needs to be done or considered

1. Read the excerpts from the book by Géron (2019) provided with this specification. You do not need to read the whole excerpt. Skim through it first and then focus on what is relevant for this assignment. For those who do not know how to start writing code for this assignment, this excerpt provides ample sample code to get you started. Use the code freely and customize it for the task at hand.

2. Understand (visualize/investigate) the dataset using information from Chapter 2. You will report on your findings. While there may be no missing data, this dataset may exhibit data imbalance. Report on how you deal with it.

3. In your data preparation, set aside 5 randomly chosen data items. You will use this subset to compare the relative accuracy of the two models you will build.

4. Study the documentation of the scikit-learn logistic regression API. Build a logistic regression model using scikit-learn and report on the best accuracy you are able to obtain. In building this model, explore the use of L1, L2 and elastic-net norms. Report on your results for each norm and explain why the results might be different. Did you notice any difference in the importance of the independent variables? You will also notice the choice of solvers available in the API. Explore them and report on any difference in the results.

5. Study the documentation of the statsmodels ordinal logistic regression API. Build a logistic regression model using statsmodels (see the example at https://www.statsmodels.org/dev/examples/notebooks/generated/ordinal_regression.html) and report on the best accuracy you are able to obtain.

6. Using the five data items you set aside earlier, compare the two models.

7. Write a report according to the template provided. You MUST follow the template in setting out your sections. You can have subsections tailored to your presentation style, but the section headings MUST NOT be changed. A LaTeX template has been provided along with this specification. Your report MUST NOT be more than 6 pages. This includes the title page and references.

8. Your report must cite at least five sources (journal articles, conference papers or books) to support the theory section. You must cite the source of the dataset (already provided).

9. Your report must include graphical outputs. However, you need to be judicious in your choice of the plots that you include.  Remember that every graphical plot must have a label and caption, and must be described in the text of your report. Otherwise you will lose substantial marks.

10. It is possible that you will use Jupyter Notebook to develop your code. Please note that you cannot submit a notebook file for this assignment. Only Python source code can be submitted (i.e. a .py file).

11. If your source code does not work or emits error messages, your code will not be debugged or fixed. Your report will be marked out of 50% of the total marks for this assignment.
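Steps 3 and 4 above can be sketched as follows. This is a minimal sketch under stated assumptions, not a reference solution: the data come from `make_classification` as a stand-in for the wine dataset, the `saga` solver is used because it accepts all three penalties, and `class_weight="balanced"` is shown as only one possible way to respond to class imbalance. Depending on the installed scikit-learn version, the `penalty` argument may emit a deprecation warning.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# ASSUMPTION: synthetic stand-in for the wine data (11 features, 3 classes).
X, y = make_classification(n_samples=600, n_features=11, n_informative=6,
                           n_classes=3, random_state=0)

# Step 3: set aside 5 randomly chosen items for the final model comparison.
X_rest, X_hold, y_rest, y_hold = train_test_split(
    X, y, test_size=5, random_state=0)

# Stratified train/test split of the remaining data.
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.2, stratify=y_rest, random_state=0)

# Step 4: one model per penalty; l1_ratio is only needed for elastic-net.
penalties = {"l1": {}, "l2": {}, "elasticnet": {"l1_ratio": 0.5}}
results = {}
for name, extra in penalties.items():
    model = make_pipeline(
        StandardScaler(),  # saga converges much faster on scaled features
        LogisticRegression(penalty=name, solver="saga", C=1.0, max_iter=5000,
                           class_weight="balanced", **extra),
    )
    model.fit(X_train, y_train)
    coef = model[-1].coef_  # shape: (n_classes, n_features)
    results[name] = {
        "accuracy": model.score(X_test, y_test),
        "zero_coefs": int(np.sum(coef == 0)),  # sparsity induced by the norm
        "holdout_pred": model.predict(X_hold),
    }
    print(f"{name}: accuracy={results[name]['accuracy']:.3f}, "
          f"zero coefficients={results[name]['zero_coefs']}")
```

Comparing the number of zero coefficients and the coefficient magnitudes across penalties is one way to approach the question about the importance of the independent variables. Other solvers support only a subset of the penalties, so the solver comparison needs a check against the API documentation.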

What needs to be submitted

• You will prepare a “zip” or “rar” file containing your report (6-page PDF file) and your Python code file (named logistic_regression_wine.py).

• Your code must run from the command line as:

python logistic_regression_wine.py

and write results indicating that your code works (e.g. classification accuracy for each method) to standard output (stdout).

• Submit the “zip” or “rar” file via the Moodle dropbox provided, on or before the deadline.

References

Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4), 547-553.

Géron, A. (2019). Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems (2nd ed.). CA, USA: O’Reilly Media, Inc.