CSC246/446 Machine Learning Project 1

Published: 2022-09-21


CSC246/446 Machine Learning

Project 1: Hunt the Polynomial

Overview

This assignment is meant to give you practical experience with machine learning style programming.

At an engineering level, you will need to learn to work with datasets, commandline arguments, file paths, and numerical programming. At an academic level, you will deepen your understanding of vectors, matrices, loss functions, overfitting vs generalization, and more broadly speaking, the fundamental principles of supervised learning.

Your objective is to implement equation 3.28 from PRML – i.e., to find the best fitting regularized weights for a linear model.  This part should be straightforward - you *must use python* and either numpy or pytorch.  The instructor is more familiar with numpy, so if you choose pytorch, you are in uncharted territory.  For the rest of this document anywhere you see numpy feel free to replace with pytorch.  In either case, your program must be *vectorized* and reasonably efficient. *You may not do any slow single threaded pure python numerical computation loops*. E.g., you should use the vector manipulation routines from numpy. These are implemented in compiled languages which are much more efficient than python, and if you use python for ML, you need to know how to do it efficiently. You should implement equation 3.28 somewhere in your code as a method.
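As a minimal sketch of the vectorized computation described above, assuming numpy: equation 3.28 of PRML gives the regularized least-squares weights as w = (lambda*I + Phi^T Phi)^-1 Phi^T t, where Phi is the design matrix. The function names `design_matrix` and `fit_regularized` are illustrative choices, not part of the required API, and the regularization constant is passed as `gamma` to match the commandline argument of that name.

```python
import numpy as np


def design_matrix(x, m):
    """Return the N x (m + 1) matrix whose columns are x**0, x**1, ..., x**m."""
    return np.vander(np.asarray(x, dtype=float), m + 1, increasing=True)


def fit_regularized(x, t, m, gamma=0.0):
    """Regularized least-squares weights: w = (gamma*I + Phi^T Phi)^-1 Phi^T t.

    The returned weights are ordered from the constant term upward.
    """
    phi = design_matrix(x, m)
    a = gamma * np.eye(m + 1) + phi.T @ phi
    b = phi.T @ np.asarray(t, dtype=float)
    # Solving the linear system avoids forming an explicit matrix inverse.
    return np.linalg.solve(a, b)
```

With `gamma=0` and a high order, the system can become singular or badly conditioned, in which case `np.linalg.solve` may raise an error or return unstable weights. A small positive `gamma` usually avoids this.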

You will be recreating - and extending - the results on polynomial fitting from the first chapter.  I have created various synthetic datasets, each one being created by choosing a particular polynomial, sampling a subset of the x-axis, evaluating the polynomial at each point, and then adding a small amount of zero-mean Gaussian noise. Your objective is to identify for each dataset the best fitting polynomial order that does not yield substantial overfitting.

You should write an additional method that sweeps through each polynomial order (up to a given maximum). For each step, you should then find the best weights and evaluate accuracy (via RMSE) and estimate the degree of overfitting. You will then have to develop your own heuristics to identify the "best" order as the one with the highest accuracy achievable without significant overfitting. That is the primary goal of the assignment. Additionally, you must include a readme with a brief discussion of your approach and the results of your method.
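The bookkeeping part of such a sweep might look like the sketch below, assuming numpy and a single held-out split. The names `rmse` and `sweep_orders` are hypothetical, as is the use of a held-out set as the overfitting estimate (cross validation is another option). The sketch only reports training and held-out RMSE per order. It deliberately does not choose a "best" order, since that heuristic is the part you are expected to invent.

```python
import numpy as np


def rmse(w, x, t):
    """Root-mean-square error of polynomial weights w (constant term first)."""
    pred = np.vander(x, len(w), increasing=True) @ w
    return np.sqrt(np.mean((pred - t) ** 2))


def sweep_orders(x_train, t_train, x_dev, t_dev, max_m, gamma=0.0):
    """Return a list of (order, train_rmse, dev_rmse) for orders 0..max_m."""
    results = []
    # One iteration per candidate order; the numerical work inside is vectorized.
    for m in range(max_m + 1):
        phi = np.vander(x_train, m + 1, increasing=True)
        a = gamma * np.eye(m + 1) + phi.T @ phi
        w = np.linalg.solve(a, phi.T @ t_train)
        results.append((m, rmse(w, x_train, t_train), rmse(w, x_dev, t_dev)))
    return results
```

A widening gap between the held-out and training RMSE as the order grows is one possible signal of overfitting. How to turn that signal into a decision is left open here.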


Project Requirements

This section details the project requirements in terms of the required API, file name conventions, documentation requirements (a report style readme), and data file formats.

API

In order to facilitate automated testing, you must name your program polyhunt.py and your readme should be named readme.txt. Since everyone will have the same filenames, it is critical that you include your names and UR emails in the contents of every file, and implement the "info" option, which prints your name and contact information. Your program must support the following commandline arguments:

• m - integer - polynomial order (or maximum in autofit mode)

• gamma - float - regularization constant (use a default of 0)

• trainPath - string - a filepath to the training data

• modelOutput - string - a filepath where the best fit parameters will be saved; if this is not supplied, then you do not have to output any model parameters

• autofit - boolean - a flag which when supplied engages the order sweeping loop; when this flag is false (or not supplied) you should simply fit a polynomial of the given order and parameters. In either case, save the best fit model to the file specified by the modelOutput path, and print the RMSE/order information to the screen for the TA to read.

• info - boolean - if this flag is set, the program should print your name and contact information (of all members, if working in a team).
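One way to declare these arguments with argparse is sketched below. The handout does not say whether the arguments are positional or `--name` style options; the `--name` form and the helper name `build_parser` are assumptions made for illustration, so check them against any skeleton code provided in class.

```python
import argparse


def build_parser():
    """Build a parser for the required arguments (the --name style is assumed)."""
    parser = argparse.ArgumentParser(description="Fit a regularized polynomial.")
    parser.add_argument("--m", type=int,
                        help="polynomial order (maximum order in autofit mode)")
    parser.add_argument("--gamma", type=float, default=0.0,
                        help="regularization constant")
    parser.add_argument("--trainPath", type=str,
                        help="filepath to the training data")
    parser.add_argument("--modelOutput", type=str, default=None,
                        help="filepath where the best fit parameters are saved")
    parser.add_argument("--autofit", action="store_true",
                        help="engage the order sweeping loop")
    parser.add_argument("--info", action="store_true",
                        help="print name and contact information")
    return parser


# Parsing an explicit list here for demonstration; a real program would call
# parse_args() with no arguments so that sys.argv is read instead.
args = build_parser().parse_args(["--m", "3", "--trainPath", "data.csv", "--autofit"])
print(args.m, args.gamma, args.trainPath, args.autofit, args.info)
```

Using `action="store_true"` makes `autofit` and `info` default to false when the flag is not supplied, which matches the behavior described for those two arguments.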

You may define additional optional arguments of your own choosing. Some suggestions:

• numFolds - the number of folds to use for cross validation

• devPath - a path to a held-out data set

• debug - a flag to turn on printing of extra information (for debugging)

• ... plus any others you wish.


Data Format

The datasets will be in CSV format - two columns of floating point numbers, separated by a comma. The model parameters will be stored in numpy format, making use of commented header rows to store the polynomial order and regularization constant. If you use pytorch, be sure to budget time to figure out an adapter.
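A rough sketch of reading and writing these files with numpy follows. The exact layout of the header rows is not specified here, so the `m = ...` and `gamma = ...` lines below are an assumed format, and `load_dataset` and `save_model` are hypothetical helper names. Defer to whatever format is shown in class.

```python
import numpy as np


def load_dataset(path):
    """Load a two-column CSV and return (x, t) as 1-D float arrays."""
    data = np.loadtxt(path, delimiter=",", ndmin=2)
    return data[:, 0], data[:, 1]


def save_model(path, w, m, gamma):
    """Write the weights one per line, preceded by commented header rows.

    np.savetxt prefixes every header line with '# ', so np.loadtxt skips
    those rows when the file is read back.
    """
    header = f"m = {m}\ngamma = {gamma}"  # assumed header layout
    np.savetxt(path, w, header=header)
```

Reading the weights back is then a plain `np.loadtxt(path)` call, since lines starting with `#` are treated as comments by default.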

Readme Requirements

Your readme should have the following six sections, in order:

1. Method.  Briefly and specifically describe your autofit method.  If you introduced any custom parameters that are necessary for your autofit, describe them here.

2. Regularization. Did you use it? If you didn’t, probably your code will crash. What values did you try? What value seemed to work the best?

3. Model Stability. Run your program a few different times on the data (or on different subsets of the data) - how does the output change? Ideally you want a method that always picks the same correct order every time.   In practice, there will be some variance. How stable are your predictions? If we run it a few times, will our results agree with yours?

4. Results.  Describe the results of your autofit method.  Does it match the expected outcome on the labeled datasets?  What are the results for the second (unlabeled) group of datasets? Do you believe these results are correct?

5. Collaboration. If you worked solo, please state it. If you worked as a group, please describe one-by-one what each team member’s contributions were to the project. Be as specific as possible.

6. Notes. This section is optional. Include any notes to TAs regarding quirks of your program, additional features, or other comments you believe may be helpful to know when testing and grading.

Support

I am more interested in your overall approach and the overall quality of your work than I am in your particular expression of numerical linear algebra in the code. Essentially I will try to give significant hints in class on the math aspects, but you must do the main loop entirely on your own and out of your own invention. I tried to write these instructions to be helpful guidance, but ultimately the goal of this project is to demonstrate an understanding of the material - in this case, avoiding overfitting while working with regularized linear models.

I will show you how to load/save the models using numpy, and how to use commandline arguments in python. I will try to get you skeleton code for the commandline args, but I am not sure when I will have an opportunity, so you should try to figure it out on your own as well. We’ll use argparse, it’s pretty straightforward: https://docs.python.org/3/library/argparse.html

Any questions about the project should be posted to the Blackboard discussion board, under the thread "Project 1 Questions". Submission instructions to be released later - we will use a script on the CSUG network.

Academic Honesty

As stated on the syllabus, this course will follow the UofR Academic Honesty policy. Failure to follow the academic honesty policy carries stiff penalties.  A typical result is a zero on the assignment with a further reduction of the overall grade. As is required by the policy, all instances of academic dishonesty will be reported.

Grading

Your project will be graded according to these criteria:

• program correctness – the weights output are correct for the basic version, all commandline arguments implemented and functional, doesn’t crash, writes correct model file, tested by contents of model file at end and response to commandline arguments

• program efficiency – good use of vector operations; no unnecessarily slow or inefficient use of for loops, list comprehensions, or data structures, tested by execution time and code inspection

• method accuracy – your estimates are reasonably close to the ground truth (i.e., the actual polynomial order), tested on various datasets

• method consistency – your program’s autofit results agree with what you claim in your report

• readme – everything I asked for is there, clear and concise, consistent and clean formatting

Please note: In order to streamline grading, please ensure that your full name and UR email are listed at the top of the readme and your source files. If you worked with a partner, be sure to list both names in both places.