IEOR 142: Introduction to Machine Learning and Data Analytics

Spring 2021


Description:

This course introduces students to key techniques in machine learning and data analytics through a diverse set of examples using real datasets from domains such as e-commerce, healthcare, social media, sports, the Internet, and more. Through these examples, exercises in Python, and a comprehensive team project, students will gain experience understanding and applying techniques such as linear regression, logistic regression, classification and regression trees, random forests, boosting, text mining, data cleaning and manipulation, data visualization, network analysis, time series modeling, clustering, principal component analysis, regularization, and large-scale learning.


Note:

Students cannot receive credit for both IEOR 142 Introduction to Machine Learning and Data Analytics and IEOR 242 Applications in Data Analysis.


Instructor:

Prof. Paul Grigas

Industrial Engineering and Operations Research

Email: please use Piazza instead of emailing

Office Hours: TBA


Head Graduate Student Instructor (Head GSI):

Jieqiong (Julie) Wang

M.S. Student, Industrial Engineering and Operations Research

Email: please use Piazza instead of emailing

Office Hours: TBA


Graduate Student Instructors (GSIs):

Shunan Jiang

Ph.D. Student, Industrial Engineering and Operations Research

Email: please use Piazza instead of emailing

Office Hours: TBA


Mo Liu

Ph.D. Student, Industrial Engineering and Operations Research

Email: please use Piazza instead of emailing

Office Hours: TBA


Tor Nitayanont

Ph.D. Student, Industrial Engineering and Operations Research

Email: please use Piazza instead of emailing

Office Hours: TBA


Lecture:

Tuesday and Thursday 9:30 – 11:00am, Zoom


Discussion Sections/Labs:

Thursday A 4:00 – 5:00pm, Zoom

Thursday B 4:00 – 5:00pm, Zoom

Friday 4:00 – 5:00pm, Zoom


Course Website and Communication Policy:

Announcements, lecture materials, homework assignments, and all other course materials will be posted on the bCourses site. Additionally, we will use a Piazza forum as the main electronic communication method for the course. If you have any questions regarding the course, please post them on Piazza rather than emailing the course staff. If you have a question or concern that is private in nature (i.e., something you would normally send as an email to the course staff), please use a private post on Piazza so that only the course instructor and head GSI (or other GSIs as appropriate) can see your message. You are encouraged to use public posts in situations where other students may benefit from the discussion. In rare, exceptional circumstances where your message should be kept confidential from the GSIs, please email the course instructor and begin the subject line with “[IEOR 142 Confidential]”. In summary, you should observe the following priority list for course-related communications:

1. Make a public post on Piazza

2. Make a private post on Piazza that only the course instructor and head GSI (or other GSIs) can see

3. In exceptional circumstances, send an email to the course instructor using “[IEOR 142 Confidential]” to start the subject line.

We ask that you also please observe the following etiquette on Piazza:

1. Do not post answers: Please do not post any answers or your current results on Piazza. Instead, you should explain the key points of your question in a way that allows other students to figure out the essence of the problem on their own. Post problem spoilers after the due date. If you think that your post might give out too much information about the problem solution, then make it private and let the course staff know.

2. No pre-grading: We will not answer any questions of the form “Is this the correct way to solve Homework X, Problem Y?”

3. Aim for public posts: Other students may have the same question, so please try to make your posts public.

4. Formatting: Please format code using the code button and format mathematical equations using the fx button or $$math_equation$$.

5. Piazza is not office hours: Please do not ask questions that are too broad, would require a long time to explain in person, etc. These types of questions should be reserved for office hours.

6. Discussion and collaboration: We encourage you to answer or comment on your fellow students’ posts if you know the answer or would like to discuss.


Prerequisites:

IEOR 165 or an equivalent course in statistics. Prior exposure to optimization is helpful but not strictly necessary. Some programming experience/literacy is expected.


Readings/Resources:

The main textbook for this course is:

• An Introduction to Statistical Learning: with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, Springer, 2013. A PDF version of this textbook is available at http://www-bcf.usc.edu/~gareth/ISL/.

All readings are recommended (not required) and will complement the lecture material. Most of the time, the readings will be in a different style and will cover more material than we will be able to cover in lecture. Depending on your learning style, you may find it helpful to complete the readings either before or after the corresponding lecture.

Some other supplementary texts are:

• The Analytics Edge by Dimitris Bertsimas, Allison K. O'Hair and William R. Pulleyblank, Dynamic Ideas LLC, 2016.

• The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani and Jerome Friedman, Springer, 2009. A PDF version of this textbook is available at https://statweb.stanford.edu/~tibs/ElemStatLearn/. This is an advanced textbook and goes far beyond the material that we will be covering, but this could be a valuable resource for students with a strong mathematical background.


Software:

The course will primarily teach Python, mainly during the discussion sections/labs. In prior semesters, this course was taught using R. Both languages are quite powerful and flexible for doing machine learning and data science work. Both languages have excellent support for statistical computing and graphics, and the majority of the popular and useful packages in both are free and open source. However, there are some advantages and disadvantages to each in terms of availability of packages, computational efficiency, and ease of writing code.

This is not a programming course; rather, software is a tool for us to work with data, build models, and synthesize our results. You are allowed to use any language/software that you want for the homework assignments, midterm, and final project. However, we recommend Python as it is the language that will be taught in the discussion labs and is the only language that is officially supported by the course staff. Prior experience in Python is not assumed. If and when you run into a programming issue, it is highly likely that someone else has run into the same issue in the past. Therefore, we recommend first searching for your problem on Google, Stack Overflow, etc. If you cannot in earnest find a solution online, then we suggest making a post on Piazza.

Instructions will be given during Lab 1 regarding how to get set up in Python. If you are interested in trying out or using R, we recommend that you download both R and RStudio:

https://cran.r-project.org

https://www.rstudio.com

Side Note: My personal preference is to do visualizations and data manipulation in R. A nice reference for doing this type of work in R is:

• R for Data Science: Import, Tidy, Transform, Visualize, and Model Data by Hadley Wickham and Garrett Grolemund, O'Reilly Media, 2017. An online version of this book is available at http://r4ds.had.co.nz/.


Course Objectives:

1. To expose students to a variety of statistical learning methods, all of which are relevant and useful in a wide range of disciplines and applications.

2. To carefully present the statistical and computational assumptions, trade-offs, and intuition underlying each method discussed so that students will be trained to determine which techniques are most appropriate for a given problem.

3. Through a series of real-world examples, students will learn to identify opportunities to leverage the capabilities of data analytics and will see how data analytics can provide a competitive edge for companies.

4. To train students in how to actually apply each method that is discussed in class, through a series of labs and programming exercises.

5. For students to gain some project-based practical data science experience, which involves identifying a relevant problem to be solved or question to be answered, gathering and cleaning data, and applying analytical techniques.

6. To introduce students to advanced topics that are important to the successful application of machine learning methods in practice, including how methods for prediction are integrated with optimization models and modern optimization techniques for large-scale learning problems.


Grading:

There will be a final team project, approximately 5-6 homework assignments, and a midterm held on Gradescope. Grades for the course will be composed as follows:

1. Individual homework assignments: 40%

2. Final project: 35%

3. Midterm exam: 25%
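As an illustration only, the weighting above can be sketched in Python as a weighted average. This sketch assumes each component is scored on a 0-100 scale and that no curve or other adjustment is applied; the syllabus does not specify either point, and the names WEIGHTS_PERCENT and course_score are hypothetical.

```python
# Component weights in percent, taken from the grading breakdown above.
WEIGHTS_PERCENT = {"homework": 40, "project": 35, "midterm": 25}


def course_score(scores):
    """Weighted average of component scores.

    Assumption: each score is on a 0-100 scale and no curve is applied.
    """
    total = sum(WEIGHTS_PERCENT[name] * scores[name] for name in WEIGHTS_PERCENT)
    return total / 100


# Hypothetical scores, for illustration only.
example = {"homework": 90, "project": 80, "midterm": 70}
print(course_score(example))
```

With these hypothetical scores, the homework average contributes the most to the result because it carries the largest weight.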


Assignments:

There will be about five or six individual homework assignments assigned during the semester. The tentative homework schedule will be posted on bCourses, and all assignments will be turned in using Gradescope. We will adopt the following slip day policy for late homework submissions:

1. You have a total of 5 slip days that you can use throughout the semester.

2. You can turn in any homework assignment late with no penalty, subject to maintaining your budget of 5 total slip days.

3. Specifically, turning in an assignment 0-24 hours late uses 1 slip day, 24-48 hours late uses 2 slip days, etc.

4. Once you use all 5 slip days, then we will no longer accept late submissions except under extreme circumstances that usually require documentation.
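The slip day accounting above can be sketched in Python as follows. This is an informal illustration, not an official calculator. It assumes that lateness is rounded up to the next whole day and that a submission exactly 24 hours late uses 1 slip day (the policy does not say which side of the boundary applies). The function names are hypothetical.

```python
import math

TOTAL_SLIP_DAYS = 5  # budget stated in the policy above


def slip_days_used(hours_late):
    """Slip days consumed by one submission that is `hours_late` hours late."""
    if hours_late <= 0:
        return 0  # submitted on time or early
    # Assumption: exactly 24 hours late counts as 1 slip day, not 2.
    return math.ceil(hours_late / 24)


def remaining_slip_days(hours_late_per_assignment):
    """Slip days left after a list of submissions (hours late for each)."""
    used = sum(slip_days_used(h) for h in hours_late_per_assignment)
    return TOTAL_SLIP_DAYS - used


# Hypothetical semester so far: 3 hours late, 30 hours late, on time.
print(remaining_slip_days([3, 30, 0]))
```

In this hypothetical case, the first submission uses 1 slip day and the second uses 2, leaving 2 of the 5 available.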

You are strongly encouraged to begin the homework assignments early as they typically involve a significant amount of coding and data analysis.

All homework assignments are individual work assignments. However, some collaboration is allowed and even encouraged. You may find it helpful to discuss broad concepts and general solution procedures with others. If this is the case, then you are enthusiastically encouraged to do so. The objective here is to learn. However, the final product that you turn in must be done individually – it must be your own product, written in your handwriting or typed up in a computer file of which you are the sole author. Copying another's work or code is not acceptable. For each exercise, you should be able to explain your solution approach after turning in the assignment – if this is not the case, then the learning objectives of the assignment have not been met and you will be at a disadvantage for the midterm and the project.

You are expected to adhere to the UC Berkeley Code of Student Conduct at all times. In particular, please give credit to outside sources that you find helpful in completing the assignments. These include your peers or other people who you discuss your work with, other textbooks, material from other courses, etc. (There is no need to cite the course textbooks, slides, or other materials distributed on bCourses.)


Midterm:

There will be a midterm exam administered via Gradescope on Thursday, March 18. The purpose of the exam is to gauge your understanding of the material taught so far. If you have been properly completing all of the homework assignments prior to the exam, then you will already be quite well prepared. More details on the logistics of the exam will be given in class as the exam date approaches.


Final Project:

In lieu of a final exam, there will be a final project that should be done in teams of five students. The final project provides an opportunity for students to apply analytical methods to a problem in a domain of their choosing.

You will gather (and clean up) data relating to your chosen problem, and use the data analytics techniques discussed in class to solve/answer one or more substantive problems or questions. The project will give students some experience in the kind of work that a data scientist might perform in practice. The requirements for the project are outlined below:

• By Friday, March 12 each team must submit a one-page proposal that outlines a plan to apply analytical methods to a problem you identify using some of the concepts and tools discussed in the course. The proposal should include a description of: (1) the problem, (2) the data that you have or plan to collect to solve the problem, (3) which techniques you plan to use, and (4) the impact or overall goal of the project (if you could build a perfect model, what would it be able to do?). The teaching staff will be available to answer questions, and will provide all students with electronic feedback.

• The final project submission will consist of a written report of at most four pages (not including appendices) that describes the analysis, as well as a 5-minute presentation (in PowerPoint or PDF format) of your project. Project presentations will be recorded and posted on YouTube with a due date of either Thursday, April 29 or during RRR week.

• The four-page report (not including appendices) that describes your analysis is due on the nominal day of the final exam, Wednesday, May 12.


Class Format and Participation:

Lectures will be held remotely via Zoom. Live attendance is not required and we will post video recordings after each lecture. You are encouraged to attend if you are able to but, given the unique constraints of the pandemic, we understand if you are not.

Lecture slides (and any necessary code and data sets) will be posted on the bCourses site prior to lecture if you wish to review them ahead of time. The class schedule (including readings, assignment due dates, and other information) will be periodically updated on bCourses.

Although parts of lecture may be didactic, we will rely upon interactive discussion within the class. In general, questions and comments are encouraged (as long as they are not disruptive). Furthermore, pointing out mistakes and asking “dumb questions” are also encouraged (very often, a large percentage of the class has the same “dumb question”).

Discussion sections/labs, run by the GSIs, will be held via Zoom as well. They will consist of interactive sessions that will cover additional examples of the methods presented in the lectures, and – most importantly – discussion sections will be used to show how to create models in Python. You are encouraged to work interactively on your computer as your GSI runs live coding demos. Discussion section recordings will be posted as well.


Tentative List of Topics (weekly):

See bCourses.