COMP 4442: Advanced Probability and Statistics for Data Science
About This Course
This course has Introduction to Probability and Statistics for Data Science as a prerequisite. It should be taken shortly after Probability and Statistics 1. In addition to gaining mastery of specific statistical methods, students will use their programming skills and growing statistical knowledge to design analyses appropriate to research questions on a variety of data sets.
This is a required course for the Data Science Master's Degree.
Students in this course will
● perform ANOVA analyses.
● fit multiple regression models.
● apply model selection methods and criteria including
o Lasso
o Ridge
o AIC, BIC
o likelihood ratio test, ANOVA test
o Train, validate, test
o Cross-validation
● apply Linear Discriminant Analysis.
● apply Principal Component Analysis.
● apply logistic, multinomial, Poisson and negative binomial regression.
● use bootstrap analyses.
● derive sound theoretical footing for the methods, where practical, to guide them as they
o check data requirements,
o apply method diagnostics,
o interpret results.
● report results of analyses of real-world data using course methods for a variety of audiences.
● use advanced capabilities of R, a widely used programming language for statistical analysis.
Course Meeting Times
The lecture is held 5:00-6:50 Mondays and Wednesdays. The initial lectures will be on Zoom. Meetings can be accessed from the Zoom tab in Canvas.
Contacts
Instructor: Cathy Durso
Office hours: TBA; also available by email and by appointment.
TA: Cassia Anton
Office hours: TBA
Texts
Hastie, T., Tibshirani, R., Friedman, J. (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction, ed. 2, Springer
● Download from https://web.stanford.edu/~hastie/Papers/ESLII.pdf
● hardcover ISBN 978-0-387-84857-0
● ebook ISBN 978-0-387-84858-7
This text is a standard reference in statistical learning. Its presentations include a relatively high level of mathematical detail. The lectures will provide guidance on reading material at this level.
Crawley, M. J. (2014) Statistics: An Introduction Using R, ed. 2, Wiley
ISBN: 978-1-118-94109-6 (An e-book or hard copy will work.)
This text provides a practical introduction to statistical analyses implemented in the R programming language. It does assume a familiarity with data analysis needs. The lectures will provide orientation to these needs, as well as supplementation of the somewhat terse theoretical discussions.
Course Organization
● Problem sets will typically be assigned on Thursday and due the following Wednesday.
● Except under extraordinary circumstances, late work will not be accepted.
Technology
You will need a good internet connection and a laptop that meets DU specifications. (See http://www.du.edu/uts/laptops/specs.html .) Course announcements and records of assignments will be maintained through Canvas https://canvas.du.edu/login . For technical support in using Canvas, please go to http://otl.du.edu/knowledgebase/the-canvas-help-menu/ .
The programming assignments will be completed using the R programming language, which may be downloaded from http://cran.r-project.org/ .
The IDE RStudio works well with R and makes use of R markdown particularly simple. RStudio may be downloaded from https://www.rstudio.com/products/rstudio/download/ after you have downloaded R. The free RStudio desktop version is suitable for this course.
Participation in the Zoom lectures requires a good internet connection. Audio can be by telephone, computer audio, or with a headset and microphone.
Grading
Grades in this course will be calculated as follows:
● Problem set solutions 30% total
● Midterm 20%
● Final Project 30%
● Final Exam 20%
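As an illustration of how these weights combine, the following minimal R sketch computes a weighted course score. It assumes each component is scored on a 0-100 scale, which the syllabus does not state; the scores and variable names are made up for the example and do not describe the instructor's actual grade calculation.

```r
# Component weights as listed above (they sum to 1).
weights <- c(problem_sets = 0.30, midterm = 0.20,
             final_project = 0.30, final_exam = 0.20)

# Hypothetical component scores on an assumed 0-100 scale.
scores <- c(problem_sets = 90, midterm = 80,
            final_project = 85, final_exam = 75)

# Sanity check: the weights account for the whole grade.
stopifnot(isTRUE(all.equal(sum(weights), 1)))

# Weighted average; indexing by name keeps the components aligned.
course_score <- sum(weights * scores[names(weights)])
course_score  # 83.5
```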
Collaboration and Academic Honesty
When you turn in work in this course, you are implicitly agreeing that you have followed the rules for collaboration set forth for that assignment. In general,
● The midterm and final exam must be your own work.
● For problem sets, you may consult with the instructor. You may work individually or in a pair. You may consult with other students, but should credit them. Do not do web searches for solutions.
● For the final project, you may work individually or in a pair.
Students will abide by the honor code.
Guidelines for Problem Sets
Problem sets will typically consist of data analyses. Students should turn in a .doc, .docx, or .pdf discussion of the results, the .R or .Rmd file of the code used to obtain the results, and the .RData workspace in which the results were calculated if necessary for the particular assignment. Students are strongly encouraged to use R markdown to generate the discussion of the results.
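One way to produce the .RData workspace file is sketched below in base R. The model, the object names, and the file name problem_set_01.RData are hypothetical, and mtcars is a data set built into R; the sketch saves only the objects a write-up would rely on and then confirms that they restore into a clean environment.

```r
# Hypothetical analysis objects, built from R's built-in mtcars data.
fit <- lm(mpg ~ wt, data = mtcars)
coef_table <- summary(fit)$coefficients

# Save the named objects to an .RData file (hypothetical file name,
# written to a temporary directory for this example).
rdata_path <- file.path(tempdir(), "problem_set_01.RData")
save(fit, coef_table, file = rdata_path)

# Load into a fresh environment to confirm the file is self-contained.
check_env <- new.env()
restored <- load(rdata_path, envir = check_env)
restored  # names of the objects that were restored
```

Calling save.image() instead would store every object in the current workspace, which may be preferable when an assignment asks for the full workspace.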
2021-03-30