COMP 4442: Advanced Probability and Statistics for Data Science

About This Course

This course has Introduction to Probability and Statistics for Data Science as a prerequisite. It should be taken shortly after Probability and Statistics 1. In addition to gaining mastery of specific statistical methods, students will use their programming skills and growing statistical knowledge to design analyses appropriate to research questions on a variety of data sets.

This is a required course for the Data Science Master's Degree.

 

Students in this course will

● perform ANOVA analyses.

● fit multiple regression models.

● apply model selection methods and criteria including

o Lasso

o Ridge

o AIC, BIC

o likelihood ratio test, ANOVA test

o train, validate, test

o cross-validation

● apply Linear Discriminant Analysis.

● apply Principal Component Analysis.

● apply logistic, multinomial, Poisson and negative binomial regression.

● use bootstrap analyses.

● derive sound theoretical footing for the methods, where practical, to guide them as they

o check data requirements,

o apply method diagnostics,

o interpret results.

● report results of analyses of real-world data using course methods for a variety of audiences.

● use advanced capabilities of R, a widely used programming language for statistical analysis.

 


Course Meeting Times

The lecture is held 5:00-6:50 Mondays and Wednesdays. The initial lectures will be on Zoom. Meetings can be accessed from the Zoom tab in Canvas.

Contacts

Instructor: Cathy Durso

Office hours: TBA, email, and by appointment.

[email protected]

 

 

TA: Cassia Anton

Office hours: TBA

cassia.anton@du.edu

 


Texts

Hastie, T., Tibshirani, R., Friedman, J. (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction, ed. 2, Springer

● Download from https://web.stanford.edu/~hastie/Papers/ESLII.pdf  

● hardcover ISBN 978-0-387-84857-0

● ebook ISBN 978-0-387-84858-7

 

This text is a standard reference in statistical learning. Its presentations include a relatively high level of mathematical detail. The lectures will provide guidance on reading material at this level.

 

Crawley, M. J. (2015) Statistics: An Introduction Using R, ed. 2, Wiley

ISBN: 978-1-118-94109-6 (An e-book or hard copy will work.)

Text website   

 

This text provides a practical introduction to statistical analyses implemented in the R programming language. It does assume a familiarity with data analysis needs. The lectures will provide orientation to these needs, as well as supplementation of the somewhat terse theoretical discussions.

 


Course Organization

● Problem sets will typically be assigned on Thursday and due the following Wednesday.

● Except under extraordinary circumstances, late work will not be accepted.


Technology

You will need a good internet connection and a laptop that meets DU specifications (see http://www.du.edu/uts/laptops/specs.html ). Course announcements and records of assignments will be maintained through Canvas ( https://canvas.du.edu/login ). For technical support in using Canvas, please go to http://otl.du.edu/knowledgebase/the-canvas-help-menu/ .

The programming assignments will be completed using the R programming language, which may be downloaded from http://cran.r-project.org/ .

The RStudio IDE works well with R and makes using R Markdown particularly simple. RStudio may be downloaded from https://www.rstudio.com/products/rstudio/download/ after you have downloaded R. The free RStudio Desktop version is suitable for this course.

Participation in the Zoom lectures requires a good internet connection. Audio can be by telephone, computer audio, or with a headset and microphone.

 


Grading

Grades in this course will be calculated as follows:

● Problem set solutions   30% total

● Midterm                      20%

● Final Project                30%

● Final Exam                  20%


Collaboration and Academic Honesty

When you turn in work in this course, you are implicitly agreeing that you have followed the rules for collaboration set forth for that assignment. In general,

● The midterm and final exam must be your own work.

● For problem sets, you may consult with the instructor. You may work individually or in a pair. You may consult with other students, but should credit them. Do not do web searches for solutions.

● For the final project, you may work individually or in a pair.

Students will abide by the honor code.


Guidelines for Problem Sets

Problem sets will typically consist of data analyses. Students should turn in a .doc, .docx, or .pdf discussion of the results, the .R or .Rmd file of the code used to obtain the results, and, if necessary for the particular assignment, the .RData workspace in which the results were calculated. Students are strongly encouraged to use R Markdown to generate the discussion of the results.