Final Project Instructions

MATH338/STATS438, Spring 2021



Goal. The goal of the final project is to apply what you’ve learned in this course to conduct a statistical analysis. It should be an in-depth regression analysis of a question that interests you. This question may come from one of your other courses, your research interests, your future career interests, etc.

Final project outline. Find a dataset of your choice. Carry out a regression analysis, taking advantage of what we have learned so far in this course. Write a report, as an R markdown (Rmd) file (preferred). Data can be read by the Rmd file directly from an internet source, copied into the Rmd file, or submitted as an additional file zipped up together with the Rmd file. The deadline is 5pm ET on Wednesday May 5, 2021.

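If you do not plan to use R markdown, your submission must still include the equivalent set of files: your report, your source code, and your data (unless the code reads the data directly from an internet source).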
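The three ways of supplying data mentioned in the outline above might look roughly like this inside a code chunk. This is only a sketch: the URL and the file name `mydata.csv` are placeholders, not real resources, and the tiny inline table is made up for illustration.

```r
# Option 1: read directly from an internet source.
# The URL below is a placeholder, not a real dataset.
# dat <- read.csv("https://example.com/path/to/data.csv")

# Option 2: data copied into the Rmd file itself (made-up values).
dat_inline <- read.csv(text = "height,weight,group
150,52,A
163,61,B
171,70,A", stringsAsFactors = TRUE)

# Option 3: an additional file zipped up with the Rmd file.
# A relative path (hypothetical file name) keeps the report portable.
# dat <- read.csv("mydata.csv")

str(dat_inline)
```

Whichever option you use, a relative path or a URL is usually safer than an absolute path on your own computer, since the report has to run on someone else's machine.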


Choice of data. Choose a dataset that

interests you,

can be loaded into RStudio,

allows multiple main effects and interactions to be explored in your model,

has at least 100 observations and at least 10 variables,

includes both quantitative and categorical variables.
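One way to check a candidate dataset against the size and variable-type criteria above is sketched below. The data frame `dat` is simulated purely as a stand-in for your own data, and all names are hypothetical.

```r
# Simulated stand-in for your own dataset: 9 numeric columns plus 1 factor.
set.seed(1)
dat <- data.frame(matrix(rnorm(120 * 9), nrow = 120))
dat$group <- factor(sample(c("a", "b", "c"), 120, replace = TRUE))

n_obs  <- nrow(dat)   # number of observations
n_vars <- ncol(dat)   # number of variables

# Classify each column as quantitative or categorical.
is_quant <- sapply(dat, is.numeric)
is_categ <- sapply(dat, function(x) is.factor(x) || is.character(x))

c(observations = n_obs, variables = n_vars,
  quantitative = sum(is_quant), categorical = sum(is_categ))
```

Note that a categorical variable coded with numbers (for example 0/1) will be counted as quantitative here until you convert it with `factor()`.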

The dataset should ideally have at least 100 data points. You can have fewer if your interests demand it. Smaller datasets need additional care, since model diagnostics and asymptotic approximations become more delicate when there are few observations. If your data have more than, say, 1000 data points, you can subsample if working with the full dataset becomes a problem.
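If you do need to subsample, a simple random subsample of rows is one option. The sketch below uses a simulated data frame `big` and a target size of 1000, both chosen only for illustration.

```r
set.seed(2021)  # fix the seed so the same subsample is drawn every time

# Simulated stand-in for a large dataset.
big <- data.frame(id = 1:5000, x = rnorm(5000))

# Draw 1000 row indices without replacement and keep those rows.
keep  <- sample(nrow(big), size = 1000)
small <- big[keep, ]

nrow(small)
```

Setting the seed matters here: without it, each knit of the report would analyze a different subsample and your reported numbers would change.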

You are not permitted to reuse datasets used in examples/homework/labs in class.


The report should contain

Section 1: Introduction

– This should include your research question, hypotheses, and a description of the data. It should also include the exploratory data analysis.

Section 2: Regression Analysis

– This section includes the results of your final regression model. In addition to displaying the model output, you should include a brief description of why you chose that type of model and any interpretations or interesting findings from the coefficients. You should also discuss the model assumptions, the model fit, and the model diagnostics.

Section 3: Discussion & Limitations

– This section should include any relevant predictions and/or conclusions drawn from the model. Also critique your own methods and provide suggestions for improving your analysis. Issues pertaining to the reliability and validity of your data and appropriateness of the regression analysis should also be discussed here. A paragraph on what you would do differently if you were able to start over with the project or what you would do next if you were going to continue work on the project should also be included.

Section 4: Conclusion

– In this section, you should summarize your project and highlight any final points you wish the reader to get from the project.

Section 5: Additional Work

– This section should include any other models you tried, a check of their assumptions, and a brief explanation of why you did not select each of them.
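As a minimal sketch of the kind of work behind Sections 2 and 5, the chunk below fits a model with an interaction, draws the standard diagnostic plots, and compares it with a simpler alternative. The data are simulated, and the names `dat`, `y`, `x1`, and `group` are placeholders rather than part of the assignment; the choices in your own analysis should be justified by your data and question.

```r
# Simulated data: one quantitative and one categorical predictor.
set.seed(338)
n <- 200
dat <- data.frame(
  x1    = rnorm(n),
  group = factor(sample(c("control", "treated"), n, replace = TRUE))
)
treated <- as.numeric(dat$group == "treated")
dat$y <- 1 + 2 * dat$x1 + 0.5 * treated + 1.5 * dat$x1 * treated + rnorm(n)

# Two candidate models: main effects only, and main effects plus interaction.
fit_main <- lm(y ~ x1 + group, data = dat)
fit_int  <- lm(y ~ x1 * group, data = dat)

# Model output for the candidate final model.
summary(fit_int)

# Standard diagnostic plots: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage.
par(mfrow = c(2, 2))
plot(fit_int)

# Compare the nested models with an F-test and with AIC.
anova(fit_main, fit_int)
AIC(fit_main, fit_int)
```

In the report itself, the model you keep would go in Section 2 with its interpretation and diagnostics, while the model you rejected, and the reason for rejecting it, would go in Section 5.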

Before you finalize your write-up, make sure the code in your chunks is hidden by including echo = FALSE in the header of each code chunk. This hides the R code in the knitted report; the code itself remains in the .Rmd file that you submit.
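For illustration, a chunk header with this option might look like the fragment below. The chunk label `model-fit` and the objects `y`, `x1`, and `dat` are hypothetical.

````
```{r model-fit, echo = FALSE}
fit <- lm(y ~ x1, data = dat)
summary(fit)
```
````

With echo = FALSE the chunk still runs and its output (here, the model summary) still appears in the knitted report; only the code is hidden.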

Expectations for the report. The report will be graded on the following categories.

Communicating your data analysis. [1/3 of the total points]

Raising a question. You should explain some background to the data you chose, and give motivation for the reader to appreciate the purpose of your data analysis.

Reaching a conclusion. You should say what you have concluded about your question(s).

You will submit your source code, but you should not expect the reader to study it. If the reader has to study the source code, your report probably has not explained well enough what you were doing.

Statistical methodology. [1/3 of the total points]

Justify your choices for the statistical methodology.

The models and methods you use should be fully explained, either by references or within your report.

Focus on a few, carefully explained and justified, figures, tables, statistics and hypothesis tests. You may want to try many things, but only write up evidence supporting how the data help you to get from your question to your conclusions. Value the reader’s time: you may lose points for including material that is of borderline relevance, or that is not adequately explained and motivated.

Scholarship. [1/3 of the total points]

Your report should make references where appropriate. For a well-written report the citations should be clearly linked to the material. The reader should not have to do detective work to figure out what assertion is linked to what reference.

You should properly acknowledge any sources (people or documents or internet sites) that contributed to your project.

When using a reference to point the reader to descriptions elsewhere, you should provide a brief summary in your own report to make it self-contained.

Some resources. Here are some resources with plenty of types of data.

The Data and Story Library (DASL)

UCI Machine Learning Repository

Other resources such as articles, books, the internet, ...


Plagiarism. If material is taken directly from another source, that source must be cited and the copied material clearly attributed to the source, for example by the use of quotation marks. Failing to do this is plagiarism and will, at a minimum, result in zero credit for the scholarship category and the section of the report in which the plagiarism occurs. Further discussion of plagiarism can be found in On Being a Scientist: A Guide to Responsible Conduct in Research: Third edition (2009), by The National Academies Press. Here is how the Lehigh University - Student Code of Conduct describes plagiarism:

ARTICLE III – Expectations of Conduct – I. ACADEMIC INTEGRITY – B
Plagiarism. This includes but is not limited to:

1. The direct use or paraphrase, of the work, themes or ideas, of another person without full and clear acknowledgement.

2. Submitting the work of another as one’s own in any assignment (including papers, tests, labs, homework, computer assignments, or any other work that is evaluated by the instructor).