STA302 Winter 2026 Final Project Part 1
Research Proposal and Data Introduction
Due: February 6, 2026, by 11:00PM ET
Goal of the Assessment:
• To have the opportunity to work on a topic of interest to them and to be creative about this topic.
• To experience the process of conducting a small literature review and incorporating knowledge gained into analysis.
• To think about whether a research question and/or a dataset is appropriate for use with linear regression.
• To create a draft of the components to be included in an introduction section of a report, as well as summary figures and/or tables for results section.
Learning Outcomes being Assessed:
• Apply multiple linear models on various datasets using R statistical software.
• Differentiate the relationships modelled using qualitative predictors, interactions between predictors, and continuous predictors.
• Create appropriate residual plots to evaluate model assumptions for a given data set using software.
• Recognize distinct patterns in appropriate residual plots and correctly conclude which assumption is violated.
• Report the results of a residual plot analysis and recommend a course of action.
Instruction Summary:
1. Locate open-source data in an area of interest to the group that meets the data requirements listed below. Some examples could be (but are certainly not limited to) sports, medicine, public health, economics, video games, literature, etc. Students/groups will also need to argue for why their dataset is suitable to be used with a linear regression model.
2. Define an explicit research question using the information in that dataset. Note that students/groups will need to argue for why linear regression is appropriate to answer this question with this dataset.
3. Locate three peer-reviewed academic papers related to the specific research question or topic of interest. Students/groups will need to describe how each article relates back to their proposed research question.
4. Select at least 5 variables from the dataset to be predictors in a preliminary multiple linear regression model, with at least one of these five being categorical in nature. These predictors must have been mentioned and summarized in the three academic papers above. The model will then be fit and a complete residual analysis to assess model assumptions will be done.
5. Provide a table that numerically summarizes each variable used in the preliminary model, with an informative caption that highlights any interesting features of the variables (e.g., skews, possible outliers or nonsensical observations, high spread, missing values).
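As an illustration of the numerical summary table in step 5, here is a minimal sketch in base R. The data are simulated and every variable name (`income`, `age`, `hours`, `region`) is hypothetical; a real proposal would use the group's own dataset and would likely format the table for the report.

```r
# Simulated stand-in data; every column name here is hypothetical.
set.seed(302)
n <- 1000
dat <- data.frame(
  income = rlnorm(n, meanlog = 10, sdlog = 0.5),  # right-skewed by construction
  age    = round(runif(n, 18, 80)),
  hours  = rnorm(n, mean = 38, sd = 6),
  region = factor(sample(c("East", "North", "West"), n, replace = TRUE))
)
dat$hours[sample(n, 25)] <- NA  # introduce some missing values

# One row per numeric variable: centre, spread, range, and missingness.
num_vars <- names(dat)[sapply(dat, is.numeric)]
summary_tab <- t(sapply(dat[num_vars], function(x) {
  c(mean    = mean(x, na.rm = TRUE),
    sd      = sd(x, na.rm = TRUE),
    min     = min(x, na.rm = TRUE),
    median  = median(x, na.rm = TRUE),
    max     = max(x, na.rm = TRUE),
    missing = sum(is.na(x)))
}))
print(round(summary_tab, 2))

# Categorical variables are summarized by level counts instead.
print(table(dat$region, useNA = "ifany"))
```

A mean well above the median (as for the simulated `income`) is one sign of right skew, and a non-zero `missing` count is the kind of feature the caption is expected to point out.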
Dataset Requirements:
o Dataset must be open-source and the website where it was found/downloaded from must be provided.
o MUST contain at least 1000 observations (i.e., rows).
o MUST contain 1 response variable suitable for linear regression and at least 9 predictor variables, one of which must be categorical. Categorical variables with multiple levels count as 1 variable here.
o Since at least one predictor will need to be categorical, you may convert one of your numerical variables to categorical if no such variable is available in your downloaded dataset. However, you will need to justify your choice of variable and categorization in the proposal.
o Should NOT be from an educational resource, such as a textbook dataset. If you’re not sure, please ask the instructor or one of the TAs.
o Should NOT be one of the following datasets: Boston Housing dataset or Red Wine Quality dataset.
o If the dataset was found in a data repository (e.g., Kaggle, UCI Repository, etc.), you MUST ensure that your research question is novel and different from the original usage of the data.
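For groups that need to convert a numerical variable into a categorical one (see the requirement above), one possible approach in base R is `cut()`. The sketch below uses a simulated, hypothetical `age` variable, and the cut points are purely illustrative; the actual categories would need to be justified in the proposal.

```r
# Hypothetical numeric variable; the cut points are illustrative only and
# would need a subject-matter justification in the proposal.
set.seed(302)
age <- round(runif(1000, min = 18, max = 80))

age_group <- cut(
  age,
  breaks = c(18, 30, 50, 65, 80),
  labels = c("18-29", "30-49", "50-64", "65-80"),
  right = FALSE,          # intervals are [lower, upper)
  include.lowest = TRUE   # so the final interval also includes 80
)

# Check that every observation was assigned and no level is nearly empty.
print(table(age_group, useNA = "ifany"))
```

Tabulating the new factor right away makes it easy to spot unassigned observations (`NA`) or sparsely populated levels before the variable is used in a model.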
Proposal Format:
Your group will create a written proposal that should introduce your research question and data, summarize existing knowledge in that area, fit a preliminary model based on the existing knowledge, and conduct a residual analysis of the model. The proposal must include the following sections and must not exceed the word count in each case:
o Contributions: each group member’s name is listed and a description of their contribution to the proposal is outlined (this does not count towards the word limit).
o Introduction (350 words): introduce the relevance/importance of the topic, state the research question of interest, summarize the results of three peer-reviewed research papers with a focus on their connection to the research question, and describe why linear regression is a suitable statistical tool to answer the research question.
o i.e., why should someone be interested in your project, what are you trying to answer, what is already known about this question, and why should you use linear regression?
o Data description (300 words): state where the data was found, explain how the data was originally collected (not how you found the data but how the original curator of the data collected it), describe the response variable (both statistically and with a written description of what it measures and why it meets the requirements for use in a linear model), summarize numerically or graphically (in a single figure/table) each predictor in your dataset that will be used in the preliminary model, and interpret the descriptive statistics in the context of what the predictors measure and how it relates to the research question.
o NOTE: if you had to convert a numerical predictor to a categorical predictor to meet the data requirements, you must justify your choice and the chosen categories in this section.
o Preliminary results (300 words): fit a preliminary model using 5 predictors noted in the literature, and conduct a full analysis of the linear regression assumptions, noting any violations and what led to your conclusions. Discuss whether your preliminary model results are similar or different to results in the literature and why.
o NOTE: Place residual plots into the document in a grid (i.e., 2-3 plots placed horizontally in a single figure) so that multiple plots will display in a single figure for improved readability (see Resources below).
o Bibliography: an appropriately formatted list of resources and literature cited in the proposal (not included in word count). APA format is acceptable.
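The preliminary model and the residual plot grid described above could be sketched in base R roughly as follows. This is only an illustration under stated assumptions: the data are simulated, the variable names (`x1` to `x4`, `group`, `y`) and coefficients are hypothetical, and the three plots shown are a subset of what a full residual analysis would examine (residuals against each predictor would normally be checked as well).

```r
# Simulated stand-in data; variable names and coefficients are hypothetical.
set.seed(302)
n <- 1000
dat <- data.frame(
  x1 = rnorm(n), x2 = rnorm(n), x3 = runif(n), x4 = rnorm(n),
  group = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
dat$y <- 2 + 1.5 * dat$x1 - 0.8 * dat$x2 + 3 * dat$x3 +
  0.5 * dat$x4 + ifelse(dat$group == "B", 1, 0) + rnorm(n)

# Four numeric predictors plus one categorical predictor (dummy-coded by lm).
fit <- lm(y ~ x1 + x2 + x3 + x4 + group, data = dat)
res <- resid(fit)
fitted_vals <- fitted(fit)

# Three plots side by side in a single figure.
op <- par(mfrow = c(1, 3))
plot(fitted_vals, res, xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)
qqnorm(res, main = "Normal Q-Q plot")
qqline(res)
boxplot(res ~ dat$group, xlab = "group", ylab = "Residuals",
        main = "Residuals by group")
par(op)  # restore the previous plotting layout
```

Setting `par(mfrow = c(1, 3))` places the plots horizontally in one figure; saving and restoring the old settings with `op` keeps later plots in the same Rmd file from inheriting the grid layout.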
What to Submit:
Only ONE member of the group should submit ALL required submission components. A complete submission to Quercus will include:
✓ Your group’s completed Group Teamwork Agreement, saved as a PDF.
✓ The completed proposal, saved as a PDF.
✓ The Rmd file containing the code used to subset and clean the data, fit the model, produce a summary table, and conduct the residual analysis for checking assumptions.
✓ The original and cleaned (where appropriate) datasets as CSV files, uploaded to a cloud-based storage service (e.g., OneDrive), with the shareable link included as a submission comment on Quercus.
Failure to meet these submission requirements, including incorrect format of components, missing components, and cloud links that do not allow shared access, will result in a one-mark deduction on the grade of the proposal.
2026-01-31