STA302 Winter 2026 Final Project Part 1

Research Proposal and Data Introduction

Due: February 6, 2026, by 11:00PM ET

Goal of the Assessment:

•       To have the opportunity to work on a topic of interest to them and to be creative about this topic.

•       To experience the process of conducting a small literature review and incorporating knowledge gained into analysis.

•       To think about whether a research question and/or a dataset is appropriate for use with linear regression.

•       To create a draft of the components to be included in an introduction section of a report, as well as summary figures and/or tables for results section.

Learning Outcomes being Assessed:

•       Apply multiple linear models on various datasets using R statistical software.

•       Differentiate the relationships modelled using qualitative predictors, interactions between predictors, and continuous predictors.

•       Create appropriate residuals plots to evaluate model assumptions for a given data set using software.

•       Recognize distinct patterns in appropriate residual plots and correctly conclude which assumption is violated.

•       Report the results of a residual plot analysis and recommend a course of action.

Instruction Summary:

1.    Locate open-source data in an area of interest to the group that meets the data requirements listed below. Some examples could be (but are certainly not limited to) sports, medicine, public health, economics, video games, literature, etc. Students/groups will also need to argue for why their dataset is suitable to be used with a linear regression model.

2.    Define an explicit research question using the information in that dataset. Note that students/groups will need to argue for why linear regression is appropriate to answer this question with this dataset.

3.    Locate three peer-reviewed academic papers related to the specific research question or topic of interest. Students/groups will need to describe how each article relates back to their proposed research question.

4.   Select at least 5 variables from the dataset to be predictors in a preliminary multiple linear regression model, with at least one of these five being categorical in nature. These predictors must have been mentioned and summarized in the three academic papers above. The model will then be fit and a complete residual analysis to assess model assumptions will be done.

5.    Provide a table that numerically summarizes each variable used in their preliminary model, with an informative caption that highlights any interesting features of the variables (e.g., skews, possible outliers or nonsensical observations, high spread, missing values).
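
Step 4 above can be sketched in R as follows. This is only a minimal illustration on simulated data, not a required template: the data frame `dat` and every variable name in it (`y`, `x1` to `x4`, `group`) are hypothetical placeholders for the variables a group would select from its own dataset.

```r
# Simulated stand-in for a real dataset; all names here are hypothetical.
set.seed(302)
n <- 1000
dat <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = runif(n),
  x4 = rexp(n),
  group = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
dat$y <- 1 + 2 * dat$x1 - dat$x2 + 0.5 * (dat$group == "B") + rnorm(n)

# Preliminary multiple linear regression: four numeric predictors plus one
# categorical predictor. lm() expands the factor into indicator variables,
# using the first level ("A") as the reference category.
fit <- lm(y ~ x1 + x2 + x3 + x4 + group, data = dat)
summary(fit)
```

Because `group` has three levels, the fitted model has seven coefficients: the intercept, four slopes, and two indicator terms (`groupB`, `groupC`), even though `group` counts as a single predictor.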

Dataset Requirements:

o Dataset must be open-source and the website where it was found/downloaded from must be provided.

o MUST contain at least 1000 observations (i.e., rows).

o  MUST contain 1 response variable suitable for linear regression and at least 9 predictor variables, one of which must be categorical. Categorical variables with multiple levels count as 1 variable here.

o Since at least one predictor will need to be categorical, you may convert one of your numerical variables to categorical if no such variable is available in your downloaded dataset. However, you will need to justify your choice of variable and categorization in the proposal.

o Should NOT be from an educational resource, such as a textbook dataset. If you’re not sure, please ask the instructor or one of the TAs.

o Should NOT be one of the following datasets: Boston Housing dataset or Red Wine Quality dataset.

o If the dataset was found in a data repository (e.g., Kaggle, UCI Repository, etc.), you MUST ensure that your research question is novel and different from the original usage of the data.
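
The size and variable-type requirements above can be checked with a few lines of R. The sketch below assumes a hypothetical data frame `dat` with a response column named `response`; the quartile-based categorization is just one possible choice and would still need to be justified in the proposal.

```r
# Hypothetical example data; replace `dat` with the downloaded dataset.
# Unnamed matrix columns become X1, ..., X9 inside data.frame().
set.seed(1)
dat <- data.frame(response = rnorm(1200), matrix(rnorm(1200 * 9), ncol = 9))

response_name <- "response"          # assumed name of the response column
predictors <- setdiff(names(dat), response_name)

nrow(dat) >= 1000                    # at least 1000 observations?
length(predictors) >= 9              # at least 9 predictors?
is_cat <- sapply(dat[predictors], function(v) is.factor(v) || is.character(v))
any(is_cat)                          # is any predictor categorical?

# If no categorical predictor exists, convert one numeric predictor using
# its quartiles as cut points (an illustrative choice only).
breaks <- quantile(dat$X1, probs = c(0, 0.25, 0.5, 0.75, 1))
dat$X1_cat <- cut(dat$X1, breaks = breaks, include.lowest = TRUE,
                  labels = c("Q1", "Q2", "Q3", "Q4"))
table(dat$X1_cat)
```

In this sketch `X1_cat` would be used in place of `X1` in the model, so it still counts as one variable toward the predictor total.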

Proposal Format:

Your group will create a written proposal that should introduce your research question and data, summarize existing knowledge in that area, fit a preliminary model based on the existing knowledge, and conduct a residual analysis of the model. The proposal must include the following sections and must not exceed the word count in each case:

Contributions: each group member’s name is listed and a description of their contribution to the proposal is outlined (this does not count towards the word limit).

Introduction (350 words): introduce the relevance/importance of the topic, state the research question of interest, summarize the results of three peer-reviewed research papers with a focus on their connection to the research question, and describe why linear regression is a suitable statistical tool to answer the research question.

o i.e., why should someone be interested in your project, what are you trying to answer, what is already known about this question, and why should you use linear regression?

Data description (300 words): state where the data was found, explain how the data was originally collected (not how you found the data but how the original curator of the data collected it), describe the response variable (both statistically and with a written description of what it measures and why it meets the requirements for use in a linear model), summarize numerically or graphically (in a single figure/table) each predictor in your dataset that will be used in the preliminary model, and interpret the descriptive statistics in the context of what the predictors measure and how they relate to the research question.

o NOTE: if you had to convert a numerical predictor to a categorical predictor to meet the data requirements, you must justify your choice and the chosen categories in this section.
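
One way to build the single summary table described above is sketched here in base R. The data frame `dat` and its columns (`age`, `income`, `region`) are hypothetical, and the statistics shown are only a suggestion of what such a table might report.

```r
# Hypothetical data with one missing value; names are placeholders.
set.seed(2)
dat <- data.frame(
  age    = round(rnorm(1000, 40, 10)),
  income = rexp(1000, rate = 1 / 50000),
  region = factor(sample(c("East", "West"), 1000, replace = TRUE))
)
dat$income[5] <- NA

# One row per numeric variable: centre, spread, range, and missing count.
num_vars <- names(dat)[sapply(dat, is.numeric)]
summ <- t(sapply(dat[num_vars], function(v) c(
  mean    = mean(v, na.rm = TRUE),
  sd      = sd(v, na.rm = TRUE),
  min     = min(v, na.rm = TRUE),
  median  = median(v, na.rm = TRUE),
  max     = max(v, na.rm = TRUE),
  missing = sum(is.na(v))
)))
round(summ, 2)

# Categorical predictors are summarized by level counts instead.
table(dat$region, useNA = "ifany")
```

In an Rmd file, a matrix like `summ` could be passed to `knitr::kable()` with its `caption` argument to produce a captioned table. A mean well above the median, as expected for the simulated `income` here, is the kind of skew the caption should point out.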

Preliminary results (300 words): fit a preliminary model using 5 predictors noted in the literature, and conduct a full analysis of the linear regression assumptions, noting any violations and what led to your conclusions. Discuss whether your preliminary model results are similar to or different from results in the literature and why.

o NOTE: Place residual plots into the document in a grid (i.e., 2-3 plots placed horizontally in a single figure) so that multiple plots will display in a single figure for improved readability (see Resources below).
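
A grid of residual plots like the one described in the note can be produced with base R graphics by setting `par(mfrow = ...)`. The sketch below uses simulated data and a hypothetical two-predictor model purely for illustration; a full analysis would include a residual-versus-predictor plot for each predictor in the model.

```r
# Simulated data and model; all names are hypothetical.
set.seed(3)
dat <- data.frame(x1 = rnorm(1000), x2 = runif(1000))
dat$y <- 1 + 2 * dat$x1 + 3 * dat$x2 + rnorm(1000)
fit <- lm(y ~ x1 + x2, data = dat)

res <- resid(fit)   # residuals
fv  <- fitted(fit)  # fitted values

# Three panels side by side in a single figure.
old_par <- par(mfrow = c(1, 3))
plot(fv, res, xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)
plot(dat$x1, res, xlab = "x1", ylab = "Residuals",
     main = "Residuals vs x1")
abline(h = 0, lty = 2)
qqnorm(res, main = "Normal Q-Q plot")
qqline(res)
par(old_par)  # restore the previous plotting layout
```

Curvature in the first two panels would suggest a linearity problem, a fanning pattern would suggest non-constant variance, and systematic departures from the line in the Q-Q plot would suggest non-normal errors.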

Bibliography: an appropriately formatted list of resources and literature cited in the proposal (not included in word count). APA format is acceptable.

What to Submit:

Only ONE member of the group should submit ALL required submission components. A complete submission to Quercus will include:

✓   Your group’s completed Group Teamwork Agreement, saved as a PDF.

✓   The completed proposal, saved as a PDF.

✓   The Rmd file containing the code used to subset and clean the data, fit the model, produce a summary table, and conduct the residual analysis for checking assumptions.

✓   The original and cleaned (where appropriate) datasets as CSV files, uploaded to a cloud-based storage service (e.g., OneDrive), with the shareable link included as a submission comment on Quercus.

Failure to meet these submission requirements, including incorrect format of components, missing components, and cloud links that do not allow shared access, will result in a one-mark deduction on the grade of the proposal.