STA302/1001 Methods of Data Analysis 1

Final Project - due April 18, 2021 at 11:59PM EST on Crowdmark


Goal of the Assessment:

The final project is your opportunity to show that you can appropriately use the methods and techniques learned in this course. You will use them to build a multiple linear regression model on a given dataset. The model should have the properties that make it good for predictive and inferential purposes, while remaining easy to understand and not overly complicated. This assessment will also give you some experience in writing a report about your statistical analysis, which is a very common task for practicing statisticians.


Instructions:

Using any and all methods and techniques presented in the lecture slides and videos throughout the term, you are tasked with answering the research question below by creating the ‘best’ multiple linear regression model that meets the requirements of the research question. You will then need to write a report (details below) that introduces the research question, outlines the steps you took in your analysis to reach the ‘best’ model, presents the results of your analysis, describes and justifies the decisions you made, and finally discusses the final model, its interpretation and its limitations in terms of its ability to be easily understood or to provide good inferential results.

You must justify why you decide to work with certain methods, as well as use and interpret the results correctly. You should also ensure that your results and analysis are easily understood by someone with only basic knowledge of statistics. You can again assume that your audience is familiar with the general concepts of hypothesis tests, confidence intervals and p-values, but has no knowledge of linear regression methods. Therefore, for example, if you conduct a hypothesis test on your regression, you should explain what the result of the test means in an intuitive way and in the context of the data itself.


Research Question:

The researcher from Mini Project Part 2 is asking for your help again. They have a new dataset of 1000 California properties that contains the same variables as their first dataset. The researcher now wants to know which combination of property characteristics best explains the variation observed in the median housing value of homes in California. They would like you to help them come up with a multiple linear regression model that includes the predictors that best explain median housing value and that can appropriately be used to predict the median housing value of a new neighborhood. They will also be sharing this model with real estate agents, so the model should be simple enough for them to understand easily. Your proposed model should therefore be complex enough to give good predictions and a good description of the population (with all the right properties), but simple enough to be easily understood. Make sure that the steps of your analysis are clear and justifiable, so that there are no questions about why you chose the model you present over any other possibility.
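As a rough illustration of the kind of model the researcher is asking for, the sketch below fits a multiple linear regression with R's lm() and predicts the value for a new neighborhood. The data frame toy is simulated purely so that the code runs on its own, the names toy, fit, new_area and pred are hypothetical, and the choice of median_income and inland as predictors is an arbitrary assumption. It does not suggest which predictors belong in your final model.

```r
# Simulated stand-in data (NOT the real housing.csv). Column names mirror the
# variable list in the Dataset section; the values are made up for illustration.
set.seed(302)
n <- 200
toy <- data.frame(
  median_income = runif(n, 0.5, 10),
  inland        = rbinom(n, 1, 0.3)
)
toy$median_house_value <- 60000 + 40000 * toy$median_income -
  50000 * toy$inland + rnorm(n, sd = 30000)

# Fit a multiple linear regression with two predictors (arbitrary choice).
fit <- lm(median_house_value ~ median_income + inland, data = toy)
print(summary(fit))  # coefficient estimates, t-tests, R-squared, overall F-test

# Predict the median house value for a hypothetical new neighborhood,
# together with a 95% prediction interval.
new_area <- data.frame(median_income = 4, inland = 0)
pred <- predict(fit, newdata = new_area, interval = "prediction", level = 0.95)
print(pred)
```

With the real data, the same calls would be applied to your own sample of 1000 homes, and the set of predictors would come out of your own model-building steps.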


Dataset:

For this project, you will be using the dataset “housing.csv”, which can be found on the Quercus project page. This dataset contains information for 20433 California homes on the following variables:

● X = identifier for each observation

● longitude = the longitude where the home/region is located

● latitude = the latitude where the home/region is located

● housing_median_age = the median age of houses in the area of this home

● total_rooms = total number of rooms in the homes in this area

● total_bedrooms = total number of bedrooms in the homes in this area

● population = population of the area where this home is located

● households = number of households in the area where this home is located

● median_income = the median income of households in the area (in tens of thousands of dollars)

● median_house_value = the median house value in the area where this home is located

● near_bay = indicator of whether the home/region is located near a bay

● near_ocean = indicator of whether the home/region is located near the ocean

● oneh_ocean = indicator of whether the home/region is located within one-hour drive of the ocean

● inland = indicator of whether the home/region is located inland

Even though the dataset contains more than 20,000 observations, you will only be working with a sample of 1000 of them. You are required to sample 1000 homes from this dataset using the following sample code:


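data <- read.csv("housing.csv")  # load the full dataset

set.seed(1234567)  # replace 1234567 with your student number

rows <- sample(1:nrow(data), 1000, replace=FALSE)  # draw 1000 row indices without replacement

my_sample <- data[rows,]  # replace my_sample with the name you choose for your dataset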


This means that each student will use a unique sample from the dataset. You will therefore need to provide summary statistics/plots to describe each variable in your specific sample of the dataset.
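One possible way to produce such summaries is sketched below. The data frame here is a simulated stand-in with only a few of the columns (an assumption made so that the code runs without the real file), the seed 1234567 is a placeholder, and my_sample is a hypothetical name. Which summaries and plots you report is your own decision.

```r
# Simulated stand-in for housing.csv so the sketch runs on its own; with the
# real file you would use data <- read.csv("housing.csv") instead.
set.seed(1)
N <- 5000
data <- data.frame(
  X                  = 1:N,
  housing_median_age = sample(1:52, N, replace = TRUE),
  median_income      = runif(N, 0.5, 15),
  inland             = rbinom(N, 1, 0.3)
)

# Draw the sample as in the instructions (1234567 is a placeholder seed).
set.seed(1234567)
rows <- sample(1:nrow(data), 1000, replace = FALSE)
my_sample <- data[rows, ]

# Numerical summaries for quantitative variables.
print(summary(my_sample[, c("housing_median_age", "median_income")]))

# Counts for an indicator variable.
print(table(my_sample$inland))

# A labelled histogram for one quantitative variable.
hist(my_sample$median_income,
     main = "Median income in the sample",
     xlab = "Median income (tens of thousands of dollars)")
```

Because the seed fixes the random draw, rerunning the code with the same seed on the same data returns the same 1000 rows, which keeps your reported summaries reproducible.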


How to present your results:

Once you have decided upon the ‘best’ model to fulfill the goal of the project, you must write up a short scientific report. There should be 4 main sections of your report:

● Introduction section: where you introduce the purpose and relevance of the project

● Methods section: where you describe and explain the methods, tools and techniques used to arrive at your final model (but don’t present any results or data yet)

● Results section: where you present a description of your study sample, important results that led you to make crucial decisions in building your model (following the methods you outline in the earlier section), and the final model and any other important results

● Discussion section: where you interpret your final model and describe why it answers the research question and why it is important, as well as discuss any limitations that still exist based on your results.

You may use tables and plots to help present your results, but they must be relevant and well thought out so as to convey as much information as possible without being overwhelming or confusing. When explaining your methods and results, avoid simply stating that you used a specific method; explain why it is the correct tool for the job at hand. See the rubric on the Quercus assignment page for more information regarding the various report components.

If you want more information about how to structure your report and what should be contained in each section, see this cheat sheet and this outline for reports (you may ignore the abstract portion since you do not need one). Note that not all the elements in these resources need to be included in your report. But you can use these to better understand how to structure your submission.

Finally, if you use any external resources outside of the lecture slides, e.g. to give some context about the real estate market, you should include a reference section at the end of your report. You may follow MLA citation styles to help format your references. For some resources on how to cite, see the library page on citations.


Technical Requirements of the Final Report:

Your report should be typed using whatever software you prefer but must be saved and submitted as a PDF file in Crowdmark. Your report must meet the following requirements:

● Font: 12-point font in a style similar to Times New Roman

● Spacing: single-spaced

● Word count: up to a maximum of 1500 words in total (not including captions on figures and tables)

● Number of tables/figures in main report: 5 in total, but you may use any combination of tables and figures

● Figure and table captions: all figures and tables should include a caption that describes what is being presented (captions are not included in the word count).

○ Captions should not contain information that is not also discussed in the main report

● Figure properties:

○ All plots should have appropriate title and axis labels

○ A figure may include multiple individual plots but they should be related to each other and make sense as to why they are being presented together

■ Avoid having too many plots in the same figure to ensure that they are legible and clear.

● Appendix: you should have a two-part appendix at the end of your report:

○ Supplementary tables and figures: up to 3 additional tables/figures but they should only be included if they are relevant to the analysis and are referred to in the main text.

○ R code: a cleaned-up and complete version of the R code that is used to produce your entire report. It should be well-organized and commented appropriately to indicate what each line/section of code is doing.
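The figure and code requirements above can be sketched as follows: one figure holding two related plots, each with a title and axis labels, produced by commented code. The data are simulated (an assumption so that the block runs on its own), the names toy, fit and old_par are hypothetical, and the two residual plots are only one example of panels that make sense together.

```r
# Simulated stand-in data (not the real housing.csv) so the sketch runs alone.
set.seed(302)
n <- 200
toy <- data.frame(median_income = runif(n, 0.5, 10))
toy$median_house_value <- 60000 + 40000 * toy$median_income +
  rnorm(n, sd = 30000)
fit <- lm(median_house_value ~ median_income, data = toy)

# One figure, two related panels: both are residual checks for the same model.
old_par <- par(mfrow = c(1, 2))

# Panel 1: residuals against fitted values, with a reference line at zero.
plot(fitted(fit), resid(fit),
     main = "Residuals vs fitted values",
     xlab = "Fitted median house value", ylab = "Residual")
abline(h = 0, lty = 2)

# Panel 2: normal Q-Q plot of the residuals.
qqnorm(resid(fit), main = "Normal Q-Q plot of residuals",
       xlab = "Theoretical quantiles", ylab = "Sample quantiles")
qqline(resid(fit))

par(old_par)  # restore the previous plotting layout
```

Keeping to two panels per figure leaves each plot large enough to read, and the whole figure counts once toward the table/figure limit.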