STA303H1S/STA1002HS Final Project
Final Project: The final project is due on August 29th, 2020 11:59 PM EDT and consists of a data analysis on a novel dataset. The deadline will be strictly applied. Under no circumstances can students submit late. Please make sure that you start the submission process early so that your project is graded.
Students will be required to demonstrate their understanding of the methods based on course materials by developing a reasonable regression model using the techniques taught in class. The students will be responsible for choosing the correct methods to apply and providing appropriate justifications defending their choices.
The final project will be done individually, and must be typed and submitted by the stated deadline. The project needs to fulfill the following criteria:
● Font: 12-point font in a style similar to Times New Roman.
● Spacing: single-spaced.
● The word limit for the final project is 1500. This excludes the title page, table/figure captions and appendix.
● Maximum 5 tables/figures will be allowed in the project report. The tables and figures should be relevant, and should convey the purpose of the project. All tables and figures should have captions. You may use any combination of tables and figures.
● Up to 3 additional tables/figures in the appendix, but they should only be included if they are relevant to the analysis and are referred to in the main text.
● You must submit the report in a standard file format (e.g., .doc, .docx or a pdf).
● Please submit your R code file. This can be a .r or a .rmd file. No other file format for the code will be accepted.
In order to pass the course, you must submit the final project.
ACADEMIC INTEGRITY: The University treats cases of plagiarism and cheating very seriously. Students are responsible for knowing the content of the University of Toronto’s Code of Behaviour on Academic Matters. All suspected cases of academic dishonesty will be investigated following procedures outlined in the above document. If you have questions or concerns about what constitutes appropriate academic behaviour or appropriate research and citation methods, you are expected to seek out additional information on academic integrity from your instructor or from other institutional resources (see http://academicintegrity.utoronto.ca/). Here are a few guidelines regarding academic integrity:
● You may consult class notes/lecture slides during the final project. However, sharing or discussing questions or answers with other students is an academic offence.
● Students must complete all assessments individually. Working together is not allowed.
● Paying anyone else to complete your assessments for you is academic misconduct.
● Sharing your answers/work/code with others is academic misconduct.
● Looking up solutions to test/quiz problems online or in textbooks and copying what you find is an academic offence.
● All work that you submit must be your own! You must not copy mathematical derivations, computer output and input, or written answers from anyone or anywhere else. Unacknowledged copying or unauthorized collaboration will lead to severe disciplinary action, beginning with an automatic grade of zero for all involved and escalating from there. Please read the UofT Policy on Cheating and Plagiarism, and don’t plagiarize.
Please do not upload this document or the required dataset to any social media platform, Chegg, Slideshare or Coursehero. Uploading this document to any such website will be treated as a serious academic offence, and we will take action based on the University of Toronto’s policies regarding plagiarism. We will continuously monitor these websites to root out such incidents.
1 Background
Weather forecasting is an excellent application of the methods we have covered in STA303H. Unfortunately, this type of data tends to be complicated, often riddled with missing values and measurement error. Furthermore, since weather characteristics are measured over time, the data may possess a certain correlation structure. Thus, we must take particular care when attempting to find an appropriate model. Our goal for this dataset can be boiled down to one question: will it rain tomorrow? We will test the model’s predictive ability by observing how well it predicts the most recent year in the dataset.
For your analysis, you will be looking at the weather dataset uploaded on Quercus, which contains weather data for 22 cities in Australia, measured from 2007 to 2017. The data was collected at the day level, but not every city has measurements for every given day, and some cities have fewer overall measurements than others. It includes 23 features representing various weather characteristics, such as rainfall, temperature, and humidity; these are all described in detail below. The dataset has 145,460 total observations.
1.1 Variables
We will use the variable RainTomorrow as our outcome. This variable has two categories: “Yes” if the rain for the next day was 1mm or more, and “No” otherwise. There are many covariates in the dataset; they are described in the table below:
| Variable | Description |
|---|---|
| Date | Date in YYYY-MM-DD format |
| Location | City in Australia to which data corresponds |
| MinTemp | Minimum temperature in the 24 hours to 9am (degrees Celsius) |
| MaxTemp | Maximum temperature in the 24 hours to 9am (degrees Celsius) |
| Rainfall | Precipitation in the 24 hours to 9am (millimetres) |
| Evaporation | “Class A” pan evaporation in the 24 hours to 9am (millimetres) |
| Sunshine | Bright sunshine in the 24 hours to midnight (hours) |
| WindGustDir | Direction of strongest gust in the 24 hours to midnight (16 compass points) |
| WindGustSpeed | Speed of strongest wind gust in the 24 hours to midnight (km/h) |
| WindDir | Wind direction averaged over 10 minutes prior to 9am/3pm (compass points) |
| WindSpeed | Wind speed averaged over 10 minutes prior to 9am/3pm (km/h) |
| Humidity | Relative humidity at 9am/3pm (percent) |
| Pressure | Atmospheric pressure reduced to mean sea level at 9am/3pm (hectopascals) |
| Cloud | Fraction of sky obscured by cloud at 9am/3pm (eighths) |
| Temp | Temperature at 9am/3pm (degrees Celsius) |
| RainToday | “Yes” if the rain for that day was 1mm or more; “No” otherwise |
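As an illustration of how these variables might be prepared in R, the following is a minimal sketch, not a prescribed approach. It assumes a tiny made-up data frame in place of the Quercus file; the city name "CityA", the values, and the derived columns Year and RainTomorrowBin are hypothetical.

```r
# Toy stand-in for the weather data; all values are made up for illustration.
weather <- data.frame(
  Date = c("2016-12-30", "2016-12-31", "2017-01-01", "2017-01-02"),
  Location = c("CityA", "CityA", "CityA", "CityA"),  # hypothetical city name
  Rainfall = c(0.0, 2.4, NA, 0.6),
  RainToday = c("No", "Yes", NA, "No"),
  RainTomorrow = c("Yes", NA, "No", "No"),
  stringsAsFactors = FALSE
)

# Parse the YYYY-MM-DD strings into Date objects and extract the year.
weather$Date <- as.Date(weather$Date, format = "%Y-%m-%d")
weather$Year <- as.integer(format(weather$Date, "%Y"))

# Encode the outcome as 1 = "Yes", 0 = "No"; missing outcomes stay NA.
weather$RainTomorrowBin <- as.integer(weather$RainTomorrow == "Yes")

# Count missing values in each column.
missing_counts <- colSums(is.na(weather))
print(missing_counts)
```

Counting missing values per column early on may help when deciding which covariates are usable, since the data are described as having many missing values.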
2 Task
Identify which characteristics are associated with the outcome variable, and determine their effect (with a reasonable degree of confidence). Furthermore, we would prefer to have a model with good predictive ability, and which does not clearly violate any of the model assumptions. To answer this question, you can use any statistical technique that you learned from the course. However, you need to explain your choice, and it should be clear that you tried out many potential candidates before you made your final choice. You should focus on the following aspects:
1. You are encouraged to include interaction terms if you believe this will increase the predictive ability of your model or result in a better fit. However, keep in mind that interactions make the coefficients harder to interpret.
2. Since this is a prediction problem, you should make a test dataset which you will never use for modelling. In this case, we will use the entire year 2017 as our test data. Therefore, one of the goals is to predict 2017 using the training data, and see how this compares to the observed data.
3. You can fit a GLMM, GLM or GAM (or any other method). However, since this is a longitudinal dataset, you need to explain what assumptions you need to make to fit a GLM or any other model which assumes independence. If you use a GLM, then variable selection and prediction become fairly straightforward; neither is trivial for a GLMM. A GLMM is, however, the most appropriate analysis technique for this data, but because the dataset is large, GLMMs may take a long time to fit and may not converge. Thus, you need to properly explain how you chose the modelling technique. If you are unable to perform certain analyses, then state that clearly in the limitations section.
4. Make sure to perform exploratory data analysis (basic summary statistics, plots, etc.) before moving on to the final modelling.
5. Note that you have a word/figure limit, so you may only be able to showcase at most one or two models in your report. However, you should still discuss all the models you considered along the way, and why you decided on that particular final model.
6. You can do some literature review if that helps.
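The year-based hold-out described in point 2 can be sketched in R as follows. This is only an illustration under stated assumptions, not a recommended final model: the data are simulated, the single Humidity column and the relationship used to generate the outcome are hypothetical, the 0.5 cutoff is arbitrary, and the GLM treats observations as independent.

```r
set.seed(303)  # arbitrary seed so the simulated data are reproducible

# Simulated stand-in for the weather data: one row per day for a single city.
dates <- seq(as.Date("2015-01-01"), as.Date("2017-06-30"), by = "day")
n <- length(dates)
humidity <- runif(n, min = 20, max = 100)       # hypothetical covariate
rain_today <- rbinom(n, size = 1, prob = 0.25)
# Made-up relationship, used only to generate an outcome to model.
p_rain <- plogis(-5 + 0.05 * humidity + 0.8 * rain_today)
toy <- data.frame(
  Date = dates,
  Humidity = humidity,
  RainToday = factor(ifelse(rain_today == 1, "Yes", "No")),
  RainTomorrow = rbinom(n, size = 1, prob = p_rain)
)

# Hold out all of 2017 as test data; it is never used when fitting.
is_test <- format(toy$Date, "%Y") == "2017"
train <- toy[!is_test, ]
test <- toy[is_test, ]

# Logistic regression with an interaction term, fitted on training data only.
fit <- glm(RainTomorrow ~ Humidity * RainToday, family = binomial, data = train)

# Predicted probabilities of rain for the held-out year.
test_prob <- predict(fit, newdata = test, type = "response")
test_pred <- as.integer(test_prob > 0.5)  # 0.5 cutoff is an arbitrary choice

# Compare predictions with the observed 2017 outcomes.
confusion <- table(Predicted = test_pred, Observed = test$RainTomorrow)
accuracy <- mean(test_pred == test$RainTomorrow)
print(confusion)
print(accuracy)
```

Splitting by calendar year rather than at random keeps the test set strictly in the future relative to the training data, which matches the forecasting goal. A GLMM version of the same idea might add a random intercept per city (for example with glmer from the lme4 package), at the cost of longer fitting times.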
2021-08-27