

STA238 - Summer 2021

Final Project Instructions


Submission Instructions

This is an individual project. You are expected to work on this independently. You are more than welcome to discuss ideas, code, concepts, etc. regarding this assessment with your classmates. Please do NOT share your code or your written text with your peers. It is expected that all code and written work should be written by yourself (unless they are taken from the materials provided in this course or are from a credible source which you have cited). Please note, this project is fairly open, so the content of most of the work completed here should not match that of your peers.

Your submission will consist of two components:

1. .Rmd file

2. .pdf file


Submission

You will submit one .Rmd file that you created for this project AND the resulting pdf (i.e., the one you ‘Knit to PDF’ from your .Rmd file). These two files must be uploaded into a Quercus assignment by 11:59PM ET on Monday, August 23rd.

We will be directly marking the LATEST submission of the .pdf (submitted on/before the due date/time). If your LATEST submission does not contain a .pdf AND an .Rmd then you will receive a 0 on this Project. There will be a short grace period (1 extra hour) to account for technical difficulties. Anything submitted after the grace period will not be accepted. Late projects are NOT accepted. Furthermore, all submissions must be made on Quercus. Email submissions are NOT accepted. Please consult the course syllabus for other inquiries regarding extensions.


Project grading

As mentioned above, this project will be marked based on the output in the pdf submission. You must submit both the Rmd and pdf files for this project to receive full marks in terms of reproducibility. Furthermore, this is an individual project. You are expected to work individually. The workload level is higher than that of an assignment, since this is a project. Thus, it is recommended that you start early.

This project will be graded based on the rubric available on the Assignment Quercus page. TAs will look over each section (on the submitted pdf) and select the appropriate grade for that section based on a coarse overview (one-time read over) of that section (of the pdf). Your project should be easily understood by the average university-level student after reading it once. I would suggest you make sure your (pdf) document looks clean, aesthetically pleasing, and has been proofread.


Description

In this project you will write a report on a data analysis in which your main methodology will consist of a collection of techniques taught in STA238 Summer 2021. The methodology must include the following:

● at least one simple linear regression;

● at least one confidence interval (either through a bootstrap or the Z/t approach);

● at least one maximum likelihood estimation (which implies you need a frequentist model);

● at least one hypothesis test;

● at least one Bayesian credible interval (which implies you need a Bayesian model);

● at least EITHER a maximum likelihood estimator derivation (for your maximum likelihood estimate) OR a posterior distribution derivation (for your Bayesian model). Mathematical derivations should go in the appendix.
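To illustrate the bootstrap option for the confidence interval requirement in the list above, here is a minimal sketch of a percentile bootstrap interval for a mean. It is written in Python only to show the idea; the project itself is written in R Markdown. The function name `bootstrap_percentile_ci`, its parameters, and the `hours` values are hypothetical and are not taken from the course materials.

```python
import random
import statistics


def bootstrap_percentile_ci(sample, stat=statistics.mean, n_boot=2000,
                            level=0.95, seed=238):
    """Percentile bootstrap confidence interval for `stat` of `sample`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(sample)
    # Resample with replacement n_boot times, recording the statistic each time.
    boot_stats = sorted(stat(rng.choices(sample, k=n)) for _ in range(n_boot))
    alpha = 1 - level
    # Positions of the alpha/2 and 1 - alpha/2 empirical quantiles.
    lower_idx = round((alpha / 2) * n_boot)
    upper_idx = round((1 - alpha / 2) * n_boot) - 1
    return boot_stats[lower_idx], boot_stats[upper_idx]


# Hypothetical data: made-up values used only to exercise the function.
hours = [6.5, 7.0, 8.0, 5.5, 7.5, 6.0, 9.0, 7.0, 6.5, 8.5]
low, high = bootstrap_percentile_ci(hours)
print(f"95% bootstrap CI for the mean: ({low:.2f}, {high:.2f})")
```

The percentile method is only one of several bootstrap intervals; whichever approach is used, the choice should be justified in the Methods section.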

Please keep in mind that this analysis is for our course. Thus the analysis should answer a question about an underlying random process we have data from. You will find some data, form an interesting question and answer the question through your analysis. Your question should be stated clearly so that the reader can quickly identify it in the introduction (and perhaps repeated more formally as a hypothesis test in the methods section). In order to showcase all the different methodologies listed above, you may need multiple research questions. However, all the research questions should relate to one another and cover similar topic areas.

There should be no evidence that this is a class project; I should be able to take a screenshot of this and paste it into a newspaper/blog. There should be no raw code. All output, tables, figures, etc. should be nicely formatted.

Make sure that the data is appropriate for your methods. Pick something that is interesting to investigate and has variables appropriate for the methodology you are going to perform. Again, the analysis should be to answer a question about an underlying random process we have data from. Please post on Piazza or email me at [email protected] for clarification on appropriate data.

The material and text in this project should be different from that of your previous assignments in this course. Thus, you should NOT directly copy your previous assignment work. We highly encourage you to use feedback from previous assignments to amend/proofread/update your Final Project. If your work is a direct copy of a previous submission or is a direct copy of another person’s submission, this is considered an academic offense.


Data

You will be using data from a Statistics Canada survey or census accessed through ODESI: http://odesi2.scholarsportal.info/webview/. ODESI allows users from registered institutions (such as UofT) to access data from surveys conducted by Statistics Canada. Detailed instructions on how to navigate the portal, download data, obtain documentation, etc., are provided in Lecture 11. I will provide a demo in class and post instructions in the lecture slides.

You must use data obtained from ODESI for the project. If you use data obtained elsewhere, you will receive a 0 on the project.

When working with public survey data, there are a few things to consider when cleaning and processing your data. For this reason, I highly recommend that you review the tips provided at the end of Lecture 11 to make the process easier for yourself.


Sections of the Report

The report will consist of 8 sections: Abstract, Introduction, Data, Methods, Results, Conclusions, Bibliography and Appendix.


Abstract

The goal of the abstract is to provide the reader with a summary of the report.

Your Abstract section should include the following:

● One or two sentences describing the introduction.

● One or two sentences describing the data.

● One to three sentences describing the methods.

● One to three sentences describing the results.

● One or two sentences describing the conclusions.


Introduction

The goal of the Introduction section is to introduce the overall “problem” to the reader.

Your Introduction section should include the following:

● Describe the data and the problem in 2-3 clear sentences.

● Introduce the importance of the analysis.

● Get the reader interested in and excited about the analysis.

● Provide some background/context explaining the global relevance of the problem/data/analysis.

● Introduce terminology and prep the reader for the following sections.

● Introduce the research question.

● Introduce the hypotheses.


Data

The goal of the Data section is to introduce the reader to the data set, showcase some meaningful aspects of the data, and get them thinking about potential hypotheses/findings.

Your Data section should include the following:

● A description of the data collection process.

● A summary of the cleaning process (if you cleaned the data).

● A description of the important variables.

● Some appropriate numerical summaries (at minimum center and spread, but something else may be more appropriate). If there are a lot, please put them in a well formatted and labelled table.

● At least 1 aesthetically pleasing plot/graph/figure (No more than 4 plots).

● Text explaining/highlighting each table or figure.

● Some text (and perhaps graphical summaries) of the variables you will perform the bootstrap on (don’t do the bootstrap here - just prep the reader for what is coming in later sections). This should help prep the reader in understanding why the CI is important/interesting and whether it is appropriate.

● In line referencing/text if needed.

● Reference the programming language/software used to complete this section.
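As an illustration of the "center and spread" summaries mentioned in the list above, the following minimal sketch computes a small set of summaries for one numeric variable. It is written in Python only to show the idea; the project itself is written in R Markdown. The function `summarize` and the `values` data are hypothetical.

```python
import statistics


def summarize(values):
    """Return centre and spread summaries for one numeric variable."""
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),  # sample standard deviation (n - 1)
        "iqr": q3 - q1,
    }


# Hypothetical data: made-up values used only to exercise the function.
values = [2, 4, 4, 4, 5, 5, 7, 9]
summary = summarize(values)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In the report itself, summaries like these would appear in a labelled, formatted table rather than as printed output.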


Methods

The goal of the Methods section is to introduce the reader to the statistical methods that you will be using to analyze the data.

Your Methods section should include the following:

● A complete explanation of what each methodology you are using entails.

● An explanation of any assumptions.

● An explanation of the parameters of interest (e.g., mean, variance, percentile).

● Justification for your choice of model (justify your choice of distribution, likelihood, prior, etc.).

● Any rigorous mathematical computations (i.e., the MLE derivation or the posterior derivation) should go into the Appendix.
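As an illustration of the level of detail an MLE derivation might contain, here is a worked sketch under an assumed model that is not prescribed by this project: observations $x_1, \dots, x_n$ treated as independent draws from an Exponential distribution with rate $\lambda > 0$. Your own derivation should use whichever model you justify in this section.

```latex
\begin{align*}
L(\lambda) &= \prod_{i=1}^{n} \lambda e^{-\lambda x_i}
            = \lambda^{n} \exp\!\Big(-\lambda \sum_{i=1}^{n} x_i\Big), \\
\ell(\lambda) &= \log L(\lambda) = n \log \lambda - \lambda \sum_{i=1}^{n} x_i, \\
\ell'(\lambda) &= \frac{n}{\lambda} - \sum_{i=1}^{n} x_i = 0
  \quad\Longrightarrow\quad
  \hat{\lambda} = \frac{n}{\sum_{i=1}^{n} x_i} = \frac{1}{\bar{x}}, \\
\ell''(\lambda) &= -\frac{n}{\lambda^{2}} < 0
  \quad\text{for all } \lambda > 0 .
\end{align*}
```

The negative second derivative confirms that the critical point is a maximum, so $\hat{\lambda} = 1/\bar{x}$ is the maximum likelihood estimate under this assumed model.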


Results

The goal of the Results section is to present the results of the statistical analyses to the reader.

Your Results section should include the following:

● The results of the methodologies included in the report.

● An explanation/interpretation of the results.

● Some commentary on whether or not the results seem reasonable.

● Text explaining/highlighting each table or figure.


Conclusions

The goal of the Conclusions section is to present the story of your analysis to the reader.

Your Conclusions section should include the following:

● A brief recap of the hypotheses, methods, and results.

● State (or re-iterate) your key results.

● State any reasonable conclusions drawn from the results.

● An explanation/interpretation of the results.

● Some commentary on any drawbacks/limitations.

● Recommendations for Next Steps for future analyses/reports.


Bibliography

A well-formatted bibliography, with the references presented as a list. Each reference should have been cited in the text above.


Appendix

The goal of the appendix is to include any supplementary, non-primary information.

Your appendix should include:

● the MLE derivation OR

● the Bayesian posterior distribution derivation
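As an illustration of what a posterior derivation might look like, here is a worked sketch under an assumed model that is not prescribed by this project: a count $y$ of successes out of $n$ trials with $y \mid \theta \sim \text{Binomial}(n, \theta)$ and a prior $\theta \sim \text{Beta}(a, b)$, where $a$ and $b$ are hypothetical prior parameters that you would need to justify.

```latex
\begin{align*}
p(\theta \mid y) &\propto p(y \mid \theta)\, p(\theta) \\
  &\propto \theta^{y} (1 - \theta)^{n - y} \cdot \theta^{a - 1} (1 - \theta)^{b - 1} \\
  &= \theta^{a + y - 1} (1 - \theta)^{b + n - y - 1},
  \qquad 0 < \theta < 1 .
\end{align*}
```

This is the kernel of a Beta distribution, so under these assumptions $\theta \mid y \sim \text{Beta}(a + y,\; b + n - y)$, and a credible interval can be read from the quantiles of that distribution.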


General Notes:

● A standard report would normally not include this much variation in methodology. It is asked here since we want you to display your understanding of the course material.

● Again, this analysis is for our course. The analysis should answer a question about an underlying random process we have data from. That means assuming that the data were generated from some distribution, and estimating the parameters.

● Your question should be stated clearly so that the reader can quickly identify it in the introduction (and repeated maybe more formally as a hypothesis test, or some other methodology, in the methods section).

● It is expected that you include at minimum the required methodology in your report, but you can include more (i.e., maybe you want to look at more variables).

● All tables/figures should be well labelled and clean.

● Everything in this project should be written in full sentences/paragraphs.

● There should be no evidence that this is a class project; I should be able to take a screenshot of any section and paste it into a newspaper/blog.

● There should be no raw code. Any output should be nicely formatted.

● You will also need a reference section. You should reference the data, any outside code/documentation and any ideas/concepts that are taken outside of the course.

● Note, we are not marking grammar, but we are looking for clarity. It is important that you communicate in a clear and professional manner. I.e., no slang or emojis should appear.

● Use full sentences.

● Be specific. Remember, you are selecting this data and the reader/marker may not be familiar with it. A good principle is to assume that your audience is not aware of the subject matter.

● Remember to end with a conclusion. This means reiterating the key points from your writing.

● Don’t forget to include an appropriate title.