Assignment 2

STA 238 Summer 2021

Due Date: August 2, 2021 at 11:59pm


Instructions

This is an individual assignment, and you are expected to work on it independently. You are more than welcome to discuss ideas, code, concepts, etc. regarding this assignment with your classmates, but please do not share your code or your written text with your peers. All code and written work must be your own (unless it is taken from the materials provided in this course or from a credible source that you have cited). Please note that this assignment is fairly open-ended, so the content of most of your work should not match that of your peers.


How do I hand in this assignment for the August 2nd deadline?

Question 1 and Question 2 have separate submission locations on Quercus.

Submissions must be uploaded to Quercus by 11:59PM EDT, on Monday, August 2nd. Late assignments are not accepted. Please consult the course syllabus for information on the grace period.

We will be marking the pdf files directly, so please ensure that your final submission looks the way you want it to before submitting it.


Submission for Question 1

Submit your complete .Rmd file that you create for this question AND the resulting pdf (i.e., the one obtained using ‘Knit to PDF’ from your .Rmd file).

As mentioned above, this question will be marked based on the output in the pdf submission.

This question will be graded based on the rubric available on the Assignment Quercus page. TAs will look over each section and select the appropriate grade for that section based on a coarse overview (a single read-through) of that section. Your assignment should be easily understood by the average university-level student after reading it once. I suggest you make sure your document looks clean, is aesthetically pleasing, and has been proofread. You will be able to see the rubric grade for each section. There may be some comments/feedback provided (by the TAs) if the same issue arises in multiple sections, but you will likely receive no comments/feedback (due to the size of the class and the volume of marking).


Submission for Question 2

Question 2 can be hand-written or typed. Submit Question 2 as a single PDF file.


Question 1: Write-up on Toronto Open Data (44 Marks)

In Assignment 1, you produced a write-up with an exploratory data analysis of a dataset. In a statistical report, that write-up would be similar to the ‘Data’ section, where you introduce the data and show key numerical/graphical summaries. The ‘Data’ section is typically followed by the ‘Model’ and ‘Results’ sections. The ‘Model’ section introduces and describes the statistical model that you will be using. The ‘Results’ section provides the final estimates of model parameters, interpretation of estimates, and other remarks.

In this question you will create a “Model” (or “Methods”) section and a “Results” section of a report. This will allow you to propose TWO suitable models for a dataset, estimate parameters in that model, and summarize a meaningful aspect of the data.

You can refer to Lecture 7 for examples of how to propose and fit appropriate models. So far, we have introduced 3 types of models:

● univariate frequentist models

● univariate Bayesian models

● linear regression models

Using the 3 types of models listed above, you will propose TWO models, each of a different type. For example, you might propose 1 Bayesian model AND 1 linear regression model.

You are required to find data through the Toronto Open Data Portal (https://open.toronto.ca/). You can pull the data using the opendatatoronto package OR download the data and then read it into R.

Feel free to use the same data as you did on Assignment 1, just make sure that the data is appropriate for your model.
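The first option for obtaining data can be sketched roughly as follows. This is a minimal illustration rather than a required workflow: the helper name fetch_first_csv, the search term, and the file name are hypothetical, and the sketch assumes the package you pick has at least one CSV resource. It relies on the opendatatoronto functions search_packages(), list_package_resources(), and get_resource().

```r
# Sketch: fetch the first CSV resource of the first package matching a search term.
# `fetch_first_csv` is a hypothetical helper written for illustration only.
fetch_first_csv <- function(search_term) {
  # Find packages (datasets) whose titles match the search term.
  packages <- opendatatoronto::search_packages(search_term)
  if (nrow(packages) == 0) stop("No packages matched: ", search_term)

  # List the downloadable resources of the first matching package.
  resources <- opendatatoronto::list_package_resources(packages[1, ])

  # Keep only CSV resources, then download the first one as a data frame.
  csv_rows <- resources[which(tolower(resources$format) == "csv"), ]
  if (nrow(csv_rows) == 0) stop("No CSV resource in the first matching package.")
  opendatatoronto::get_resource(csv_rows[1, ])
}

# Usage (requires an internet connection and the opendatatoronto package):
# my_data <- fetch_first_csv("bicycle")   # hypothetical search term
#
# Alternative: read a file downloaded by hand from the portal.
# my_data <- read.csv("my_download.csv")  # hypothetical file name
```

In practice you would likely inspect the search results and the resource list yourself instead of blindly taking the first row, since the first match may not be the dataset you want.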

The goal of the Model section is to introduce the reader to the model (or statistical methods) that you will be using to analyze the data.

Your Model section should include the following:

● Formally state the mathematical models

● An explanation of the models for a general science reader (i.e., not a statistician).

● A description of why the models are appropriate (based on assumptions, variable types and practical rationale).

Your Results section should include the following:

● A clear statement of the parameter estimates. (Perhaps in a table if you have multiple parameters)

● An explanation/interpretation of the results.

● Some commentary on whether or not the results seem reasonable (based on prior knowledge, common sense, nature of the data, etc.).

● Text explaining/highlighting each table or figure.


Finding appropriate data

It may take a few tries to find data that is appropriate for a particular type of model. If needed/preferred, you may use two different datasets (one for each of your two models).

In the case that you use 2 datasets and the topic areas of dataset #1 and dataset #2 are completely different, you can create different ‘Model’ and ‘Results’ sections for each dataset. That is, you can create shorter ‘Model’ and ‘Results’ sections using dataset #1, followed by additional shorter ‘Model’ and ‘Results’ sections using dataset #2.

Whatever you choose, please ensure that your work is organized and easy to follow with appropriate headings. The easier it is for the TAs to figure out what you are doing, the happier the TAs will be while marking your work.


General Notes (for Question 1):

● All tables/figures should be well labelled and clean.

● Everything in Question 1 should be written in full sentences/paragraphs.

● There should be no evidence that Question 1 is an assignment; I should be able to take a screenshot of this section and paste it into a newspaper/blog.

● There should be no raw code, error messages, or warning messages. Any output should be nicely formatted.

● Grammar is not the main focus of the assessment, but clarity is. Communicate in a clear and professional manner, i.e., no slang or emojis should appear.

● If you are writing your report directly in RStudio, you can check your spelling using Edit > Check Spelling.

● Be specific. Remember, you are selecting this data and the reader/marker may not be familiar with it. A good principle is to assume that your audience is not aware of the subject matter.


Question 2: Comparing the Frequentist and Bayesian Normal Models (22 Marks)

Scores on IQ tests are designed to follow a normal distribution with a mean of 100 and a standard deviation of 15 when applied to the general population.

Suppose we sample $n$ people in a particular town and administer an IQ test to each person in our sample in order to estimate $\mu$, the town-specific mean IQ score. Let $Y_i$ be the IQ score for the $i$-th person in the sample.

For this question, you can assume that the standard deviation of IQ scores in this town is 15 ($\sigma^2 = 225$), the number of observations is 20 ($n = 20$), and $\bar{y}_{20} = 110$.

Finally, suppose that the true value of $\mu$ is 112. Of course, this is usually the unknown parameter of interest, but for the sake of this question we will treat its value as known.

For this question you can use any result that we’ve discussed in class. You do not need to derive any result that we’ve covered in class.

a) First, let’s use a normal model within the frequentist framework with a maximum likelihood estimator. So, the model is

$Y_1, \dots, Y_{20} \sim N(\mu, \sigma^2 = 225)$

State the maximum likelihood estimate $\hat{\mu}_{ML}$. (1 mark)

b) Compute the value of $\mathrm{Bias}(\hat{\mu}_{ML})$, $\mathrm{Var}(\hat{\mu}_{ML})$ and $\mathrm{MSE}(\hat{\mu}_{ML})$. (3 marks)
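As a reminder, the three quantities are assumed here to follow the standard definitions for a generic estimator of a parameter. The symbols $\hat{\theta}$ and $\theta$ are placeholders introduced for this reminder and are not part of the question itself.

```latex
\begin{align*}
\mathrm{Bias}(\hat{\theta}) &= E[\hat{\theta}] - \theta, \\
\mathrm{Var}(\hat{\theta}) &= E\big[(\hat{\theta} - E[\hat{\theta}])^2\big], \\
\mathrm{MSE}(\hat{\theta}) &= E\big[(\hat{\theta} - \theta)^2\big]
  = \mathrm{Var}(\hat{\theta}) + \mathrm{Bias}(\hat{\theta})^2 .
\end{align*}
```

All expectations are taken over the sampling distribution of the data, with the true parameter value held fixed.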

c) Determine the sampling distribution for $\hat{\mu}_{ML}$. Show and justify your steps. (2 marks)

d) Now, let’s use a normal model within a Bayesian framework. We will assume the following model:

$Y_1, \dots, Y_{20} \mid \mu \sim N(\mu, \sigma^2 = 225)$

$\mu \sim N(100, 225)$

State the posterior distribution of $\mu$ and compute the Bayesian point estimate $\hat{\mu}_{Bayes}$. (4 marks)

e) Compute the value of $\mathrm{Bias}(\hat{\mu}_{Bayes})$, $\mathrm{Var}(\hat{\mu}_{Bayes})$ and $\mathrm{MSE}(\hat{\mu}_{Bayes})$. Show and justify your steps. (7 marks)

f) Determine the sampling distribution for $\hat{\mu}_{Bayes}$. Show and justify your steps. (3 marks)

g) Compare the bias and MSE of $\hat{\mu}_{ML}$ and $\hat{\mu}_{Bayes}$. (2 marks)
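One way to sanity-check an analytical comparison of two estimators is a small simulation. The sketch below is only an illustration under assumed settings: every number in it (true mean, standard deviation, sample size, fixed guess, and shrinkage weight) is hypothetical and deliberately different from the values in this question, and the "shrunk" estimator is a generic weighted average, not the estimator you are asked to derive.

```r
set.seed(238)

# Hypothetical settings for illustration only (not the assignment's values).
true_mean   <- 50
sigma       <- 10
n           <- 25
n_sims      <- 20000
fixed_guess <- 45    # hypothetical value the second estimator shrinks toward
weight      <- 0.8   # hypothetical weight placed on the sample mean

# Draw many samples and record the sample mean of each one.
sample_means <- replicate(n_sims, mean(rnorm(n, mean = true_mean, sd = sigma)))

# A generic shrinkage estimator: weighted average of sample mean and fixed guess.
shrunk <- weight * sample_means + (1 - weight) * fixed_guess

# Approximate bias, variance, and MSE from the simulated estimates.
summarise_estimator <- function(estimates, truth) {
  bias     <- mean(estimates) - truth
  variance <- mean((estimates - mean(estimates))^2)
  mse      <- mean((estimates - truth)^2)
  c(bias = bias, variance = variance, mse = mse)
}

results <- rbind(
  sample_mean = summarise_estimator(sample_means, true_mean),
  shrunk      = summarise_estimator(shrunk, true_mean)
)
print(round(results, 3))
```

Because the variance here uses divisor n_sims, each row of results satisfies mse = variance + bias^2 up to floating-point error, which makes the trade-off between bias and variance easy to read off the table.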