Assignment 1

Assignment description

   Due July 18 at 4pm.

   Do each question in its own R Markdown file and knit to PDF. There are 4 questions, so you will need to create 4 .Rmd files and submit 4 .pdf files. Some guidelines for R Markdown:

   Use headings to indicate which part of the question each answer corresponds to. Specifically, use ## (Heading 2) to indicate what part of a question you are answering, and ### (Heading 3) to indicate that your answer corresponds to a sub-part of the question. For example, if Question 1 has 2 parts, each with 2 sub-parts, your Markdown file would have the following headings:

## Part 1

Answer to part 1 of the question

### 1 (i)

Answer to part 1 (i) of the question

### 1 (ii)

Answer to part 1 (ii) of the question

## Part 2

Answer to part 2 of the question

### 2 (i)

Answer to part 2 (i) of the question

### 2 (ii)

Answer to part 2 (ii) of the question

   Insert pictures of handwritten work where showing your math is required, using ![Caption](image.jpg) or ![Caption](image.png). Alternatively, if you are familiar with LaTeX, you can type up your work directly in Markdown.

   Set seed to 07182022

   Code should be commented within the code chunk so that the reader can follow along with what your code is intended to do

   Figures should be made with ggplot2 and labelled appropriately (including title, axes, etc.)
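
As a rough sketch of how these guidelines might look inside a single code chunk: the data frame `toy` and its column `value` are made up for illustration and are not part of the assignment.

```r
library(ggplot2)

# Set the seed once, near the top of the file, so results are reproducible
set.seed(07182022)

# Made-up data for illustration only: 100 draws from a standard normal
toy <- data.frame(value = rnorm(100))

# Histogram with a title and axis labels
p <- ggplot(toy, aes(x = value)) +
  geom_histogram(bins = 20, colour = "black", fill = "grey") +
  labs(title = "Histogram of simulated values",
       x = "Value", y = "Count") +
  theme_bw()
p
```

R reads `07182022` as the number 7182022, so the leading zero has no effect on the seed.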

Marks for assignments will have two components:

1. Overall completion. This is to reward effort.

2. A few questions (or subsections of questions) will be selected for detailed grading. The calculations, code, and explanations in these questions will be graded on accuracy and clarity. This is to provide feedback to help you in your learning.

Please consider the communication guidelines (in the syllabus) when putting together your assignment.

A few recommendations:

   Checking that your Rmd file knits as expected along the way can help prevent last minute delays.

   Running code line-by-line is helpful for commenting, as well as for troubleshooting.

Submit your assignment

After you have completed the assignment, please save, scan, or take photos of your work and upload your files to the questions below. Crowdmark accepts PDF, JPG, and PNG file formats.

Q1 (25 points)



Consider the triangle kernel, K(u), plotted below.

 

Part 1

Verify that K(u) is a density function.
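
A numerical sanity check can accompany the written argument, though it does not replace it. The sketch below assumes the usual form of the triangle kernel, K(u) = 1 − |u| for |u| ≤ 1 and 0 otherwise; the plot is not reproduced here, so confirm this form against the figure. The function name `triangle_kernel` is a hypothetical helper.

```r
# Assumed form of the triangle kernel (check against the plot):
# K(u) = 1 - |u| for |u| <= 1, and 0 otherwise
triangle_kernel <- function(u) pmax(1 - abs(u), 0)

# Non-negativity, checked on a grid of points
u_grid <- seq(-2, 2, by = 0.01)
all_nonnegative <- all(triangle_kernel(u_grid) >= 0)

# Numerical integral over the support [-1, 1]; should be close to 1
total_area <- integrate(triangle_kernel, lower = -1, upper = 1)$value

all_nonnegative
total_area
```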

Part 2

The triangle kernel is to be used as the basis for a kernel density estimate. As a first step in its construction, for each $x_i$ in the data, we construct $K\left(\frac{x - x_i}{h}\right)$. Suppose $n = 3$, the data are $x_1 = 2$, $x_2 = 4$, $x_3 = 7$, and $h = 0.3$. Draw the result of this first step. Please use a straight edge to draw your axes and label key features (e.g., the height and base of each triangle).

Part 3

Using R, graph the full kernel density estimate with a triangle shape kernel for the above data and bandwidth.
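
One way to evaluate a kernel density estimate by hand is sketched below, using the standard definition f̂(x) = (1/(nh)) Σ K((x − x_i)/h). The kernel form, the helper names `triangle_kernel` and `kde_triangle`, and the inputs `toy_data` and `h` are all assumptions made for illustration; they are not the assignment's data.

```r
# Assumed triangle kernel: K(u) = 1 - |u| on [-1, 1], 0 elsewhere
triangle_kernel <- function(u) pmax(1 - abs(u), 0)

# Kernel density estimate: f_hat(x) = 1 / (n * h) * sum_i K((x - x_i) / h)
kde_triangle <- function(x, data, h) {
  sapply(x, function(x0) {
    sum(triangle_kernel((x0 - data) / h)) / (length(data) * h)
  })
}

# Illustrative inputs (hypothetical, not the assignment's data)
toy_data <- c(1, 1.5, 3)
h <- 0.5

# Evaluate the estimate on a fine grid covering the data
grid <- seq(0, 4, by = 0.01)
f_hat <- kde_triangle(grid, toy_data, h)

# The estimate should integrate to roughly 1 (Riemann sum over the grid)
area <- sum(f_hat) * 0.01
area
```

The vectors `grid` and `f_hat` can then be placed in a data frame and drawn with `geom_line()`.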

Part 4

Simulate 997 additional data points, drawn from a discrete uniform distribution. That is, $X_i \sim \text{unif}\{1, 10\}$. Use the function sample in R and graph a histogram and a kernel density estimate on the same plot.
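
A minimal sketch of overlaying a histogram and a kernel density estimate in ggplot2 follows. The sample size and range are hypothetical, chosen only to show the pattern. The histogram is put on the density scale with `after_stat(density)`, which assumes ggplot2 3.3.0 or later, so that both layers share a y-axis.

```r
library(ggplot2)

set.seed(07182022)

# Illustrative draw: 200 values from a discrete uniform on {1, ..., 6}
# (hypothetical size and range, not the assignment's)
draws <- data.frame(x = sample(1:6, size = 200, replace = TRUE))

# Histogram on the density scale with a kernel density estimate overlaid
p <- ggplot(draws, aes(x = x)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 1,
                 colour = "black", fill = "grey") +
  geom_density(colour = "steelblue") +
  labs(title = "Histogram and KDE of simulated values",
       x = "Value", y = "Density") +
  theme_bw()
p
```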

Part 5

Let’s try with real numbers instead of integers. Simulate 1000 data points, drawn from a continuous uniform distribution. That is, $X_i \sim U(1, 10)$. Use the function runif in R and graph a histogram and a kernel density estimate on the same plot.


 


Q2 (25 points) 

In Section 2.5 of the supplementary materials for the course, data on Toronto rental housing is analyzed. Download the data file called toronto-apartment-building-evaluations.csv from https://github.com/awstringer1/sta238-book/tree/master/data/apartment-data. We will use this data to compare the quality of apartment options around two universities in Toronto: University of Toronto St George campus and Toronto Metropolitan University, in wards 11 and 13 respectively. To measure quality, consider the variable SCORE.

Part 1: Load and prepare the data.

Part 1 (i)

The variables we need for this analysis are WARD, SCORE, and CONFIRMED_STOREYS. Rename these as ward, score, and storeys. How many observations are there in Ward 11? How many in Ward 13?

 

Part 1 (ii)

Some observations are missing a score. Remove these observations. How many observations are removed from Ward 11? How many are removed from Ward 13? Compare the proportions with missing scores between the two wards. What assumptions are we making when we remove properties with missing scores? Hint: Consider what could be different about the observations that are missing scores from the observations that have scores. Comment on how different proportions of missing data or violations of the assumptions could change how we understand our results.

Part 2: Numerical Summaries

Part 2 (i)

Calculate the following summary statistics for apartment scores: mean, median, and standard deviation.

 

Part 2 (ii)

Calculate the same statistics, but for apartments in Ward 11 and for apartments in Ward 13. Hint: Try using group_by(ward)
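
The `group_by()` pattern from the hint can be sketched on a small made-up table. The values below are invented for illustration and are not taken from the apartment data.

```r
library(dplyr)

# Made-up scores for illustration only; not taken from the apartment data
toy <- tibble(
  ward  = c(11, 11, 11, 13, 13, 13),
  score = c(70, 75, 80, 60, 65, NA)
)

# Drop rows with a missing score, then summarise within each ward
by_ward <- toy %>%
  filter(!is.na(score)) %>%
  group_by(ward) %>%
  summarise(
    n      = n(),
    mean   = mean(score),
    median = median(score),
    sd     = sd(score)
  )
by_ward
```

Dropping the `group_by()` line gives the same statistics for all rows together.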


 

Part 2 (iii)

Based on the numerical summaries, how does the quality of apartments in Ward 11 compare to Ward 13? How do they each compare to the quality of apartments overall in Toronto?

Part 3: Graphical comparisons

Part 3 (i)

Make a histogram & KDE on the same plot. Add vertical lines indicating the mean of Ward 11 and the mean of Ward 13. Hint: This code snippet is an example of how to add two vertical lines: geom_vline(xintercept=c(3, 4), linetype='dashed', color=c('blue', 'red')). Generally, we’d want to label these lines, but for now, simply indicate which colour is which ward in your answer.

 

Part 3 (ii)

Make an eCDF of apartment scores in Toronto. Add vertical lines indicating the median of Ward 11 and the median of Ward 13 (same labelling convention as above). Why would you add a median to the eCDF but a mean to the histogram & KDE?

 

Part 3 (iii)

Make boxplots to compare Ward 11 and Ward 13.

Part 4: Scatterplot

Part 4 (i)

Create a scatterplot comparing storeys and score.

 

Part 4 (ii)

To investigate this relationship, consider the following model:

$$\text{score}_i = \alpha + \beta \, \text{storeys}_i + U_i$$

where $U_i$, $i = 1, \ldots, n$, are iid random variables. What are the parameters of this model? What is usually assumed about $U_i$ for this kind of statistical model?

Part 4 (iii)

Add a linear model to the scatterplot.
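
One possible way to fit and draw a least-squares line is sketched below on simulated data. The variables `x` and `y` and the true coefficients are made up for illustration. `geom_smooth(method = "lm")` is one option for adding the line; `geom_abline()` with the fitted coefficients is another.

```r
library(ggplot2)

set.seed(07182022)

# Simulated data for illustration only: y = 1 + 2x + noise
toy <- data.frame(x = runif(50, min = 0, max = 10))
toy$y <- 1 + 2 * toy$x + rnorm(50)

# Least-squares estimates of the intercept and slope
fit <- lm(y ~ x, data = toy)
coef(fit)

# Scatterplot with the fitted line added by geom_smooth()
p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(title = "Simulated data with fitted line", x = "x", y = "y") +
  theme_bw()
p
```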

Part 5: Conclusions

Comment on the following questions: i. How do Ward 11 and Ward 13 compare to the rest of Toronto? ii. How does Ward 11 compare to Ward 13? iii. Would you prefer to live in a high rise? iv. What are some limitations of this analysis?



Q3 (25 points)

 

Let $X_i \sim \text{Cauchy}$, $i = 1, \ldots, n$, be a random sample from a Cauchy distribution. This has $E[X_i] = \text{Var}(X_i) = \infty$ for each $i = 1, \ldots, n$. Let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$.

Part 1

What does the LLN tell us about $\bar{X}_n$ for the Cauchy distribution?

Part 2

Adjust the following code to simulate n = 1000 values from a Cauchy distribution that is centred at 0 and has scale parameter 1. Add appropriate labelling to the plot. Run the code in order to plot a histogram of the simulated values and the true density for a Cauchy distribution. Note that the plot has been truncated to (−10, 10).

library(tidyverse)  # provides tibble(), %>%, and ggplot2

set.seed(07182022)

# Parameters for a Cauchy distribution
mu <- 2
sigma <- 1/3

# Create data for plotting the true density for a Cauchy distribution
x <- seq(-10, 10, by = 0.1)
cauchy_density <- dcauchy(x, location = mu, scale = sigma, log = FALSE)

# Simulate n data values from a Cauchy distribution
n <- 400
cauchy_experiment <- rcauchy(n = n, location = mu, scale = sigma)

# Plot the true density and a histogram of the simulated values
# (the histogram line is cut off after "alp" in the source; alpha = 0.5 is an assumed value)
tibble(x, y = cauchy_density) %>%
  ggplot(aes(x = x, y = y)) +
  theme_bw() +
  geom_histogram(data = tibble(x = cauchy_experiment), aes(x = x, y = ..density..), binwidth = 0.5, colour = 'black', fill = "grey", alpha = 0.5) +
  coord_cartesian(xlim = c(-10, 10), ylim = c(0, 0.35), expand = FALSE) +
  geom_line(colour = 'steelblue', size = 1)

Part 3

For a histogram with bins $B_j$, $j = 1, \ldots, m$, consider the following: Fix some $x \in \mathbb{R}$, and suppose that for this $x$, some $j \in \{1, \ldots, m\}$, and $\epsilon_n > 0$, $B_j = (x - \epsilon_n/2, x + \epsilon_n/2]$ where $\epsilon_n \to 0$ and $n\epsilon_n \to \infty$ as $n \to \infty$. Then by the Law of Large Numbers, we can state:

$$\frac{1}{n\epsilon_n}\sum_{i=1}^{n} \mathbb{1}(X_i \in B_j) \xrightarrow{p} f(x)$$

that is, under these conditions, the histogram converges in probability to the PDF of Xi . Do you expect this to be true for the random sample of the Cauchy distribution, as discussed in this question? Explain why or why not.

Part 4

Run the following code to consider the running average of sample means for n = 1, 2, 3, . . . , 1000 .

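# Running average: mean of the first k simulated values, for k = 1, ..., n
runningavg <- cumsum(cauchy_experiment) / 1:length(cauchy_experiment)

# Plot the running average against the sample size, with a reference line at 0
tibble(x = runningavg) %>%
  ggplot(aes(x = 1:length(cauchy_experiment), y = x)) +
  theme_minimal() +
  geom_point(pch = ".") +
  geom_hline(yintercept = 0, colour = "red", linetype = "dotted")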

 

Part 4 (i)

Recall that $E[X_i] = \infty$. Compute $E[\bar{X}_n]$ (mathematically).

 

Part 4 (ii)

Explain why your answer does, or does not, contradict what is observed in this plot. Hint: what would happen if you ran the simulation for a higher n? What would you have concluded if you had only run the simulation for n = 100?

 

Q4 (25 points)

 

This question deals with the exponential distribution.

Part 1

Assume that the time until a start-up company yields profits is Exponentially distributed with a mean of 2 months (i.e., rate = β = 0.5). Moreover, assume (or perhaps, imagine) that start-ups (and their profits) are independent. Pretend you are a statistical consultant to someone interested in investing in some start-ups. Answer the following questions in R Markdown, noting when you make use of any theorems:

 

Part 1 (i)

If they invest in 30 start-ups what is the probability that the mean time until realizing profits exceeds 2.033 months (~2 months and 1 day)?
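
For a question of this type, one common approach is a normal approximation to the sample mean via the Central Limit Theorem. The sketch below compares it with the exact value from the Gamma distribution of the sum. The function names and the inputs (`threshold = 1.2`, `n = 40`, `rate = 1`) are hypothetical and deliberately differ from the assignment's numbers.

```r
# Normal approximation to P(sample mean > threshold) for n iid
# Exponential(rate) variables, via the Central Limit Theorem:
# the sample mean is approximately N(1 / rate, (1 / rate)^2 / n)
prob_mean_exceeds <- function(threshold, n, rate) {
  mu <- 1 / rate
  se <- (1 / rate) / sqrt(n)
  pnorm(threshold, mean = mu, sd = se, lower.tail = FALSE)
}

# Exact value for comparison: the sum of n iid Exponential(rate) variables
# follows a Gamma(shape = n, rate = rate) distribution
exact_prob_mean_exceeds <- function(threshold, n, rate) {
  pgamma(n * threshold, shape = n, rate = rate, lower.tail = FALSE)
}

# Illustrative inputs (hypothetical, not the assignment's numbers)
approx_p <- prob_mean_exceeds(threshold = 1.2, n = 40, rate = 1)
exact_p  <- exact_prob_mean_exceeds(threshold = 1.2, n = 40, rate = 1)
c(approx = approx_p, exact = exact_p)
```

Comparing the two values gives a sense of how well the approximation performs at a given sample size.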



Part 1 (ii)

If they invest in 50 start-ups, what is a good range of time that you are 95% certain will contain the mean time until profits are achieved?

Part 2

Let $X_i \sim \text{Exponential}(\lambda)$ be an iid random sample from an exponential distribution with rate parameter $\lambda > 0$, with density $f_\lambda(x) = \lambda e^{-\lambda x}$, $x > 0$. In this question you will investigate two different estimators of $\lambda$. Let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $M_n = \text{Med}(X_1, \ldots, X_n)$.

Part 2 (i)

Prove (mathematically) that $E[\bar{X}_n] = 1/\lambda$ and $M_n = \log(2)/\lambda$.

For the remainder of the question, let $T_1 = 1/\bar{X}_n$ and $T_2 = \log(2)/M_n$.

Part 2 (ii)

Perform N = 1,000 simulations from the sampling distributions of T1 and T2 based on random samples of size n = 10 and a true value of λ = 2. Visualize the sampling distributions of these two estimators using a histogram and a KDE. Plot a vertical line at λ = 2 using the geom_vline() function.

Part 2 (iii)

Report the mean, standard deviation, bias, and MSE of both estimators, and the relative efficiency of T1 relative to T2. Which estimator do you prefer, and why?
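
The summary quantities above can be computed from simulated estimates along the lines of the sketch below. It uses a different, hypothetical setting (the sample mean and the sample median as estimators of a normal mean) so that only the pattern carries over. The helper `summarise_estimator` is made up for illustration, and relative efficiency is taken here as a ratio of MSEs, which is one common convention; check the course definition.

```r
set.seed(07182022)

# Summarise simulated draws of an estimator against the true parameter value
summarise_estimator <- function(estimates, truth) {
  c(mean = mean(estimates),
    sd   = sd(estimates),
    bias = mean(estimates) - truth,
    mse  = mean((estimates - truth)^2))
}

# Illustrative setting (hypothetical values, not the assignment's):
# estimate the mean of a N(5, 1) population from samples of size 20
truth <- 5
N <- 500
est_mean   <- replicate(N, mean(rnorm(20, mean = truth)))
est_median <- replicate(N, median(rnorm(20, mean = truth)))

s_mean   <- summarise_estimator(est_mean, truth)
s_median <- summarise_estimator(est_median, truth)

# Relative efficiency of the sample mean relative to the sample median,
# taken here as MSE(median) / MSE(mean); conventions differ between texts
rel_eff <- s_median[["mse"]] / s_mean[["mse"]]

rbind(mean = s_mean, median = s_median)
rel_eff
```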