Assignment 3

2022

Instructions

1) Please submit your solutions to this assignment in one PDF file in Brightspace. Only one file will be accepted.

2) You can submit a PDF file more than once. However, only the last submission will be saved. If you want to modify your submitted assignment, that is fine as long as it is before the deadline.

3) Late submissions of the assignment are not going to be marked.

4) In the second part of the assignment, you must use R for all of your computations. Please use R markdown to write the solutions for this part.

5) You can submit handwritten solutions for part one of the assignment, but please combine images of your handwritten solutions with the PDF produced with R markdown as one PDF. (See https://imagetopdf.com/ as a possible way to combine images into one PDF.)

6) Deadline: Before 11:59 pm on Tuesday, July 5

7) You can work in groups of up to four members.

Part one

You can provide hand-written solutions for this part, but it is not necessary. You are welcome to try to write your solutions with R markdown. For Part One, only use R to compute quantiles and probabilities from a t or F distribution.

1. A city tax assessor was interested in predicting residential home sales as a function of various characteristics of the home and surrounding property. Data on 522 transactions were obtained for home sales from the previous year. We will investigate the relationship between the sales price in dollars and the number of bedrooms in the house.

We imported the data and displayed the structure of the dataframe.

real.estate<-read.csv("RealEstate.csv")

str(real.estate)

## 'data.frame': 522 obs. of 3 variables:
## $ Identification: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Sales.price   : int 360000 340000 250000 205500 275500 248000 229900 150000 195000 160000 ...
## $ Bedrooms      : int 4 4 4 4 4 4 3 2 3 3 ...

(a) We fit a model that describes the sales price according to the number of bedrooms, and displayed the corresponding ANOVA table. We also displayed the estimated coefficients of the model. There is something wrong with this table. What did we forget to do? Discuss.

Hint: Look at the degrees of freedom for the Bedrooms factor.

model<-lm(Sales.price~Bedrooms,data=real.estate)

anova(model)

## Analysis of Variance Table
##
## Response: Sales.price
##            Df     Sum Sq    Mean Sq F value    Pr(>F)
## Bedrooms    1 1.6931e+12 1.6931e+12  107.14 < 2.2e-16 ***
## Residuals 520 8.2178e+12 1.5803e+10
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

coefficients(model)

## (Intercept)    Bedrooms
##    82808.80    56200.08

(b) We coerced the Bedrooms variable to a factor to produce the plots. Is the study balanced?

real.estate$Bedrooms<-factor(real.estate$Bedrooms)

table(real.estate$Bedrooms)

##   1   2   3   4
##   9  64 202 179

(c) Since the frequencies are small at the extremes, i.e. homes with very few or very many bedrooms, we combined homes with 0, 1, or 2 bedrooms into one category, and homes with 5, 6, or 7 bedrooms into another.

library(car)

## Loading required package: carData

real.estate$Bedrooms <- with(real.estate, recode(Bedrooms, "c('0','1','2')='0-2'; c('5','6','7')='5-7'"))

table(real.estate$Bedrooms)

## 0-2 3 4 5-7

## 74 202 179 67

Here are comparative boxplots for the sales price of the home according to the number of bedrooms. Based on these plots, is it reasonable to assume homogeneity of variance? If not, do the plots suggest that we might be able to find a suitable variance-stabilizing transformation?

# comparative boxplots

library(ggplot2)

ggplot(real.estate,
       aes(x = Bedrooms, y = Sales.price)) +
  theme_bw() +
  geom_boxplot(color = "dark grey") +
  geom_jitter(height = 0, width = 0.2) +
  labs(y = "Sales price (in dollars)", x = "Number of bedrooms")

[Figure: comparative boxplots of sales price (in dollars) by number of bedrooms (0-2, 3, 4, 5-7)]

(d) We fitted a log-log model to describe the cell standard deviation as a function of the cell mean. Based on the 95% confidence interval for the slope of this log-log model, what variance-stabilization transformations are suggested?

m <- with(real.estate, tapply(Sales.price, Bedrooms, FUN = mean))

s <- with(real.estate, tapply(Sales.price, Bedrooms, FUN = sd))

model <- lm(log(s) ~ log(m))

# 95% CI for intercept and for the slope

confint(model)

##                  2.5 %    97.5 %
## (Intercept) -5.4004299 16.584404
## log(m)      -0.3896355  1.365977
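
The reasoning behind part (d) can be sketched numerically. Under the usual power-law assumption that the cell standard deviation is roughly proportional to mean^β, the transformation Y^(1 − β) (taken as log Y when β = 1) approximately stabilizes the variance. The snippet below is only an illustrative sketch: slope_ci copies the interval printed above, while the candidate list (its names and powers) is an assumption chosen for illustration.

```r
# 95% CI for the slope (beta), copied from the confint() output above
slope_ci <- c(lower = -0.3896355, upper = 1.365977)

# Assumption: sd is proportional to mean^beta, so the stabilizing power is
# lambda = 1 - beta (lambda = 0 is read as the log transformation)
lambda_ci <- c(lower = 1 - slope_ci[["upper"]],
               upper = 1 - slope_ci[["lower"]])

# Illustrative candidate powers (hypothetical list, not part of the assignment)
candidates <- c(none = 1, sqrt = 0.5, log = 0, reciprocal = -1)

# Candidates whose power falls inside the interval for lambda
suggested <- candidates[candidates >= lambda_ci[["lower"]] &
                        candidates <= lambda_ci[["upper"]]]
suggested
```

Any power inside the interval for lambda is compatible with the data, which is why a wide interval for the slope leaves several transformations on the table.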

(e) We transformed the sales price to a logarithmic scale. Using the transformed response, we conducted the bootstrap modified Levene test, constructed a normal QQ-plot of the standardized residuals from the fitted ANOVA model expressing the sales price (on a logarithmic scale) according to the number of bedrooms, and produced comparative boxplots. What conclusions can we draw from these plots and this test in terms of diagnostics for the one-factor ANOVA model with fixed effects?

real.estate$log.price<-log(real.estate$Sales.price)

# comparative boxplots

library(ggplot2)

ggplot(real.estate,
       aes(x = Bedrooms, y = log.price)) +
  theme_bw() +
  geom_boxplot(color = "dark grey") +
  geom_jitter(height = 0, width = 0.2) +
  labs(y = "Sales price (in log(dollars))", x = "Number of bedrooms")

[Figure: comparative boxplots of sales price (in log(dollars)) by number of bedrooms (0-2, 3, 4, 5-7)]

library(lawstat)

##
## Attaching package: 'lawstat'

## The following object is masked from 'package:car':
##
##     levene.test

set.seed(100)

with(real.estate, lawstat::levene.test(log.price, Bedrooms,
                                       bootstrap = TRUE, num.bootstrap = 1000))

##
##  bootstrap Modified robust Brown-Forsythe Levene-type test based on the
##  absolute deviations from the median
##
## data:  log.price
## Test Statistic = 2.054, p-value = 0.092

model<-lm(log.price~Bedrooms,data=real.estate)

qqPlot(rstandard(model))

[Figure: normal QQ-plot of the standardized residuals against norm quantiles]

## [1] 69 102

(f) Since ANOVA is robust against moderate deviations from normality when the sample size is large, we can safely continue the analysis with the ANOVA model. We fitted an ANOVA model to describe the sales price (in log(dollars)) according to the number of bedrooms, and displayed the corresponding ANOVA table. Is there significant evidence that the mean sales price (in log(dollars)) differs according to the number of bedrooms?

model<-lm(log.price~Bedrooms,data=real.estate)

anova(model)

## Analysis of Variance Table
##
## Response: log.price
##            Df Sum Sq Mean Sq F value    Pr(>F)
## Bedrooms    3 23.689  7.8964  55.732 < 2.2e-16 ***
## Residuals 518 73.394  0.1417
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(g) Refer to (f). Compute the coefficient of determination, and the observed value of Cohen's f.

(h) Alternatively, let's use a rank-based transformation to analyze the data. According to a rank-based ANOVA, is there significant evidence that the distribution of the sales price differs stochastically according to the number of bedrooms?

real.estate$ranked.Price<-rank(real.estate$Sales.price)

model <- lm(ranked.Price ~ Bedrooms, data = real.estate)

anova(model)

## Analysis of Variance Table
##
## Response: ranked.Price
##            Df  Sum Sq Mean Sq F value    Pr(>F)
## Bedrooms    3 3431489 1143830  70.358 < 2.2e-16 ***
## Residuals 518 8421216   16257
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(i) Refer to part (h). Compute the coefficient of determination for the ranked sales price.
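
Parts (g) and (i) both start from the sums of squares in a one-factor ANOVA table. As a minimal sketch, assuming the common definitions R² = SS_between / (SS_between + SS_error) and Cohen's f = sqrt(R² / (1 − R²)), a small helper could look like the following. The function name effect_sizes is hypothetical, and the two sums of squares are copied from the table in part (f).

```r
# Hypothetical helper: effect sizes from the sums of squares of a
# one-factor ANOVA table (assumes the usual R^2 and Cohen's f definitions)
effect_sizes <- function(ss_between, ss_error) {
  r2 <- ss_between / (ss_between + ss_error)  # coefficient of determination
  f  <- sqrt(r2 / (1 - r2))                   # Cohen's f
  c(R2 = r2, cohen_f = f)
}

# Sums of squares copied from the ANOVA table in part (f)
es <- effect_sizes(ss_between = 23.689, ss_error = 73.394)
round(es, 3)
```

The same helper applies to the rank-based table in part (h) by passing its two sums of squares instead.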

2. Consider a study with a completely randomized design with 4 treatments. Here are the summary statistics:

i n mean std. dev.

2.4

(a) Use the Tukey-Kramer method to compare the means of the groups pairwise. Identify which pairs are HSD with a FWER of 5%.

Hint: You will need to use either ptukey or qtukey to compute probabilities or quantiles. You will have 6 comparisons to perform. You can either use confidence intervals or compute p-values.

(b) Consider the 6 comparisons in part (a). Use the insert-absorb algorithm to combine the treatments that are non-HSD.
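
The Tukey-Kramer comparisons in part (a) can be sketched from summary statistics alone. The helper below is an illustrative sketch, not a prescribed solution: the function name tukey_kramer is hypothetical, and the numbers passed to it are made-up placeholders, NOT the values from the summary table of this question. It pools the group variances into the MSE, then uses ptukey and qtukey for the p-values and simultaneous confidence intervals.

```r
# Hypothetical helper: Tukey-Kramer pairwise comparisons from summary statistics
tukey_kramer <- function(n, means, sds, alpha = 0.05) {
  r   <- length(n)                     # number of treatments
  dfe <- sum(n) - r                    # error degrees of freedom
  mse <- sum((n - 1) * sds^2) / dfe    # pooled variance (MSE)
  pairs <- t(combn(r, 2))              # all r(r-1)/2 pairs of treatments
  i <- pairs[, 1]
  j <- pairs[, 2]
  diff <- means[i] - means[j]
  se   <- sqrt(mse / 2 * (1 / n[i] + 1 / n[j]))  # Tukey-Kramer standard error
  q    <- abs(diff) / se                         # studentized range statistic
  p    <- ptukey(q, nmeans = r, df = dfe, lower.tail = FALSE)
  half <- qtukey(1 - alpha, nmeans = r, df = dfe) * se
  data.frame(i, j, diff,
             lower = diff - half, upper = diff + half,
             p.value = p, HSD = p < alpha)
}

# Made-up placeholder statistics (NOT the data of this question)
res <- tukey_kramer(n     = c(5, 6, 5, 7),
                    means = c(10.2, 12.9, 15.1, 11.0),
                    sds   = c(2.1, 2.4, 1.9, 2.2))
res
```

A pair is flagged as HSD exactly when its simultaneous interval excludes zero, so the interval and p-value routes mentioned in the hint lead to the same decisions.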

Part Two

Please use R for all computations, and for building graphs in this part of the assignment. Note that we also want answers to some of these questions that do not involve R. R will only be used for the computations and to produce graphs. For some of these questions, the R output will not be sufficient. You will need to interpret, to describe, and give conclusions.

3. The data are in the file headache.csv. They are data from a completely randomized design to study the effects of 5 brands of pills for the relief of headaches. The pills were given to 25 subjects with a fever of 38°C or more. The response is the number of hours of relief provided by the pill. Suppose that a one-factor ANOVA model with fixed effects is appropriate for analyzing these data.

(a) Provide group statistics for the response according to the levels of the explanatory factor, i.e. mean, standard deviation and number of units per treatment.

(b) Is the study balanced? Why?

(c) Provide a graphical display of the data for this study. Based on the plot, give one or two sentences to describe the effects of the medication.

(d) Test for the significance of the treatment effects. Give your conclusion at α = 5%.

(e) Use Tukey's method for all pairwise comparisons of the treatment means. Use a FWER of α = 5%. Use the Insert-and-Absorb algorithm to group the treatments that are non-HSD, and give a table of group statistics with the corresponding labels. Which treatments does the analysis suggest are best?
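
The workflow of parts (a), (b), (d) and (e) can be sketched in base R as follows. This is only a sketch under stated assumptions: the column names brand and hours are hypothetical (the actual names in headache.csv are not given here), and the data frame is simulated as a stand-in so that the snippet runs on its own.

```r
set.seed(2022)

# Simulated stand-in for headache.csv; 'brand' and 'hours' are assumed names
headache <- data.frame(
  brand = factor(rep(paste0("B", 1:5), each = 5)),
  hours = round(rnorm(25, mean = rep(c(3, 4, 4.5, 6, 7), each = 5), sd = 0.8), 1)
)

# (a) group statistics: n, mean and standard deviation per brand
group_stats <- aggregate(hours ~ brand, data = headache,
                         FUN = function(x) c(n = length(x), mean = mean(x), sd = sd(x)))
group_stats

# (b) the design is balanced when all group sizes are equal
balanced <- length(unique(group_stats$hours[, "n"])) == 1

# (d) one-factor ANOVA F test for the treatment effects
fit <- aov(hours ~ brand, data = headache)
summary(fit)

# (e) Tukey's method for all pairwise comparisons, FWER of 5%
tukey <- TukeyHSD(fit, "brand", conf.level = 0.95)
tukey$brand
```

With the real data, the simulated data frame would be replaced by a read.csv call and the column names adjusted accordingly.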

4. In a completely randomized design to study the effect of the speed of winding thread (1:slow, 2:normal, 3:fast, 4:maximum) onto 75-yard spools, 16 runs of 10,000 spools each were made at each of the four winding speeds. The response variable is the number of thread breaks during the production run. The data is in the file WindingSpeeds.csv.

(a) Import the data and display the structure of the dataframe. Use the function recode from the car package to recode the values in column i, which is the explanatory variable. Here is a sample command. I am assuming that the name of the dataframe is data.

library(car)

data$i <- factor(recode(data$i, "1='Slow'; 2='Normal'; 3='Fast'; 4='Maximum'"))

Verify that i is now a factor and display its levels, which should be “Slow”, “Normal”, “Fast”, “Maximum”.

(b) Produce comparative boxplots to describe the number of thread breaks according to the winding speed. Based on the plots, is it reasonable to assume that the variance of the random error is constant?

(d) Fit a log-log model to describe the association between the group standard deviation and the group mean. Give a 95% confidence interval for the slope. Does it suggest that a log transformation of the response might be useful for stabilizing the variance?

(e) Apply a log transformation, i.e. use the function log in R on the response. Perform the modified Levene test on the transformed data at α = 5%. Did the log transformation stabilize the variance?

(f) Using the log-transformed response, construct a QQ-plot of the standardized residuals. What does the plot suggest?
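
For part (e), it may help to see what the modified Levene (Brown-Forsythe) test computes: without the bootstrap option, it amounts to a one-factor ANOVA F test on the absolute deviations of the response from its group median. The base R sketch below illustrates this on simulated data. The column names speed and breaks are hypothetical, and the values are NOT those of WindingSpeeds.csv.

```r
set.seed(1)

# Simulated stand-in data: 4 speeds x 16 runs (hypothetical names and values)
speeds <- c("Slow", "Normal", "Fast", "Maximum")
d <- data.frame(
  speed  = factor(rep(speeds, each = 16), levels = speeds),
  breaks = rpois(64, lambda = rep(c(4, 7, 11, 17), each = 16)) + 1  # +1 avoids log(0)
)

# Log-transformed response
d$log.breaks <- log(d$breaks)

# Absolute deviations from the group medians
d$absdev <- with(d, abs(log.breaks - ave(log.breaks, speed, FUN = median)))

# Modified Levene test = one-factor ANOVA on the absolute deviations
levene_tab <- anova(lm(absdev ~ speed, data = d))
levene_tab
```

A large p-value in this table gives no significant evidence against constant variance on the log scale.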