Assignment 3 2022
Instructions
1) Please submit your solutions to this assignment in one PDF file on Brightspace. Only one file will be accepted.
2) You can submit a PDF file more than once. However, only the last submission will be saved. If you want to modify your submitted assignment, that is fine as long as it is before the deadline.
3) Late submissions of the assignment are not going to be marked.
4) In the second part of the assignment, you must use R for all of your computations. Please use R markdown to write the solutions for this part.
5) You can submit handwritten solutions for part one of the assignment, but please combine images of your handwritten solutions with the PDF produced with R markdown as one PDF. (See https://imagetopdf.com/ as a possible solution to combine images as one PDF.)
6) Deadline: Before 11:59 pm on Tuesday, July 5
7) You can work in groups of up to four members.
Part one
You can provide hand-written solutions for this part, but it is not necessary. You are welcome to try to write your solutions with R markdown. For Part One, only use R to compute quantiles and probabilities from a t or F distribution.
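As a reminder of the only R calls needed in Part One, here is a minimal sketch of how quantiles and tail probabilities of a t or F distribution can be obtained. The degrees of freedom and the observed statistics below are made-up values for illustration; they are not taken from any question in this assignment.

```r
# Illustrative values only: 20 error degrees of freedom, 3 numerator df.

# Critical value for a two-sided t test at the 5% level
t_crit <- qt(0.975, df = 20)

# Two-sided p-value for a hypothetical observed t statistic of 2.5
p_t <- 2 * pt(abs(2.5), df = 20, lower.tail = FALSE)

# Critical value for an F test at the 5% level with (3, 20) degrees of freedom
f_crit <- qf(0.95, df1 = 3, df2 = 20)

# Upper-tail p-value for a hypothetical observed F statistic of 4.2
p_f <- pf(4.2, df1 = 3, df2 = 20, lower.tail = FALSE)

c(t_crit = t_crit, p_t = p_t, f_crit = f_crit, p_f = p_f)
```

The q-functions return quantiles and the p-functions return probabilities; setting lower.tail = FALSE gives the upper-tail area directly.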
1. A city tax assessor was interested in predicting residential home sales as a function of various characteristics of the home and surrounding property. Data on 522 transactions were obtained for home sales from the previous year. We will investigate the relationship between the sales price in dollars and the number of bedrooms in the house.
We imported the data and displayed the structure of the dataframe.
real.estate<-read.csv("RealEstate.csv")
str(real.estate)
## 'data.frame':    522 obs. of  3 variables:
##  $ Identification: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Sales.price   : int  360000 340000 250000 205500 275500 248000 229900 150000 195000 160000 ...
##  $ Bedrooms      : int  4 4 4 4 4 4 3 2 3 3 ...
(a) We fit a model that describes the sales price according to the number of bedrooms, and displayed the corresponding ANOVA table. We also displayed the estimated coefficients of the model. There is something wrong with this table. What did we forget to do? Discuss.
Hint: Look at the degrees of freedom for the Bedrooms factor.
model<-lm(Sales.price~Bedrooms,data=real.estate)
anova(model)
## Analysis of Variance Table
##
## Response: Sales.price
##            Df     Sum Sq    Mean Sq F value    Pr(>F)
## Bedrooms    1 1.6931e+12 1.6931e+12  107.14 < 2.2e-16 ***
## Residuals 520 8.2178e+12 1.5803e+10
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coefficients(model)
## (Intercept)    Bedrooms
##    82808.80    56200.08
(b) We coerced the Bedrooms variable to a factor to produce the plots. Is the study balanced?
real.estate$Bedrooms<-factor(real.estate$Bedrooms)
table(real.estate$Bedrooms)
##
##   0   1   2   3   4   5   6   7
##   1   9  64 202 179  52  12   3
(c) Since the frequencies are small in the extremes, i.e. for homes with very few or very many bedrooms, we combined homes with 0, 1, or 2 bedrooms into one category, and homes with 5, 6, or 7 bedrooms into one category.
library(car)
## Loading required package: carData
real.estate$Bedrooms<-with(real.estate,recode(Bedrooms, "c('0','1','2')='0-2'; c('5','6','7')='5-7'"))
table(real.estate$Bedrooms)
##
## 0-2 3 4 5-7
## 74 202 179 67
Here are comparative boxplots for the sales price of the home according to the number of bedrooms. Based on these plots, is it reasonable to assume homogeneity of variance? If not, do the plots suggest that we might be able to find a suitable variance-stabilizing transformation?
# comparative boxplots
library(ggplot2)
ggplot(real.estate,
       aes(x = Bedrooms, y = Sales.price)) +
  theme_bw() +
  geom_boxplot(color="dark grey") +
  geom_jitter(height=0, width=0.2) +
  labs(y = "Sales price (in dollars)", x="Number of bedrooms")
[Figure: comparative boxplots of sales price (in dollars) by number of bedrooms (0-2, 3, 4, 5-7)]
(d) We fitted a log-log model to describe the cell standard deviation as a function of the cell mean. Based on the 95% confidence interval for the slope of this log-log model, what variance-stabilization transformations are suggested?
m<-with(real.estate,tapply(Sales.price,Bedrooms,FUN=mean))
s<-with(real.estate,tapply(Sales.price,Bedrooms,FUN=sd))
model<-lm(log(s)~log(m))
# 95% CI for intercept and for the slope
confint(model)
##                  2.5 %    97.5 %
## (Intercept) -5.4004299 16.584404
## log(m)      -0.3896355  1.365977
(e) We transformed the sales price on a logarithmic scale. Using the transformed response, we conducted the bootstrap modified Levene test, and constructed a normal qq-plot for the standardized residuals from the fitted ANOVA expressing the sales price (on a logarithmic scale) according to the number of bedrooms, and also comparative boxplots. What conclusions can we draw from these plots and test in terms of diagnostics for the one factor ANOVA model with fixed effects?
real.estate$log.price<-log(real.estate$Sales.price)
# comparative boxplots
library(ggplot2)
ggplot(real.estate,
       aes(x = Bedrooms, y = log.price)) +
  theme_bw() +
  geom_boxplot(color="dark grey") +
  geom_jitter(height=0, width=0.2) +
  labs(y = "Sales price (in log(dollars))", x="Number of bedrooms")
[Figure: comparative boxplots of sales price (in log(dollars)) by number of bedrooms (0-2, 3, 4, 5-7)]
library(lawstat)
##
## Attaching package: ’lawstat’
## The following object is masked from ’package:car’:
##
##     levene.test
set.seed(100)
with(real.estate,lawstat::levene.test(log.price, Bedrooms,
     bootstrap = TRUE, num.bootstrap=1000))
##
##  bootstrap Modified robust Brown-Forsythe Levene-type test based on the
##  absolute deviations from the median
##
## data:  log.price
## Test Statistic = 2.054, p-value = 0.092
model<-lm(log.price~Bedrooms,data=real.estate)
qqPlot(rstandard(model))
[Figure: normal QQ-plot of the standardized residuals; horizontal axis: norm quantiles]
## [1] 69 102
(f) Since the ANOVA is robust against moderate deviations from normality when the sample size is large, we can safely continue the analysis with the ANOVA model. We fitted an ANOVA model to describe the sales price (in log(dollars)) according to the number of bedrooms, and displayed the corresponding ANOVA table. Is there significant evidence that the mean sales price (in log(dollars)) differs according to the number of bedrooms?
model<-lm(log.price~Bedrooms,data=real.estate)
anova(model)
## Analysis of Variance Table
##
## Response: log.price
##            Df Sum Sq Mean Sq F value    Pr(>F)
## Bedrooms    3 23.689  7.8964  55.732 < 2.2e-16 ***
## Residuals 518 73.394  0.1417
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(g) Refer to (f). Compute the coefficient of determination, and the observed value of Cohen's f.
(h) Alternatively, let's use a rank-based transformation to analyze the data. According to a rank-based ANOVA, is there significant evidence that the distribution of the sales price differs stochastically according to the number of bedrooms?
real.estate$ranked.Price<-rank(real.estate$Sales.price)
model<-lm(ranked.Price~Bedrooms,data=real.estate)
anova(model)
## Analysis of Variance Table
##
## Response: ranked.Price
##            Df  Sum Sq Mean Sq F value    Pr(>F)
## Bedrooms    3 3431489 1143830  70.358 < 2.2e-16 ***
## Residuals 518 8421216   16257
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(i) Refer to part (h). Compute the coefficient of determination for the ranked sales price.
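For parts (g) and (i), both effect sizes can be read off the two sums of squares in a one-factor ANOVA table. The following is a minimal sketch using the standard definitions (R² as the factor's share of the total sum of squares, and Cohen's f as the square root of R²/(1 − R²)). The function name effect_sizes and the sums of squares passed to it are hypothetical; they are deliberately not the values from the tables above.

```r
# Effect sizes from the two sums of squares of a one-factor ANOVA table.
# effect_sizes is a hypothetical helper written for this illustration.
effect_sizes <- function(ss_factor, ss_residual) {
  r2 <- ss_factor / (ss_factor + ss_residual)  # coefficient of determination
  f  <- sqrt(r2 / (1 - r2))                    # Cohen's f
  c(R2 = r2, cohen_f = f)
}

# Made-up sums of squares, NOT taken from this assignment
es <- effect_sizes(ss_factor = 20, ss_residual = 80)
es
```

Note that Cohen's f simplifies to the square root of the ratio of the factor sum of squares to the residual sum of squares, which is a quick way to check a hand computation.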
2. Consider a study with a completely random design with 4 treatments. Here are the summary statistics:
i n mean std. dev.
2.4
15
13
20
(a) Use the Tukey-Kramer method to compare the means of the groups pairwise. Identify which pairs are HSD with a FWER of 5%.
Hint: You will need to either use ptukey or qtukey to compute probabilities or quantiles. You will have 6
comparisons to perform. You can either use confidence intervals or compute p-values.
(b) Consider the 6 comparisons in part (a). Use the insert-absorb algorithm to combine the treatments that are non-HSD.
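To illustrate how ptukey and qtukey enter the Tukey-Kramer computations in part (a), here is a minimal sketch working from summary statistics only. The three groups and all of their numbers are hypothetical and chosen only to make the code runnable; the pooled variance plays the role of the mean squared error.

```r
# Hypothetical summary statistics for k = 3 groups (illustration only)
n    <- c(A = 5, B = 6, C = 4)
ybar <- c(A = 10.0, B = 12.5, C = 15.0)
s    <- c(A = 2.0, B = 2.5, C = 1.8)

k      <- length(n)
df_err <- sum(n) - k
mse    <- sum((n - 1) * s^2) / df_err   # pooled variance (MSE)

pairs <- t(combn(names(n), 2))          # one row per pair of groups
res <- data.frame(
  pair = paste(pairs[, 2], pairs[, 1], sep = " - "),
  diff = unname(ybar[pairs[, 2]] - ybar[pairs[, 1]]),
  se   = unname(sqrt(mse * (1 / n[pairs[, 1]] + 1 / n[pairs[, 2]])))
)

# Studentized range statistic and Tukey-Kramer adjusted p-value
res$q_obs <- abs(res$diff) / (res$se / sqrt(2))
res$p_adj <- ptukey(res$q_obs, nmeans = k, df = df_err, lower.tail = FALSE)

# Simultaneous 95% confidence limits for each difference
margin    <- qtukey(0.95, nmeans = k, df = df_err) / sqrt(2) * res$se
res$lower <- res$diff - margin
res$upper <- res$diff + margin
res$hsd   <- res$p_adj < 0.05           # TRUE when the pair is declared HSD
res
```

The p-value route and the confidence-interval route lead to the same decisions: a pair is HSD exactly when its interval excludes zero.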
Part Two
Please use R for all computations, and for building graphs in this part of the assignment. Note that we also want answers to some of these questions that do not involve R. R will only be used for the computations and to produce graphs. For some of these questions, the R output will not be sufficient. You will need to interpret, to describe, and to give conclusions.
3. The data are in the file headache.csv. They are data from a completely randomized design to study the effects of 5 brands of pills for the relief of headaches. The pills were given to 25 subjects with fever of 38°C or more. The response is the number of hours of relief provided by the pill. Suppose that a one-factor ANOVA model with fixed effects is appropriate for analyzing these data.
(a) Provide group statistics for the response according to the levels of the explanatory factor, i.e. mean, standard deviation and number of units per treatment.
(b) Is the study balanced? Why?
(c) Provide a graphical display to the data for this study. Based on the plot, give one or two sentences to describe the effects of the medication.
(d) Test for the significance of the treatment effects. Give your conclusion at α = 5%.
(e) Use Tukey's method for all pairwise comparisons of the treatment means. Use a FWER of α = 5%. Use the Insert-and-Absorb algorithm to group the treatments that are non-HSD, and give a table of group statistics with the corresponding labels. Which treatments does the analysis suggest are best?
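The workflow for question 3 can be sketched as follows. Since the column names in headache.csv are not stated here, the sketch uses PlantGrowth, a data set shipped with R, as a stand-in; it is not the assignment data. The last step is optional and assumes the multcompView package is installed; to our understanding, its multcompLetters function builds a letter display from the adjusted p-values in the spirit of the Insert-and-Absorb algorithm, but please check its documentation before relying on it.

```r
# PlantGrowth (built into R) stands in for headache.csv in this sketch
data(PlantGrowth)

# Group statistics: mean, standard deviation, and number of units per group
grp_mean <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)
grp_sd   <- tapply(PlantGrowth$weight, PlantGrowth$group, sd)
grp_n    <- tapply(PlantGrowth$weight, PlantGrowth$group, length)
stats    <- data.frame(mean = grp_mean, sd = grp_sd, n = grp_n)
stats

# One-factor ANOVA, then Tukey's method with a FWER of 5%
fit <- aov(weight ~ group, data = PlantGrowth)
summary(fit)
tuk <- TukeyHSD(fit, conf.level = 0.95)
tuk

# Adjusted p-values, one per pair, named "level2-level1"
p_adj <- tuk$group[, "p adj"]

# Optional letter display (assumes multcompView is installed)
if (requireNamespace("multcompView", quietly = TRUE)) {
  print(multcompView::multcompLetters(p_adj)$Letters)
}
```

Treatments that share a letter are not HSD from one another, so the letters can be added as a column to the table of group statistics.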
4. In a completely randomized design to study the effect of the speed of winding thread (1:slow, 2:normal, 3:fast, 4:maximum) onto 75-yard spools, 16 runs of 10,000 spools each were made at each of the four winding speeds. The response variable is the number of thread breaks during the production run. The data is in the file WindingSpeeds.csv.
(a) Import the data and display the structure of the dataframe. Use the function recode from the car package to recode the values in column i, which is the explanatory variable. Here is a sample command. I am assuming that the name of the dataframe is data.
library(car)
data$i<-factor(recode(data$i,"1='Slow'; 2='Normal';
                      3='Fast'; 4='Maximum'"))
Verify that i is now a factor and display its levels, which should be “Slow”, “Normal”, “Fast”, “Maximum”.
(b) Produce comparative boxplots to describe the number of thread breaks according to the winding speed.
Based on the plots, is it reasonable to assume that the variance of the random error is constant?
(c) Perform the modified Levene test at α = 5%. Does it support your findings from part (b)?
(d) Fit a log-log model to describe the association between the group standard deviation and the group mean. Give a 95% confidence interval for the slope. Does it suggest that a log transformation of the response might be useful for stabilizing the variance?
(e) Apply a log transformation, i.e. use the function log with R on the response. Perform the modified Levene test on the transformed data at α = 5%. Did the log transformation stabilize the variance?
(f) Using the log transformed response, construct a QQ-plot for standardized residuals. What does the plot suggest?
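For parts (c), (e), and (f), the following is a minimal base-R sketch of the modified Levene (Brown-Forsythe) test, computed as a one-factor ANOVA on the absolute deviations from the group medians, followed by a QQ-plot of standardized residuals. It is the plain version of the test, not the bootstrap version from lawstat used in question 1. The helper name bf_test is hypothetical, and the built-in warpbreaks data (breaks by tension level) stand in for WindingSpeeds.csv.

```r
# warpbreaks (built into R) stands in for WindingSpeeds.csv in this sketch
data(warpbreaks)

# Modified Levene (Brown-Forsythe) test: one-factor ANOVA on the absolute
# deviations of the response from its group median.
# bf_test is a hypothetical helper written for this illustration.
bf_test <- function(y, g) {
  d <- abs(y - ave(y, g, FUN = median))
  anova(lm(d ~ g))
}

# Test on the original scale and on the log scale
bf_raw <- bf_test(warpbreaks$breaks, warpbreaks$tension)
bf_log <- bf_test(log(warpbreaks$breaks), warpbreaks$tension)
bf_raw
bf_log

# Normal QQ-plot of the standardized residuals on the log scale
fit <- lm(log(breaks) ~ tension, data = warpbreaks)
qqnorm(rstandard(fit))
qqline(rstandard(fit))
```

Comparing the two p-values shows whether the evidence against equal variances weakens after the log transformation.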
2022-07-26