Data Analysis Skills: Practice Class Test Marking Scheme
Task 1. Report on Data Analysis
● Appropriate Title and Student Number 1 MARK
Please Note: the code chunks and the mathematical LaTeX code ($ and $$) have been included in Task 1 to show you how the output included in the report was generated. In the final .pdf file the code chunks and the code between $$ SHOULD NOT BE SHOWN for Task 1 (but should be shown for Task 2).
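One way to keep code out of the Task 1 .pdf is to switch off code echoing for the whole document in a setup chunk. This is a minimal sketch, assuming the report is knitted with knitr from an R Markdown file; the message and warning settings are optional extras, not a requirement of the marking scheme.

```r
# Placed in a setup chunk near the top of the .Rmd file.
# echo = FALSE hides the R code in the knitted output;
# message = FALSE and warning = FALSE suppress package start-up chatter.
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
```

If both tasks share one file, a Task 2 chunk can override the default by setting echo = TRUE in its own chunk header.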
library(tidyverse)
library(moderndive)
library(skimr)
library(kableExtra)
library(gridExtra)
cats <- read.csv("cats.csv")
Introduction
● Introduction to the data being analysed and to the question of interest. Marks deducted for copying the data description as given. 2 MARKS
Exploratory Data Analysis
● Summary statistics on heart weight by sex with appropriate comments. One mark removed if the output is simply ‘copy-pasted’ from R.
cats %>%
  group_by(Sex) %>%
  summarise(n = n(), Mean = round(mean(Hwt), digits = 1), St.Dev = round(sd(Hwt), digits = 1),
            Min = min(Hwt), Q1 = quantile(Hwt, 0.25), Median = median(Hwt),
            Q3 = quantile(Hwt, 0.75), Max = max(Hwt)) %>%
  kable(caption = '\\label{tab:summaries} Summary statistics on heart weight by sex of 144 adult cats.') %>%
  kable_styling(latex_options = "hold_position")
Table 1: Summary statistics on heart weight by sex of 144 adult cats.
Sex | n | Mean | St.Dev | Min | Q1 | Median | Q3 | Max
F | 47 | 9.2 | 1.4 | 6.3 | 8.35 | 9.1 | 10.1 | 13.0
M | 97 | 11.3 | 2.5 | 6.5 | 9.40 | 11.4 | 12.8 | 20.5
Alternatively, the summary table could be produced using the skimr package:
my_skim <- skim_with(base = sfl(n = length))
cats %>%
  group_by(Sex) %>%
  select(Hwt, Sex) %>%
  my_skim() %>%
  # IQR is the distance between the upper and lower quartiles (Q3 - Q1)
  transmute(Variable = skim_variable, Sex = Sex, n = n, Mean = numeric.mean, SD = numeric.sd,
            Min = numeric.p0, Q1 = numeric.p25, Median = numeric.p50, Q3 = numeric.p75,
            Max = numeric.p100, IQR = numeric.p75 - numeric.p25) %>%
  kable(caption = '\\label{tab:summary} Summary statistics on heart weight by sex. (produced using skimr package).',
        booktabs = TRUE, linesep = "", digits = 2) %>%
  kable_styling(font_size = 10, latex_options = "hold_position")
Table 2: Summary statistics on heart weight by sex. (produced using skimr package).
Variable | Sex | n | Mean | SD | Min | Q1 | Median | Q3 | Max | IQR
Hwt | F | 47 | 9.20 | 1.36 | 6.3 | 8.35 | 9.1 | 10.1 | 13.0 | 1.75
Hwt | M | 97 | 11.32 | 2.54 | 6.5 | 9.40 | 11.4 | 12.8 | 20.5 | 3.40
2 MARKS
● Comments on the summary statistics related to the question of interest. 1 MARK
● Boxplot of heart weight by sex. One mark removed if the plot is not appropriately labelled or the axis labels are not adjusted accordingly.
~~~{r boxplot, out.width = '68%', fig.align = "center", fig.cap = "\\label{fig:box} Heart weight by Sex.", fig.pos = 'H'}
ggplot(cats, aes(x = Sex, y = Hwt)) +
geom_boxplot() +
labs(x = "Sex", y = "Heart weight (grams)",
title = "Heart weights of 144 adult cats")
~~~
[Boxplot titled "Heart weights of 144 adult cats"; x-axis: Sex; y-axis: Heart weight (grams).]
Figure 1: Heart weight by Sex.
3 MARKS
● Comments on the boxplot related to the question of interest. 2 MARKS
Formal Data Analysis
● State the linear regression model being fitted, i.e.
$$\widehat{\mbox{Hwt}} = \widehat{\alpha} +
\widehat{\beta}_{\mbox{Male}} \cdot \mathbb{I}_{\mbox{Male}}(x) $$
where
● the intercept $\widehat{\alpha}$ is the mean heart weight for the baseline category of Females;
● $\widehat{\beta}_{\mbox{Male}}$ is the difference in mean heart weight of Males relative to the baseline category Females; and
● $\mathbb{I}_{\mbox{Male}}(x)$ is an indicator function such that
$$\mathbb{I}_{\mbox{Male}}(x)=\left\{
\begin{array}{ll}
1 & \mbox{if Sex of the} ~ x \mbox{th observation is Male},\\
0 & \mbox{otherwise}.
\end{array}
\right.$$
2 MARKS
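The indicator function in the model corresponds to the 0/1 dummy column that R builds for a two-level factor. The sketch below uses a hypothetical four-value toy vector (not the cats data) and base R's model.matrix() to show that encoding, with F as the baseline level.

```r
# Hypothetical toy data for illustration only (not taken from cats.csv).
toy <- data.frame(Sex = factor(c("F", "M", "M", "F"), levels = c("F", "M")))

# model.matrix() returns the design matrix that lm() would build for ~ Sex:
# an intercept column of 1s and a 0/1 indicator column named "SexM".
X <- model.matrix(~ Sex, data = toy)
print(X)

# The same indicator written out by hand: 1 if Male, 0 otherwise.
indicator_male <- as.integer(toy$Sex == "M")
```

Because F is the first factor level, it is absorbed into the intercept, which is why the intercept is read as the baseline (Female) mean.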
● Report the estimated model coefficients. One mark removed if the regression output is simply ‘copy-pasted’ from R.
model <- lm(Hwt ~ Sex, data = cats)
table_values <- get_regression_table(model)
table_values %>%
  dplyr::select(term, estimate, lower_ci, upper_ci, p_value) %>%
  # Note that it seems necessary to include dplyr:: here!!
  kable(caption = '\\label{tab:reg} Estimates of the parameters from the fitted linear regression model.',
        col.names = c("Term", "Estimate", "CI Lower Bound", "CI Upper Bound", "p value"),
        align = rep('c', 5)) %>%
  kable_styling(latex_options = 'HOLD_position')
Table 3: Estimates of the parameters from the fitted linear regression model.
Term | Estimate | CI Lower Bound | CI Upper Bound | p value
intercept | 9.202 | 8.560 | 9.845 | 0
Sex: M | 2.121 | 1.338 | 2.904 | 0
4 MARKS
● Appropriate comments on the regression coefficients and the difference between males and females. 4 MARKS
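As a check on the interpretation, the fitted group means can be recovered from the estimates reported in Table 3. This is a worked sketch using only those reported values:

$$\widehat{\mbox{Hwt}}_{\mbox{Female}} = \widehat{\alpha} = 9.202, \qquad
\widehat{\mbox{Hwt}}_{\mbox{Male}} = \widehat{\alpha} + \widehat{\beta}_{\mbox{Male}} = 9.202 + 2.121 = 11.323$$

These agree, up to rounding, with the sample means of 9.2 and 11.3 in Table 1, and the confidence interval for the difference (1.338, 2.904) does not contain zero.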
NB: THE DIAGNOSTICS IN THE REMAINDER OF THIS ANALYSIS SECTION SHOULD NOT BE INCLUDED IN THE CLASS TEST SINCE THESE PLOTS (MOSTLY) SUPPORT THE ASSUMPTIONS OF THE FITTED MODEL
● Plots for checking model assumptions.
~~~{r residplots, echo=FALSE, fig.width = 13, fig.align = "center", fig.cap = "\\label{fig:resids} Scatterplots of the residuals by Sex (left) and a histogram of the residuals (right).", fig.pos = 'H', message = FALSE}
regression.points <- get_regression_points(model)
p1 <- ggplot(regression.points, aes(x = Sex, y = residual)) +
geom_jitter(width = 0.1) +
labs(x = "Sex", y = "Residual") +
geom_hline(yintercept = 0, col = "blue")
p2 <- ggplot(regression.points, aes(x = residual)) +
geom_histogram(color = "white") +
labs(x = "Residual")
grid.arrange(p1, p2, ncol = 2)
~~~
[Left panel: residuals by Sex; x-axis: Sex; y-axis: Residual. Right panel: histogram of the residuals; x-axis: Residual.]
Figure 2: Scatterplots of the residuals by Sex (left) and a histogram of the residuals (right).
Conclusions
● Overall conclusions with an answer to the question of interest.
2 MARKS
● General report layout. This includes figure and table captions, labelling, positioning and quality of English.
2 MARKS
● DEDUCT 2 MARKS if R code appears in the Report in Task 1
Total: 25 MARKS
2022-02-08