Data Analysis Skills: Practice Class Test Marking Scheme


Task 1. Report on Data Analysis

● Appropriate Title and Student Number 1 MARK

Please Note: the code chunks and the mathematical LaTeX code ($ and $$) have been included in Task 1 to show you how the output included in the report was generated. In the final .pdf file, the code chunks and the code between $$ SHOULD NOT BE SHOWN for Task 1 (but should be shown for Task 2).
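A minimal sketch of one way to hide the code chunks for Task 1, assuming the report is written in R Markdown and knitted with knitr. Where this line is placed in the .Rmd file is an assumption, not part of the marking scheme:

```r
# Assumption: this sits in a chunk near the top of the Task 1 .Rmd file.
# It hides the source code of every later chunk while keeping the output.
knitr::opts_chunk$set(echo = FALSE)
```

Individual chunks can instead carry `echo = FALSE` in their header, as the residual-plot chunk later in this scheme does.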

library(tidyverse)
library(moderndive)
library(skimr)
library(kableExtra)
library(gridExtra)

cats <- read.csv("cats.csv")


Introduction

● Introduction to the data being analysed and to the question of interest. Marks deducted for copying the data description as given. 2 MARKS


Exploratory Data Analysis

● Summary statistics on heart weight by sex with appropriate comments. One mark removed if the output is simply ‘copy-pasted’ from R.

cats %>%
  group_by(Sex) %>%
  summarise(n = n(), Mean = round(mean(Hwt), digits = 1),
            St.Dev = round(sd(Hwt), digits = 1), Min = min(Hwt),
            Q1 = quantile(Hwt, 0.25), Median = median(Hwt),
            Q3 = quantile(Hwt, 0.75), Max = max(Hwt)) %>%
  kable(caption = '\\label{tab:summaries} Summary statistics on heart weight by sex of 144 adult cats.') %>%
  kable_styling(latex_options = "hold_position")


Table 1:   Summary statistics on heart weight by sex of 144 adult cats.

Sex    n   Mean  St.Dev  Min    Q1  Median    Q3   Max
F     47    9.2     1.4  6.3  8.35     9.1  10.1  13.0
M     97   11.3     2.5  6.5  9.40    11.4  12.8  20.5


Alternatively, the summary table could be produced using the skimr package:

my_skim <- skim_with(base = sfl(n = length))

cats %>%
  group_by(Sex) %>%
  select(Hwt, Sex) %>%
  my_skim() %>%
  transmute(Variable = skim_variable, Sex = Sex, n = n, Mean = numeric.mean, SD = numeric.sd,
            Min = numeric.p0, Q1 = numeric.p25, Median = numeric.p50, Q3 = numeric.p75,
            # The interquartile range is Q3 - Q1
            Max = numeric.p100, IQR = numeric.p75 - numeric.p25) %>%
  kable(caption = '\\label{tab:summary} Summary statistics on heart weight by sex. (produced using skimr package).',
        booktabs = TRUE, linesep = "", digits = 2) %>%
  kable_styling(font_size = 10, latex_options = "hold_position")


Table 2:   Summary statistics on heart weight by sex. (produced using skimr package).

Variable  Sex   n   Mean    SD  Min    Q1  Median    Q3   Max   IQR
Hwt       F    47   9.20  1.36  6.3  8.35     9.1  10.1  13.0  1.75
Hwt       M    97  11.32  2.54  6.5  9.40    11.4  12.8  20.5  3.40


2 MARKS

● Comments on the summary statistics related to the question of interest. 1 MARK

● Boxplot of heart weight by sex. One mark removed if the plot is not appropriately labelled or the axis labels are not adjusted accordingly.

~~~{r boxplot, out.width = '68%', fig.align = "center", fig.cap = "\\label{fig:box} Heart weight by Sex.", fig.pos = 'H'}
ggplot(cats, aes(x = Sex, y = Hwt)) +
  geom_boxplot() +
  labs(x = "Sex", y = "Heart weight (grams)",
       title = "Heart weights of 144 adult cats")
~~~

[Boxplot titled "Heart weights of 144 adult cats"; x-axis: Sex (F, M); y-axis: Heart weight (grams).]

Figure 1:   Heart weight by Sex.


3 MARKS

●  Comments on the boxplot related to the question of interest. 2 MARKS


Formal Data Analysis

●  State the linear regression model being fitted, i.e.

$$\widehat{\mbox{Hwt}} = \widehat{\alpha} + \widehat{\beta}_{\mbox{Male}} \cdot \mathbb{I}_{\mbox{Male}}(x)$$

where

● the intercept $\widehat{\alpha}$ is the mean heart weight for the baseline category of Females;

● $\widehat{\beta}_{\mbox{Male}}$ is the difference in the mean heart weight of Males relative to the baseline category Females; and

● $\mathbb{I}_{\mbox{Male}}(x)$ is an indicator function such that

$$\mathbb{I}_{\mbox{Male}}(x)=\left\{
\begin{array}{ll}
1 & \mbox{if Sex of} ~ x \mbox{th observation is Male},\\
0 & \mbox{otherwise}.\\
\end{array}
\right.$$


2 MARKS
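To see why the intercept and the indicator coefficient carry these interpretations, here is a minimal sketch on made-up data. The data frame `toy` and its values are hypothetical and are not the cats data; only base R is used.

```r
# Hypothetical toy data (NOT the cats data set): three females, three males.
toy <- data.frame(
  Sex = factor(c("F", "F", "F", "M", "M", "M")),
  Hwt = c(9.0, 9.5, 10.0, 11.0, 12.0, 13.0)
)

# Same model form as in the scheme: F is the baseline level, M gets an indicator.
fit <- lm(Hwt ~ Sex, data = toy)
alpha_hat <- unname(coef(fit)["(Intercept)"])
beta_hat  <- unname(coef(fit)["SexM"])

# Sample mean heart weight within each sex.
group_means <- tapply(toy$Hwt, toy$Sex, mean)

# The intercept equals the female mean; intercept + coefficient equals the male mean.
stopifnot(isTRUE(all.equal(alpha_hat, unname(group_means["F"]))))
stopifnot(isTRUE(all.equal(alpha_hat + beta_hat, unname(group_means["M"]))))

print(c(alpha_hat = alpha_hat, beta_hat = beta_hat))
print(group_means)
```

With a single two-level categorical explanatory variable, the least squares fit reproduces the two group means, which is why the coefficient on the indicator can be read as a difference in means.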

●  Report the estimated model coefficients.  One mark removed if the regression output is simply ‘copy-pasted’ from R.

model <- lm(Hwt ~ Sex, data = cats)
table_values <- get_regression_table(model)

table_values %>%
  # Note that it seems necessary to include dplyr:: here!!
  dplyr::select(term, estimate, lower_ci, upper_ci, p_value) %>%
  kable(caption = '\\label{tab:reg} Estimates of the parameters from the fitted linear regression model.',
        col.names = c("Term", "Estimate", "CI Lower Bound", "CI Upper Bound", "p value"),
        align = rep('c', 5)) %>%
  kable_styling(latex_options = 'HOLD_position')


Table 3:   Estimates of the parameters from the fitted linear regression model.

Term        Estimate  CI Lower Bound  CI Upper Bound  p value
intercept      9.202           8.560           9.845        0
Sex: M         2.121           1.338           2.904        0


4 MARKS

●  Appropriate comments on the regression coefficients and the difference between males and females. 4 MARKS
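As a worked check on how such comments could be framed (a sketch that uses only the estimates reported in Table 3), the fitted mean heart weights in grams are:

$$\widehat{\mbox{Hwt}}_{\mbox{Female}} = \widehat{\alpha} = 9.202, \qquad \widehat{\mbox{Hwt}}_{\mbox{Male}} = \widehat{\alpha} + \widehat{\beta}_{\mbox{Male}} = 9.202 + 2.121 = 11.323$$

After rounding, these agree with the sample means of 9.2 and 11.3 in Table 1. The confidence interval for $\widehat{\beta}_{\mbox{Male}}$, (1.338, 2.904), does not contain zero, which is consistent with male cats having a higher mean heart weight than female cats.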


NB: THE DIAGNOSTICS IN THE REMAINDER OF THIS ANALYSIS SECTION SHOULD NOT BE INCLUDED IN THE CLASS TEST SINCE THESE PLOTS (MOSTLY) SUPPORT THE ASSUMPTIONS OF THE FITTED MODEL

●  Plots for checking model assumptions.


~~~{r residplots, echo=FALSE, fig.width = 13, fig.align = "center", fig.cap = "\\label{fig:resids} Scatterplots of the residuals by Sex (left) and a histogram of the residuals (right).", fig.pos = 'H', message = FALSE}
regression.points <- get_regression_points(model)

p1 <- ggplot(regression.points, aes(x = Sex, y = residual)) +
  geom_jitter(width = 0.1) +
  labs(x = "Sex", y = "Residual") +
  geom_hline(yintercept = 0, col = "blue")

p2 <- ggplot(regression.points, aes(x = residual)) +
  geom_histogram(color = "white") +
  labs(x = "Residual")

grid.arrange(p1, p2, ncol = 2)
~~~


[Left panel: residuals (y-axis: Residual) against Sex (x-axis: F, M). Right panel: histogram of the residuals (x-axis: Residual).]

Figure 2:   Scatterplots of the residuals by Sex (left) and a histogram of the residuals (right).
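A minimal sketch of what these plots are checking, again on hypothetical toy data rather than the cats data. For a model with a single categorical explanatory variable, the residuals average to zero within each group by construction, so the plots are mainly informative about the spread of the residuals in each group and the shape of their distribution.

```r
# Hypothetical toy data (NOT the cats data set).
toy <- data.frame(
  Sex = factor(c("F", "F", "F", "M", "M", "M")),
  Hwt = c(9.0, 9.5, 10.0, 11.0, 12.0, 13.0)
)

fit <- lm(Hwt ~ Sex, data = toy)
res <- residuals(fit)

# Mean residual within each sex: zero (up to rounding error) by construction.
mean_by_sex <- tapply(res, toy$Sex, mean)

# Spread of the residuals within each sex: this is what the
# constant-variance assumption is about.
sd_by_sex <- tapply(res, toy$Sex, sd)

print(round(mean_by_sex, 10))
print(sd_by_sex)
```

In this toy example the spread differs between the two groups, which is the kind of pattern a residuals-by-group scatterplot would make visible.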


Conclusions

●  Overall conclusions with an answer to the question of interest.


2 MARKS

●  General report layout.   This includes figure and table captions, labelling, positioning and quality of English.


2 MARKS

●  DEDUCT 2 MARKS if R code appears in the report for Task 1

Total: 25 MARKS