闪电代写 -代写CS作业_CS代写_Finance代写_Economic代写_Statistics代写_代码代做_IT代写_加急帮助

Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

MATH6169 Flexible Regression

Coursework 2021/22

Part A [50 marks]

1. [20 marks] Here we use the data taken from Agresti (2013: 219) on graduate school applications to the 23 departments within the College of Liberal Arts and Sciences at the University of Florida during the 1997–1998 academic year. The dataset contains the department ID (department), the number of applications (napplied), and the number of students admitted (nadmitted) cross-classiﬁed by gender (female) and is available as STATA-ﬁle admissions.dta. The underlying research question is whether admission decisions are independent of gender.

a) Describe the logistic regression model (including the associated log-likelihood) precisely for this case taking the number admitted out of the number applied as response and gender and de- partment as ﬁxed eﬀects. Provide and discuss the results of your model ﬁt. What are the disadvantages of this model? [5 marks]

b) Now precisely describe the mixed eﬀects logistic regression model (including the associated log- likelihood) which takes gender as ﬁxed eﬀect and department as random eﬀect. Provide and discuss the results of your model ﬁt. What are the beneﬁts of the model in comparison to the model used in a)? Is the random department eﬀect required in the mixed model? Provide an evaluation table using AIC and BIC. [5 marks]

c) Now investigate whether there is a gender-department interaction. Clearly state the random eﬀects distribution. Assuming that a random intercept is required, ﬁt the model with random slope for gender with and without allowing correlation with the random intercept. Provide and discuss the results of your model ﬁt. Add this model to your model evaluation table. [8 marks]

d) Provide a discussion of your analysis including a graph of admitted female students and their model ﬁts versus department and derive some conclusions in relation to the underlying research question. [2 marks]

2. [30 marks] This part of the coursework will be using a longitudinal dataset from Diggle et al. (2002), consisting of weight measurements of 48 pigs on 9 successive weeks. Pigs are identiﬁed by the variable id and is available as STATA-ﬁle pigs.dta.

a) Explore the data by providing a spaghetti-plot (plot each pig in its weight over time). Give an interpretation of the plot. [5 marks]

b) Describe precisely the model which takes week as a continuous linear eﬀect and includes a random eﬀect for pig. Provide the results from the ﬁt of the model. You may assume that the response weight is normally distributed. [5 marks]

c) Now extend the model in b) and include a random slope for the week eﬀect. Give the precise mathematical formulation and provide the results of your analysis. There are three basic models here: only random intercept, independent random intercept and slope, correlated random inter- cept and slope. Provide a model evaluation table using the appropriate likelihood. [10 marks]

d) A diﬀerent approach uses pig and week as crossed random eﬀects. The formal model is weightij = β0 + β1weekij + ui + wj + eij

where ui is the random eﬀect for pig i and wj the random eﬀect for week j. We assume that

ui ~ N (0, σu(2)), wj ~ N (0, σw(2)), eij ~ N (0, σ2 )

and all independent. Provide a graphical sketch what this model means. Provide a model analysis and discuss your results. Does this model provide a better ﬁt than the random coeﬃcient model? [Hint: to do the ﬁt in STATA you need to leave the level ﬁeld empty and ﬁll in the crossed factor ﬁeld the relevant random eﬀect.] [10 marks]

Part B [50 marks]

A botanist has recently discovered a new ﬂowering plant species and is interested in its nectar production across diﬀerent regions. One question of interest is the relationship between temperature and the nectar production per month per ﬂower on average. There are three variables in the nectar.csv ﬁle:

● temp: the temperature of the region when the nectar production is measured.

● weight: the average amount (in mg) of nectar produced per month per plant on average.

● group: the grouping information to be later used for cross-validation.

1. [1 mark] Read the data in the nectar.csv ﬁle into R and create a scatterplot of temp (x-axis) versus weight (y-axis). Describe the relationship between these two variables.

2. [8 marks] This question concerns with ﬁtting a regression with a cubic spline using B-spline basis, where weight is the response and temp is the covariate:

(a) Use the bs() function to create B-splines, including the intercept, with knot positions at temp values of 7.5, 15, 22.5 and 30. [2 marks]

(b) Fit a cubic spline regression with the basis from part (a) and provide the summary output of this model. [2 marks]

(c) Create another scatterplot as in Question 1 and add the ﬁtted values (as a red line) of the cubic spline regression from part (b) to this plot. [1 mark]

(d) Use the predict() function to obtain the 95% conﬁdence intervals for the ﬁtted values. [2 marks]

(e) Add the conﬁdence intervals (as blue lines) to the plot in part (c). [1 mark]

3. [10 marks] Fit a regression with a cubic spline using the truncated power series basis, where weight is the response and temp is the covariate:

(a) Create the truncated power series basis with knot positions at temp values of 7.5, 15, 22.5 and 30. [5 marks]

(b) Fit a cubic spline regression with the basis from part (a) and provide the summary output of this model. [2 marks]

(c) How do the uncertainty associated with the coeﬃcient estimates in part (b) compare to those obtained in Question 2? Explain the diﬀerences you have observed. [3 marks]

4. [14 marks] Fit a regression with a natural cubic spline, where weight is the response and temp is the covariate.

(a) If we want to ﬁt a natural cubic spline that has the same degrees of freedom as the cubic spline model in Question 2, how many internal knots are there? [3 marks] (b) Use the ns() function to create the basis for a natural cubic spline, including the intercept, and set the boundary knots to 7 and 28 degrees C for temp. The number of internal knots is what you have calculated for part (a), and the internal knot positions divide the interval between the boundary knot values of temp equally. Also, assume that all the internal knots are not equal to

the boundary knots.

[2 marks] [2 marks]

(d) Create a scatterplot as in Question 1 and add the ﬁtted values of part (c) (as a red line) to this plot. [1 mark]

(e) Use the predict function to obtain the 95% conﬁdence interval for ﬁtted values of the natural

cubic spline obtained in part (d).

(f) Add the conﬁdence intervals from part (e) (as blue lines) to the plot in part (d). (g) Does the scatterplot in part (f) display any problems? Explain your answer.

[2 marks] [1 mark] [3 marks]

5. [7 marks] Fit smoothing splines using the smooth.spline() function. As above, weight is the response and temp is the covariate.

(a) Fit a smoothing spline with 5 eﬀective degrees of freedom (df) and print out the model. [1 mark]

(b) Fit a smoothing spline with eﬀective df = 10 and print out the model.

[1 mark] [1 mark]

(d) For each of the models in parts (a), (b) and (c), create a scatterplot as in Question 1 and add

the ﬁtted values (as a red line). [1 marks] (e) Examine and compare the ﬁtted values from parts (a), (b) and (c), and discuss any concerns you have observed. [3 marks]

6. [10 marks] Cross-validation (CV) method can be used for choosing the smoothing parameter of the smoothing spline model, but it can be more generally applied to other methods. In this question, we use CV to choose the number of internal knots in a natural cubic spline.

Set the boundary knots to temp values of 1 and 34. The candidate number of internal knots are integer values from 1 to 10 inclusive. The internal knots positions divide the interval between the boundary knots of temp into equal subintervals. We assume that all the internal knots are not equal to the

boundary knots. The group variable divides the observations into 10 equal subsets, each containing 25 observations.

Write your own R commands to perform 10-fold CV to calculate the overall CV error when ﬁtting a natural cubic spline with each of the candidate number of internal knots in turn. You must use the groupings deﬁned by group for the cross-validation. Present the overall CV errors for the 10 candidate number of internal knots, and also plot the overall CV errors (y-axis) against number of internal knots (x-axis).

IMPORTANT: The R codes required for this question must be reasonably easy to follow by the marker.