COURSEWORK 3 (OF 4) FOR MATH69531 GENERAL INSURANCE 2022/2023
1. For any two sequences of numbers $a_1, a_2, \dots, a_n$ and $b_1, b_2, \dots, b_n$ we denote
$$s_{ab} := \sum_{i=1}^{n} (a_i - \bar a)(b_i - \bar b) \in \mathbb{R}, \quad s_{aa} := \sum_{i=1}^{n} (a_i - \bar a)^2 \ge 0 \quad \text{and} \quad s_{bb} := \sum_{i=1}^{n} (b_i - \bar b)^2 \ge 0,$$
where as usual $\bar a$ and $\bar b$ denote the (sample) means. Provided that $s_{aa} > 0$ and $s_{bb} > 0$ (i.e. neither sequence consists of all identical numbers), the sample correlation coefficient is defined as
$$r_{ab} = \frac{s_{ab}}{\sqrt{s_{aa} s_{bb}}} \in [-1, 1].$$
You probably remember that $r_{ab}$ reflects how strong the linear relationship between the two sequences is. A value close to $-1$ or $1$ suggests a strong linear relationship, and the extreme case $|r_{ab}| = 1$ occurs for instance if $c \neq 0$ and $d$ exist so that $a_i = c b_i + d$ for all $i = 1, \dots, n$. On the other hand, a value close to 0 (resp. equal to 0) suggests hardly any (resp. no) linear relationship (this does not necessarily mean there can't be other types of relationships of course).
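As an informal illustration of these definitions, the following Python sketch computes $s_{ab}$ and $r_{ab}$ directly from the formulas. The function names and the small data sets are made up for this example; it uses only the standard library.

```python
from math import sqrt

def centred_sum(a, b):
    """Return s_ab = sum_i (a_i - mean(a)) * (b_i - mean(b))."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    a_bar = sum(a) / len(a)
    b_bar = sum(b) / len(b)
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))

def sample_correlation(a, b):
    """Return r_ab = s_ab / sqrt(s_aa * s_bb); requires s_aa > 0 and s_bb > 0."""
    s_aa = centred_sum(a, a)
    s_bb = centred_sum(b, b)
    if s_aa <= 0 or s_bb <= 0:
        raise ValueError("r_ab is undefined when a sequence is constant")
    return centred_sum(a, b) / sqrt(s_aa * s_bb)

# Illustrative data (made up): a is an exact linear function of b, with c = -2, d = 5.
b = [0.0, 1.0, 2.0, 3.0, 4.0]
a = [-2.0 * bi + 5.0 for bi in b]
r_linear = sample_correlation(a, b)      # exact linear relation with negative slope

# A symmetric quadratic relation: q clearly depends on b, yet not linearly.
q = [(bi - 2.0) ** 2 for bi in b]
r_quadratic = sample_correlation(q, b)

print(r_linear, r_quadratic)
```

Here `r_linear` sits at the extreme value $-1$ because the slope $c$ is negative, while `r_quadratic` is zero even though `q` is a deterministic function of `b`, which is the "other types of relationships" caveat in action.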
***
Now consider a set of $n$ points in $\mathbb{R}^3$ denoted
$$(z_1, w_1, y_1), (z_2, w_2, y_2), \dots, (z_n, w_n, y_n)$$
so that $s_{zz} > 0$, $s_{ww} > 0$ and $|r_{zw}| < 1$. We assume that these $n$ points are observations from a Linear Model of the form
$$Y_i = \beta_0 + \beta_1 z_i + \beta_2 w_i + \varepsilon_i \quad \text{for } i = 1, \dots, n, \tag{1}$$
where $\beta_0, \beta_1, \beta_2$ are unknown parameters and as usual the $\varepsilon_i$'s are iid zero mean random variables with common (unknown) variance $\sigma^2$.
Note that we may equivalently write (1) as
$$Y_i = \alpha + \beta_1 (z_i - \bar z) + \beta_2 (w_i - \bar w) + \varepsilon_i \quad \text{for } i = 1, \dots, n, \tag{2}$$
provided we set $\alpha := \beta_0 + \beta_1 \bar z + \beta_2 \bar w$. This is an equally valid Linear Model representation of the data, with unknown parameters $\alpha, \beta_1, \beta_2$ and predictors (in our standard notation) $x_{i1} = 1$, $x_{i2} = z_i - \bar z$ and $x_{i3} = w_i - \bar w$ for $i = 1, \dots, n$. Representation (2) turns out to be easier to work with and is hence recommended for answering the questions below.
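To see why the two representations agree, one can add and subtract the mean terms inside (1). The short derivation below is only a sketch of that step, writing $\bar z$ and $\bar w$ for the sample means of the $z_i$ and $w_i$:

```latex
\begin{align*}
Y_i &= \beta_0 + \beta_1 z_i + \beta_2 w_i + \varepsilon_i \\
    &= \bigl(\beta_0 + \beta_1 \bar z + \beta_2 \bar w\bigr)
       + \beta_1 (z_i - \bar z) + \beta_2 (w_i - \bar w) + \varepsilon_i \\
    &= \alpha + \beta_1 (z_i - \bar z) + \beta_2 (w_i - \bar w) + \varepsilon_i .
\end{align*}
```

Only the intercept changes between the two forms; the slopes $\beta_1$ and $\beta_2$ and the errors $\varepsilon_i$ are the same in both.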
(a) Write down the design matrix for the model (2).
[2 marks]
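(b) Denote by $\hat\beta_1$ and $\hat\beta_2$ the Least Squares Estimators of $\beta_1$ and $\beta_2$ respectively. Show that
$$\mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{s_{zz} - s_{zw}^2 / s_{ww}} \quad \text{and} \quad \mathrm{Var}(\hat\beta_2) = \frac{\sigma^2}{s_{ww} - s_{zw}^2 / s_{zz}}.$$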
[7 marks]
For convenience we give the model a bit more context: suppose that the response $y$ models the amount of time it takes a student to complete this coursework, and that the response depends on $z \in [0, 1]$ (the fraction of their time the student has spent working on this material during the past three weeks) and $w \in [0, 1]$ (a measure for how much they like statistics). Suppose that you are the person conducting this experiment, and that you have a very large pool of students with a large variety in $z$ and $w$ values available to make your $n$ measurements $(z_1, w_1, y_1), \dots, (z_n, w_n, y_n)$ from. (For clarity: the assumptions on the errors $\varepsilon_i$ listed under (1) remain in force.)
(c) Explain how the variances computed in part (b) show that it is not desirable to choose your students so that $r_{zw}$ is close to $-1$ or $1$.
[6 marks]
(d) Now give your best reason in words, without using (too much) maths and without using the variances computed in part (b), why it is not a good idea to choose your students so that $r_{zw}$ is close to $-1$ or $1$.
[4 marks]
2. On our course Blackboard page, in the folder All things coursework you can find the file Car data CW3 .txt. Download this file to your computer. It contains data related to 1000 car accidents collected by an insurance company for a specific type of car. For each accident the following has been recorded:
Name | Description
MarketValue | value of the car at the time of the accident
Speed | the speed of the car at the time of the accident
SeatbeltIndicator | is 0 if the driver did not have their seatbelt fastened and 1 if they did
DamageAmount | the total amount the insurance company had to pay out, for damage to the car and (possibly) personal injury of the driver
We are looking to fit a Linear Model to this data, where DamageAmount is the response and the other bits of recorded data can be used to form the predictors. We do this using R, cf. Chapter 11 in the notes.
You are asked to work through the following steps.
1. Write down a first (sensible) guess for a Linear Model for the given data.
2. Use R to fit the Linear Model you have chosen to the data.
3. Discuss what the (relevant) output from R tells you about how well (or badly) the Linear Model you have chosen fits the data.
4. Try to come up with a Linear Model that you expect to provide a better fit to the data. Explain why you think the new model will give a better fit. Then work through points 2. and 3. again for this new model, in 3. also explaining how your second model compares to your first model.
You are very welcome to try more than two models (i.e. to execute step 4. more than once) if you are enjoying yourself, but this is not required.
Note: your answer to this question is not required to contain the 'most perfect' model for this data (insofar as that exists!). Rather, full marks are given if you execute the above 4 steps fully, sensibly and correctly in your answer. In particular, you should include both the R code you produced and the (relevant) output, and you should provide the discussion/motivation that steps 3. and 4. ask for. As in previous courseworks, just handwritten working in which you refer to R code/R output printed on separate sheets (attached to your solutions of course) is perfectly fine.
[18 marks]
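The fit-and-compare loop in steps 2. to 4. can be sketched as follows. This is a hedged illustration only: the coursework itself is to be done in R, whereas this sketch uses plain Python. The data are synthetic (generated inside the script, not taken from the coursework file). The two candidate models, the helper names `solve` and `fit_ols`, and the way the synthetic response is generated are all assumptions made for this example and say nothing about the real data.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("design matrix is (numerically) rank deficient")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                   # back substitution
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit_ols(X, y):
    """Least squares fit via the normal equations; returns (beta_hat, R^2)."""
    p = len(X[0])
    XtX = [[sum(row[j] * row[k] for row in X) for k in range(p)] for j in range(p)]
    Xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)]
    beta = solve(XtX, Xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in X]
    y_bar = sum(y) / len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return beta, 1.0 - rss / tss

# Synthetic data (an assumption for illustration, NOT the coursework data).
rng = random.Random(0)
speed = [rng.uniform(20.0, 120.0) for _ in range(200)]
belt = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(200)]
damage = [500.0 + 0.3 * s ** 2 - 400.0 * b + rng.gauss(0.0, 100.0)
          for s, b in zip(speed, belt)]

# First guess: response linear in speed and the seatbelt indicator.
X_first = [[1.0, s, b] for s, b in zip(speed, belt)]
beta_first, r2_first = fit_ols(X_first, damage)

# Second model: replace speed by speed squared (still linear in the parameters).
X_second = [[1.0, s ** 2, b] for s, b in zip(speed, belt)]
beta_second, r2_second = fit_ols(X_second, damage)

print("R^2 first model: ", round(r2_first, 4))
print("R^2 second model:", round(r2_second, 4))
```

Comparing the two $R^2$ values is only one ingredient of the discussion asked for in step 3.; residual plots and the significance of the individual coefficients would normally be examined as well.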
2022-12-09