MATH6143/3085 – Data Analysis Project

Published: 2023-01-11

Instructions:

• This coursework is worth 30% of the overall mark for the module.

• Completed work should be uploaded (as a single PDF file) to Blackboard before 23:59 on Tuesday 10th January 2023 (UK time).

• There is a strict page limit of five pages of A4. This includes any title page. Any work beyond this limit will not be marked. Five pages is sufficient to achieve full marks. Text must not be smaller than 12pt font and margins must be no smaller than 2cm.

• It is permissible to include R plots in your submission. Other analysis done using R must be integrated into your text or presented in properly formatted tables – cutting and pasting verbatim sections of code and output from R is not acceptable (imagine you are writing a report for someone who is unfamiliar with R).

• These questions involve the modelling of real data. There is not necessarily a single “correct” answer. Careful explanation and clear presentation are important.

• All coursework must be carried out and written up independently. You are reminded of the University’s Academic Integrity Policy, see https://www.southampton.ac.uk/quality/assessment/academic_integrity.page.

• Data files can be downloaded from Blackboard.

1. (8 marks) The data in file lymphoma.csv represent survival times (in weeks) for 38 patients with lymphocytic non-Hodgkin's lymphoma. The patients have been classified into two groups, symptomatic and asymptomatic. Censoring information is recorded in the column cens, where cens=0 means the patient is censored and cens=1 otherwise.

(a) Compare survival in the two groups using Kaplan-Meier estimates of the survivor function. Please plot your survival functions.

(b) For each group, present a 95% confidence interval for the probability of 6-year survival.

(c) For the symptomatic group, present an estimate of the time beyond which survival probability is less than 30%.

(d) Plot the survivor functions for the two groups using Nelson-Aalen estimates. Are they similar to your Kaplan-Meier estimates? Why?

(e) What are the answers to part (b) under the Nelson-Aalen estimate?
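As a reminder of what the two estimators in this question compute, here is a minimal sketch in plain Python (the coursework itself expects R; Python is used only for illustration). The function name `km_and_na` and the toy data are hypothetical and are not taken from lymphoma.csv. The standard error uses Greenwood's formula, and the event coding matches `cens` (1 = death observed, 0 = censored).

```python
import math

def km_and_na(times, events):
    """Return (t, S_km, se_greenwood, H_na) at each distinct death time.

    times  : observed times (death or censoring)
    events : 1 if death observed, 0 if censored
    """
    death_times = sorted({t for t, e in zip(times, events) if e == 1})
    s_km, green, h_na = 1.0, 0.0, 0.0
    rows = []
    for t in death_times:
        # n = number at risk just before t, d = deaths observed at t
        n = sum(1 for u in times if u >= t)
        d = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        s_km *= 1.0 - d / n          # Kaplan-Meier product-limit step
        h_na += d / n                # Nelson-Aalen cumulative hazard step
        if n > d:
            green += d / (n * (n - d))
            se = s_km * math.sqrt(green)
        else:
            se = float("nan")        # Greenwood's sum is undefined once S reaches 0
        rows.append((t, s_km, se, h_na))
    return rows

# Toy data (hypothetical): six subjects, two of them censored
times = [2, 3, 3, 5, 8, 9]
events = [1, 1, 0, 1, 0, 1]
rows = km_and_na(times, events)
for t, s, se, h in rows:
    # exp(-H) is the survivor estimate implied by the Nelson-Aalen cumulative hazard
    print(t, round(s, 4), round(se, 4), round(math.exp(-h), 4))
```

Because exp(-x) ≥ 1 - x, the survivor estimate exp(-H) derived from Nelson-Aalen is never below the Kaplan-Meier estimate at the same time point. A plain 95% interval at a given time is S ± 1.96 × se; other interval types (for example, on the log scale) give slightly different limits.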

2. (10 marks) The data in file duck.txt represent survival times (in days) after radio-tagging for 50 female black ducks. Also recorded are an indicator of whether death was observed (1=observed, 0=censored) and three potential explanatory variables (age in years, weight in grams and length in cm).

(a) Investigate the dependence of survival on the explanatory variables using Weibull regression models.

(b) Is the Weibull distribution a reasonable model?

(c) Investigate the dependence of survival on the explanatory variables using Cox proportional hazards models.

(d) Plot (on the same figure) the estimated survivor function for a one year old duck with weight 1000 and length 250, under your preferred model in each of parts (a) and (c).
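One common informal check for part (b) rests on the fact that, if S(t) = exp(-(t/scale)^shape), then log(-log S(t)) = shape × log t - shape × log(scale), which is a straight line in log t. The sketch below (plain Python, illustration only; the helper names `weibull_plot_points` and `fit_line` are hypothetical) transforms a survivor curve to that scale and fits a least-squares line. It ignores covariates, and the curve used here is synthetic, not derived from duck.txt.

```python
import math

def weibull_plot_points(surv_curve):
    """Map (t, S(t)) pairs to (log t, log(-log S(t))), skipping S outside (0, 1)."""
    return [(math.log(t), math.log(-math.log(s)))
            for t, s in surv_curve if t > 0 and 0.0 < s < 1.0]

def fit_line(points):
    """Ordinary least-squares slope and intercept through (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic curve that is exactly Weibull with shape 1.5 and scale 10 (assumed values)
curve = [(t, math.exp(-((t / 10.0) ** 1.5))) for t in (1, 2, 5, 10, 20)]
slope, intercept = fit_line(weibull_plot_points(curve))
print(slope, intercept)  # slope estimates the shape; intercept is -shape * log(scale)
```

In practice the (t, S(t)) pairs would come from a Kaplan-Meier estimate, and clear curvature in the transformed points would suggest that a Weibull model fits poorly.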

3. (12 marks) The data in file mortality.csv represent numbers of deaths and central exposed to risk for male and female members of a large pension scheme, for age (at last birthday) x = 60, 61, ...

(a) Calculate the crude central mortality rates (mx) for male and female pensioners, and compare log mx for males and females, by plotting both sets of values on the same axes.

(b) Calculate the corresponding qx values under both (i) constant force of mortality within each year of age, and (ii) uniform distribution of deaths within each year of age. Hence, calculate a life table (with ℓ60 = 100,000) for males and for females, under both assumptions. [In your report, it is sufficient to give the values of ℓx at 5-year intervals, that is ℓ60, ℓ65, ℓ70, ...]

(c) Calculate the complete and curtate life expectancies for males and females at age 60.

(d) For both males and females, use a formal statistical test to compare the death rates in this insured population with the whole population of England and Wales (you will need to download ELT17 from the ONS website).

(e) For the male population only, use a Gompertz log-linear model to produce a set of graduated central mortality rates from the crude mortality data. Compare crude and graduated rates by plotting both sets of log mx values on the same axes.
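The standard relationships behind parts (a) to (c) can be sketched as follows (plain Python, illustration only). The crude rate is mx = deaths / central exposed to risk; under a constant force of mortality qx = 1 - exp(-mx); under a uniform distribution of deaths qx = mx / (1 + mx/2); and the life table follows from ℓ(x+1) = ℓx (1 - qx). All function names and the three rows of data are hypothetical and are not taken from mortality.csv.

```python
import math

def qx_constant_force(m):
    """q_x when the force of mortality is constant over the year of age."""
    return 1.0 - math.exp(-m)

def qx_udd(m):
    """q_x when deaths are uniformly distributed over the year of age."""
    return m / (1.0 + 0.5 * m)

def life_table(qs, radix=100_000.0):
    """Return [l_x0, l_x0+1, ...] built from successive q_x values."""
    ls = [radix]
    for q in qs:
        ls.append(ls[-1] * (1.0 - q))
    return ls

def curtate_expectation(ls):
    """Sum of l_{x+k} / l_x for k >= 1, truncated at the end of the table."""
    return sum(ls[1:]) / ls[0]

# Hypothetical deaths and central exposed to risk at three consecutive ages
deaths = [12, 15, 19]
exposure = [1000.0, 950.0, 900.0]
m = [d / e for d, e in zip(deaths, exposure)]   # crude central rates m_x

q_cf = [qx_constant_force(v) for v in m]
q_udd = [qx_udd(v) for v in m]
table_cf = life_table(q_cf)
table_udd = life_table(q_udd)
print([round(v, 1) for v in table_cf])
print([round(v, 1) for v in table_udd])
print(curtate_expectation(table_udd))
```

Because the table stops at the last age in the data, the sum above is a temporary (truncated) curtate expectation; how to treat ages beyond the data is a modelling choice left to the report. Under a uniform distribution of deaths, the complete expectation is usually approximated by adding one half to the curtate value.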