IRDR0004 Coursework (2022/2023)

Computer-Based R Exercise & Report

IRDR0004 Coursework Assessment

Instructions: You are required to submit a report answering all questions. The report must be submitted before 13:00 (UK time) on 6 September 2023. All written answers, including scripts, outputs displayed on the R Console (screenshots) and graphical plots, should be written and saved as a PDF file.

You are provided with datasets to which you must apply the given algorithms to personalise the records based on your UCL ID number, so that each student has a unique coursework.

Failure to comply with the above instructions, or any documentation missing from the submission (i.e., no separate script file/zip, failure to randomise your data), will lead to an automatic deduction of 20 marks.

The report must be submitted through the UCL Assessment system. Bear in mind that any detected plagiarism will result in zero marks for all involved, and disciplinary procedures will be followed as per UCL policy.

Please attempt to answer all questions. Throughout the coursework, you will be asked to generate/make data unique using your own ID number. Questions with higher difficulty yield more points.

FAQs

1.   What do I need to submit?

You must submit two documents: (1) Part 1: the full report in PDF format (.pdf) containing the analysis code, all written answers with their corresponding R console outputs (using screenshots) and plots; (2) Part 2: the R script file (.R) with the code used to answer the questions and generate all outputs.

2.   When asked to provide code for certain questions – must I write the full function or just type the test code we are supposed to use?

You must provide the code in full, including the optional arguments.

3.   What is the maximum word limit or a max number of words per question?

There is no word limit. For the statistical questions, when writing your answers (especially for null and alternative hypotheses, as well as interpretations), keep them as concise as possible and avoid waffling.

4.   Should I treat all questions as separate entities?

Yes, all major questions are self-contained and must therefore be treated separately. However, the sub-questions within a major question (e.g., Question 1 (major) and 1a, 1b and 1c (sub-questions)) must be answered carefully, as an answer given to a previous sub-question can lead to its follow-up. For instance, make sure to provide the correct answer to Question 2 (b), because any incorrect answer in 2 (b) can potentially lead to a follow-up error in 2 (c).

Question One

The Midlands region of England was struck hard by the COVID-19 pandemic. You are tasked with assessing the burden of COVID-19 in this area.

Instructions: Your full UCL student ID number represents the total number of inhabitants at risk of contracting COVID-19 in this region, and the last four digits of your UCL ID are the number of people who contracted the virus. Use this information to answer questions 1a and 1b.

Note: Your student ID number contains eight digits, and it should look similar to these examples: 18020105 or 19012500. Using 19012500 as a motivating example to explain the above instruction: 19012500 (full ID) represents the total number of people at risk, and 2500 (last four digits) is the number of COVID-19 cases.

If the last four digits of your ID begin with a zero (for instance, 0105 from 18020105), you can choose to use the last three (105) or five digits (20105) instead, to arrive at a number not starting with 0.
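The digit rule above can be sketched in R as follows. This is only an illustration under stated assumptions: the helper name id_to_counts and its n_digits argument are hypothetical and not part of the coursework, and the ID is passed as a character string so that no digits are lost.

```r
# Hypothetical helper (not part of the brief): split a UCL ID into the
# population at risk (full ID) and the number of cases (trailing digits).
id_to_counts <- function(id, n_digits = 4) {
  id_chr <- as.character(id)
  # Take the last n_digits characters of the ID
  cases_chr <- substr(id_chr, nchar(id_chr) - n_digits + 1, nchar(id_chr))
  # The brief asks for a case count that does not start with 0
  if (startsWith(cases_chr, "0")) {
    stop("Case count starts with 0; retry with n_digits = 3 or n_digits = 5.")
  }
  list(at_risk = as.numeric(id_chr), cases = as.numeric(cases_chr))
}

example <- id_to_counts("19012500")
example$at_risk  # 19012500
example$cases    # 2500

# For 18020105 the last four digits are 0105, so use three (or five) digits
alt <- id_to_counts("18020105", n_digits = 3)
alt$cases        # 105
```

Calling id_to_counts("18020105") with the default four digits stops with an error, which mirrors the rule that the case count must not begin with 0.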

a.   What is the incidence rate of COVID-19 (express your answer as cases per 1,000,000)? [1]

b.   Develop a new function (you can name the function yourself) in R that calculates the incidence rate of COVID-19. The function must express the result as cases per 1,000,000. [2]

A random sample of 15 asymptomatic COVID-19 cases from the Midlands was studied, and their viral load (expressed as log10 copies per mL) was determined through laboratory analysis of saliva. To obtain your viral load values, multiply each factor in the table below by the last two digits of your UCL ID number.

Case ID    Factor    Viral load (log10/mL)
1          0.07
2          0.41
3          0.73
4          0.28
5          0.25
6          0.34
7          0.39
8          0.26
9          0.16
10         0.33
11         0.30
12         0.66
13         0.56
14         0.17
15         0.48
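As a rough sketch of how the viral load column could be filled in, the factors listed above can be multiplied by the last two digits of an ID. The ID 18020105 is one of the example IDs from the note earlier in this question; the object names (factor_values, last_two, viral) are assumptions chosen for illustration only.

```r
# Factors for the 15 cases, as listed in the brief
factor_values <- c(0.07, 0.41, 0.73, 0.28, 0.25, 0.34, 0.39, 0.26,
                   0.16, 0.33, 0.30, 0.66, 0.56, 0.17, 0.48)

# Example ID from the brief, kept as a string so leading zeros survive
ucl_id <- "18020105"

# Last two digits of the ID ("05" becomes 5)
last_two <- as.numeric(substr(ucl_id, nchar(ucl_id) - 1, nchar(ucl_id)))

# Personalised viral load = factor x last two digits
viral <- data.frame(case_id = 1:15,
                    factor_value = factor_values,
                    viral_load = factor_values * last_two)

head(viral, 3)
```

With this example ID, the first case has a viral load of 0.07 x 5 = 0.35 log10 copies per mL; your own values will differ with your ID.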

c.   Calculate the summary statistics for the viral load (log10/mL) and provide an interpretation of these descriptive measures. [3]

d.   What are the best approaches for visualising the above data? Provide a justification for your answer and write R code to generate the appropriate plots. [4]

Question Two

A geographical study design was adopted for comparing the age distributions among people with asymptomatic COVID-19. A total of 30, 46 and 32 individuals were selected from East Midlands, West Midlands, and East of England, respectively.

a.   Use “Question_2.csv” to create a variable for answering questions 2b, 2c, 2d and 2e.

Use your full UCL ID number in the set.seed() function to begin creating a personalised column representing age, generated from the ‘Age’ variable in the Question_2.csv dataset. Generate random values from a uniform distribution using the function runif() with the parameters n = 111, min = 1 and max = 5, and then add these generated values to the values of the variable called “Age” to create the personalised ages. [1]
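The seeding and runif() steps described above might look roughly like the sketch below. Since “Question_2.csv” is not reproduced here, a made-up stand-in data frame with 111 rows is used in place of read.csv(); only the column name “Age” comes from the brief, and the names q2, noise and Age_personalised are assumptions.

```r
# Example ID from the brief; replace with your own full UCL ID
ucl_id <- 19012500

# Stand-in for read.csv("Question_2.csv"): invented ages, for illustration only
q2 <- data.frame(Age = rep(c(25, 40, 60), length.out = 111))

# Seed the random number generator so the personalised values are reproducible
set.seed(ucl_id)

# 111 random values drawn uniformly between 1 and 5
noise <- runif(n = 111, min = 1, max = 5)

# Add the random values to the original ages
q2$Age_personalised <- q2$Age + noise

summary(q2$Age_personalised)
```

Re-running the script with the same seed reproduces the same personalised ages, which is why set.seed() must be called immediately before runif().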

b.   State the appropriate hypothesis for comparing the distributions across the three regions. [1]

c.   What is the best methodology for testing this hypothesis? State the correct statistical test and provide a justification for choosing it. [4]

d.   Write down the correct R code to compute the statistical test and p-value. [2]

e.   Are there any differences across the three regions, and what conclusion can you draw from this analysis? [2]

10 marks

Question Three

100 farmers from villages in Malawi have been monitoring the seasonal trends in levels of food production (in million tonnes) from May to December 2020, a period when dryness can be severe. Lessons have been learnt from the food shortage disaster in 2018, when they were hit hard and had very low quantities of food production. Measures were put in place in 2020 to protect their farms against a food shortage disaster similar to the one experienced in 2018.

The levels of food production from farmers (in million tonnes) are stored in “Question_3.csv”; if you multiply them by the last two digits of your UCL ID, the values become standardised.

Assess whether these measures have worked well.

a.   What is the hypothesis for determining whether there is any difference between the levels of food produced in 2018 and 2020? [2]

b.   Write the code for personalising the dataset, briefly discuss some of the issues with the records in “Question_3.csv” and suggest what can be done to mitigate such issues. Apply the appropriate data cleaning to derive the desired format to answer 3c accordingly. [10]

c.   Use the most appropriate statistical method for testing the hypothesis in 3a and provide justifications for selecting the method. Write out the full R script for analysing the data and performing the statistical test. [5]

d.   What conclusions can you arrive at with regard to this cohort of farmers? Provide a full interpretation. [3]

20 marks

Question Four

An England-wide campaign was launched to target residential gardens to bring contamination levels of arsenic below the acceptable limits. The campaign treated garden soils in 7,201 rural and urban locations and documented its impact. Soil arsenic contamination levels post campaign were measured, as well as the treatment index (which is a measure used to represent the aggressiveness of the treatment campaign at those locations).

The dataset ‘Question_4.csv’ contains the following independent variables: Location type (categorical with 0 = “Rural” and 1 = “Urban”) and Treatment Index (continuous). The dependent variable, soil arsenic concentration, is an estimation which must be corrected before answering the question.

To apply the correction, use the following steps:

●   Use your full UCL ID number in the set.seed() function to ensure your data are reproducible and personalised

●   Create a personalised column using a normal distribution with n = 7,201, mean = 0 and standard deviation = 1.5 using the rnorm() function

●   Replace the “soilAs_estimates” variable with the sum of the personalised normal column and the original “soilAs_estimates” variable
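The three correction steps above could be sketched along the following lines. Because ‘Question_4.csv’ is not included here, the sketch uses an invented stand-in data frame instead of read.csv(); only the column name “soilAs_estimates” is taken from the brief, and q4, n_rows and correction are hypothetical names.

```r
# Example ID from the brief; replace with your own full UCL ID
ucl_id <- 19012500
n_rows <- 7201

# Stand-in for read.csv("Question_4.csv"): constant made-up values,
# used only to show the mechanics of the correction
q4 <- data.frame(soilAs_estimates = rep(10, n_rows))

# Step 1: seed the generator so the correction is reproducible
set.seed(ucl_id)

# Step 2: personalised column from a normal distribution
correction <- rnorm(n = n_rows, mean = 0, sd = 1.5)

# Step 3: replace the original variable with original + personalised values
q4$soilAs_estimates <- q4$soilAs_estimates + correction

summary(q4$soilAs_estimates)
```

Note that rnorm() takes n without a thousands separator (7201, not 7,201), and that the seed should be set once, directly before the random draw.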

a.   Personalise the dataset based on the instructions given above, then write the code to fit a multivariable linear regression model in R using soil arsenic as the dependent variable and location type and treatment index as the independent variables.

Show the FULL results for the model output and include the 95% confidence intervals. Provide a screenshot of the output. [6]

b.   Provide a FULL interpretation of the regression coefficients for the location type and treatment index variables. Include the 95% confidence intervals and state whether each relationship is statistically significant or not. [10]

c.   Construct the multivariable linear regression model. What are the predicted levels of soil arsenic in gardens in urban locations when the treatment index score is 30? [4]

d.   In your opinion - is this a good, poor or an invalid model? Justify your answer [5]

Question Five

One rural location in the East Midlands caught your interest as the soil estimates for arsenic are extremely volatile, and so the treatment campaign was redone, but at a much higher resolution with postcodes of gardens, to examine the impact of the treatment index on the changes in soil concentrations. Use the data “Question_5.csv”.

To personalise your data:

●   Use your full UCL ID number in the set.seed() function

●   Create a personalised column from a normal distribution with n = 506, mean = 0 and standard deviation = 0.1053 using the rnorm() function, then update the personalised column by adding the personalised data to the existing “soilAs_change” column in the data “Question_5.csv”.

a.   Create the personalised column and describe its overall relationship with treatment index. Is there anything peculiar about these two variables? [5]

b.   Use a univariable regression model to assess the relationship between soil arsenic and treatment index. (Hint: consider whether you need data transformation and give justifications.) Also, use the appropriate parameters to construct a regression model. [15]

c.   Provide the appropriate interpretation for the regression parameter for treatment index. [5]

d.   Use a non-linear regression model that includes a quadratic term and compare the model performance with the model in 5b.

In your opinion, which model performed better? Justify your answer. [10]

35 marks

Question Six

Select the study design accordingly to answer this question. There are broadly 4 different study design types listed as Pilot, Ecological, Cross-sectional, and Longitudinal.

0 – 1 = Pilot

2 – 3 = Ecological study

4 – 6 = Cross-sectional study

7 – 9 = Longitudinal study

Instructions: Use your UCL student ID number to select two study designs to answer 6a. Using this ID number (18020155) as a motivating example: the fourth and sixth digits should each fall in one of the defined ranges for the different study design types. For instance, the fourth digit in the above ID is ‘2’, so select Ecological study. The sixth digit is ‘1’, so select Pilot study.

a.   Use the fourth and sixth digits of your UCL ID number to select two study design types and discuss five differences between them (if both digits give the same study design, move to the next digit). Construct a table to contrast the selected study design types. [5]

b.   Use the seventh digit of your UCL ID number to select a study design. Write a short proposal of 250 words outlining a quantitative study that explores the following topic:

“Impact of surface water floods and risk of physical injuries in rural communities near large water bodies (i.e., rivers and lakes).” [10]

c.   Discuss five problems that can arise from the type of study in question 6b. [10]
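The digit-to-design lookup can be illustrated with a short R sketch, using the example ID 18020155 from the instructions above. The function name design_for_digit is hypothetical; findInterval() is used here simply as one convenient way to map a digit onto the four ranges.

```r
# Hypothetical helper: map a single digit (0-9) to a study design
# using the ranges 0-1, 2-3, 4-6 and 7-9 given in the brief
design_for_digit <- function(d) {
  designs <- c("Pilot", "Ecological", "Cross-sectional", "Longitudinal")
  # Lower bounds of each range; findInterval returns the range index
  designs[findInterval(d, c(0, 2, 4, 7))]
}

# Example ID from the brief, split into its individual digits
ucl_id <- "18020155"
digits <- as.integer(strsplit(ucl_id, "")[[1]])

design_for_digit(digits[4])  # "Ecological"
design_for_digit(digits[6])  # "Pilot"
```

The same lookup applied to the seventh digit gives the design for 6b.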

25 marks