IRDR0004 Coursework (2022/2023)
Computer-Based R Exercise & Report
IRDR0004 Coursework Assessment
Instructions: You are required to submit a report answering all questions. The report must be submitted at any time before 13:00 (UK time) on September 6, 2023. All written answers to the questions, including scripts, outputs displayed on the R Console (screenshots) and graphical plots, should be written up and saved as a PDF file.
You are provided with datasets to which you must apply the stated algorithms to personalise the records based on your UCL ID number, so that each student has a unique coursework.
Failure to comply with the above instructions, or any documentation missing from the submission (i.e., the script as a separate file/zip, or failing to randomise your data), will lead to an automatic deduction of 20 marks.
The report must be submitted through the UCL Assessment system. Bear in mind that any detected plagiarism will result in zero marks for all involved, and disciplinary procedures will be followed as per UCL policy.
Please attempt to answer all questions. Throughout the coursework, you will be asked to generate/make data unique using your own ID number. Questions with higher difficulty yield more points.
FAQs
1. What do I need to submit?
You must submit two documents: (1) Part 1: the full report in PDF format (.pdf) containing the analysis code and all written answers with their corresponding R console outputs (as screenshots) and plots; (2) Part 2: the R script file (.R) with the code used to answer the questions and generate all outputs.
2. When asked to provide code for certain questions, must I write the full function or just type the test code we are supposed to use?
You must provide the code in full, including the optional arguments.
3. What is the maximum word limit or a max number of words per question?
There is no word limit. For the statistical questions, keep your written answers as concise as possible (especially the null and alternative hypotheses and the interpretations) and avoid waffling.
4. Should I treat all questions as separate entities?
Yes, all major questions are self-contained and must therefore be treated separately. However, the sub-questions within a major question (e.g., 1a, 1b and 1c within Question 1) must be answered carefully, as the answer given to one sub-question can feed into the next. For instance, make sure to provide the correct answer to Question 2 (b), because an incorrect answer in 2 (b) can lead to a follow-up error in 2 (c).
Question One
The Midlands region of England was struck hard by the COVID-19 pandemic. You are tasked with assessing the burden of COVID-19 in this area.
Instructions: Your full UCL student ID number represents the total number of inhabitants at risk of contracting COVID-19 in this region, and the last 4 digits of your UCL ID are the number of people who contracted the virus. Use this information to answer questions 1a and 1b.
Note: Your student ID number contains eight digits and should look something like these examples: 18020105 or 19012500. Using 19012500 as a motivating example to explain the above instruction: 19012500 (full ID) represents the total number of people at risk, and 2500 (last four digits) is the number of COVID-19 cases.
If the last four digits of your ID begin with a zero (for instance, 0105 from 18020105), you can choose to use the last three digits (105) or the last five digits (20105) instead to arrive at a number that does not start with 0.
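The digit rules above can be sketched in R as follows. This is only an illustrative sketch: the function name id_counts and its fallback_digits argument are hypothetical, the ID is assumed to be supplied as an eight-character string, and the case where the fallback digits also begin with zero is not covered by the instructions and is not handled here.

```r
# Hypothetical helper: split a UCL ID into the population at risk and the
# number of cases, following the note above.
id_counts <- function(ucl_id, fallback_digits = 3) {
  # The ID is passed as a string so that individual digits can be extracted.
  stopifnot(is.character(ucl_id), nchar(ucl_id) == 8,
            fallback_digits %in% c(3, 5))
  last_n <- function(n) substr(ucl_id, nchar(ucl_id) - n + 1, nchar(ucl_id))

  digits <- last_n(4)
  # If the last four digits begin with zero, use the last three or five instead.
  if (startsWith(digits, "0")) {
    digits <- last_n(fallback_digits)
  }

  list(at_risk = as.numeric(ucl_id), cases = as.numeric(digits))
}

id_counts("19012500")                       # at_risk 19012500, cases 2500
id_counts("18020105")                       # cases 105 (last three digits)
id_counts("18020105", fallback_digits = 5)  # cases 20105 (last five digits)
```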
a. What is the incidence rate of COVID-19 (express your answer as cases per 1,000,000)? [1]
b. Develop a new function in R (you can choose the function's name yourself) that calculates the incidence rate of COVID-19. The function must express the result as cases per 1,000,000. [2]
A random sample of 15 individuals from the Midlands with asymptomatic COVID-19 was studied, and their viral loads (expressed as log10 copies per mL) were determined through laboratory analysis of saliva. To obtain your personalised viral load values, multiply the factor variable by the last two digits of your UCL ID number.
Case ID | Factor | Viral load (log10/mL)
1 | 0.07 |
2 | 0.41 |
3 | 0.73 |
4 | 0.28 |
5 | 0.25 |
6 | 0.34 |
7 | 0.39 |
8 | 0.26 |
9 | 0.16 |
10 | 0.33 |
11 | 0.30 |
12 | 0.66 |
13 | 0.56 |
14 | 0.17 |
15 | 0.48 |
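Filling in the empty viral load column amounts to multiplying each factor by the last two digits of the ID. A minimal sketch, assuming the example ID 18020155 (so the multiplier is 55); the object names are illustrative only:

```r
# Factor values copied from the table above.
factor_values <- c(0.07, 0.41, 0.73, 0.28, 0.25, 0.34, 0.39, 0.26,
                   0.16, 0.33, 0.30, 0.66, 0.56, 0.17, 0.48)

# Hypothetical ID used for illustration; replace with your own.
ucl_id <- "18020155"
last_two <- as.numeric(substr(ucl_id, nchar(ucl_id) - 1, nchar(ucl_id)))

# Personalised viral load = factor x last two digits of the ID.
viral <- data.frame(
  case_id    = 1:15,
  factor     = factor_values,
  viral_load = factor_values * last_two
)

head(viral)
```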
c. Calculate the summary statistics for the viral load (log10/mL) and provide an interpretation of these descriptive measures. [3]
d. What are the best approaches for visualising the above data? Provide a justification for your answer and write R code to generate the appropriate plots. [4]
Question Two
A geographical study design was adopted for comparing the age distributions among people with asymptomatic COVID-19. Samples of 30, 46 and 32 individuals were selected from the East Midlands, West Midlands and East of England, respectively.
a. Use “Question_2.csv” to create the variable needed to answer questions 2b, 2c, 2d and 2e.
Use your full UCL ID number in the set.seed() function to begin creating a personalised column of ages derived from the ‘Age’ variable in the Question_2.csv dataset. Generate random values from a uniform distribution using the function runif() with the parameters n = 111, min = 1 and max = 5, and then add the generated values to the variable called “Age” to create the personalised ages. [1]
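The seeding and uniform-noise step can be illustrated on a toy data frame. This is a sketch under stated assumptions: the ID 19012500 is the example from Question 1, the five ages are made up to stand in for the real file, and the new column name Age_personalised is hypothetical. With the real dataset you would load “Question_2.csv” with read.csv() and use n = 111 as instructed.

```r
# Hypothetical ID; seeding makes the random draw reproducible.
set.seed(19012500)

# Toy stand-in for read.csv("Question_2.csv"); values are invented.
q2 <- data.frame(Age = c(34, 51, 27, 45, 62))

# One uniform value between 1 and 5 per record
# (the coursework specifies n = 111 for the real data).
age_noise <- runif(n = nrow(q2), min = 1, max = 5)

# Personalised age = original age + generated value.
q2$Age_personalised <- q2$Age + age_noise

q2
```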
b. State the appropriate hypothesis for comparing the age distributions across the three regions. [1]
c. What is the best methodology for testing this hypothesis? State the correct statistical test and provide a justification for choosing it. [4]
d. Write down the correct R code to compute the statistical test and p-value. [2]
e. Are there any differences across the three regions, and what conclusion can you draw from this analysis? [2]
10 marks
Question Three
One hundred farmers from villages in Malawi have been monitoring the seasonal trends in levels of food production (in million tonnes) from May to December 2020, a period when dryness can be severe. Lessons were learnt from the food shortage disaster in 2018, when the farmers were hit hard and had very low quantities of food production. Measures were put in place in 2020 to protect their farms against a food shortage disaster similar to the one experienced in 2018.
The levels of food production from farmers (in million tonnes) are stored in “Question_3.csv”. Multiplying them by the last two digits of your UCL ID standardises the values.
Assess whether these measures have worked well.
a. What is the hypothesis for determining whether there is any difference between the levels of food produced in 2018 and 2020? [2]
b. Write the code for personalising the dataset, briefly discuss some of the issues with the records in “Question_3.csv” and suggest what can be done to mitigate such issues. Apply the appropriate data cleaning to derive the format needed to answer 3c. [10]
c. Use the most appropriate statistical method for testing the hypothesis in 3a and provide justifications for selecting the method. Write out the full R script for analysing the data and performing the statistical test. [5]
d. What conclusions can you draw with regard to this cohort of farmers? Provide a full interpretation. [3]
20 marks
Question Four
An England-wide campaign was launched to target residential gardens and bring arsenic contamination levels below the acceptable limits. Garden soils were treated in 7,201 rural and urban locations, and the campaign’s impact was documented. Soil arsenic contamination levels were measured after the campaign, as was the treatment index (a measure used to represent the aggressiveness of the treatment campaign at those locations).
The dataset ‘Question_4.csv’ contains the following independent variables: Location type (categorical with 0 = “Rural” and 1 = “Urban”) and Treatment Index (continuous). The dependent variable, soil arsenic concentration, is an estimate that must be corrected before answering the question.
To apply the correction, use the following steps:
● Use your full UCL ID number in the set.seed() function to ensure your data are reproducible and personalised
● Create a personalised column using a normal distribution with n = 7,201, mean = 0 and standard deviation = 1.5 using the rnorm() function
● Replace the “soilAs_estimates” variable with the sum of the personalised normal column and the original “soilAs_estimates” variable.
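The three correction steps can be sketched on a toy data frame. Assumptions are labelled here and in the comments: the ID 19012500 is only an example, the ten values are invented to stand in for ‘Question_4.csv’, and the helper column name correction is hypothetical. With the real data the draw would use n = 7,201 as instructed.

```r
# Step 1: seed with the full ID (hypothetical example ID).
set.seed(19012500)

# Toy stand-in for read.csv("Question_4.csv"); values are invented.
q4 <- data.frame(soilAs_estimates = c(12.1, 9.8, 15.3, 11.0, 8.7,
                                      13.4, 10.2, 14.9, 9.1, 12.6))
original <- q4$soilAs_estimates  # kept only to show the change below

# Step 2: personalised normal column, mean 0 and standard deviation 1.5
# (n = 7,201 for the real dataset).
q4$correction <- rnorm(n = nrow(q4), mean = 0, sd = 1.5)

# Step 3: replace the estimates with original + personalised column.
q4$soilAs_estimates <- q4$soilAs_estimates + q4$correction

# The difference from the original equals the generated column.
all.equal(q4$soilAs_estimates - original, q4$correction)
```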
a. Personalise the dataset based on the instructions given above, then write the code to fit a multivariable linear regression model in R with soil arsenic as the dependent variable and location type and treatment index as the independent variables.
Show the FULL results for model output and include the 95% confidence intervals. Provide a screenshot of the output. [6]
b. Provide a FULL interpretation of the regression coefficients for the location type and treatment index variables. Include the 95% confidence intervals and state whether each relationship is statistically significant. [10]
c. Construct the multivariable linear regression model. What are the predicted levels of soil arsenic in gardens in urban locations when the treatment index score is 30? [4]
d. In your opinion, is this a good, poor or invalid model? Justify your answer. [5]
Question Five
One rural location in the East Midlands caught your interest because its soil arsenic estimates are extremely volatile. The treatment campaign was therefore repeated at a much higher resolution, using the postcodes of gardens, to examine the impact of the treatment index on the changes in soil concentrations. Use the data in “Question_5.csv”.
To personalize your data:
● Use your full UCL ID number in the set.seed() function
● Create a personalised column from a normal distribution with n = 506, mean = 0 and standard deviation = 0.1053 using the rnorm() function, then update the personalised column by adding it to the existing “soilAs_change” column in “Question_5.csv”.
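One way to package these two steps is a small helper that seeds and adds the normal noise in one call, which also makes it easy to confirm that the same ID always yields the same column. This is a sketch only: the function name personalise, the output column name, the example ID and the six toy values are all hypothetical, and the real data would have 506 rows.

```r
# Hypothetical helper: seed with the ID, then add normal noise to a column.
personalise <- function(values, ucl_id, sd) {
  set.seed(ucl_id)
  values + rnorm(n = length(values), mean = 0, sd = sd)
}

# Toy stand-in for read.csv("Question_5.csv"); values are invented.
q5 <- data.frame(soilAs_change = c(-0.42, 0.10, 0.03, -0.15, 0.27, -0.08))

q5$soilAs_change_personalised <- personalise(q5$soilAs_change,
                                             ucl_id = 19012500,
                                             sd = 0.1053)

# Re-running with the same ID reproduces the column exactly.
identical(q5$soilAs_change_personalised,
          personalise(q5$soilAs_change, ucl_id = 19012500, sd = 0.1053))
```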
a. Create the personalised column and describe its overall relationship with treatment index. Is there anything peculiar about these two variables? [5]
b. Use a univariable regression model to assess the relationship between soil arsenic and treatment index. (Hint: consider whether you need data transformation and give justifications). Also, use the appropriate parameters to construct a regression model [15]
c. Provide the appropriate interpretation of the regression parameter for treatment index. [5]
d. Use a non-linear regression model that includes a quadratic term and compare its performance with the model in 5b.
In your opinion, which model performed better? Justify your answer. [10]
35 marks
Question Six
Select the study design accordingly to answer this question. There are broadly 4 different study design types listed as Pilot, Ecological, Cross-sectional, and Longitudinal.
0 – 1 = Pilot
2 – 3 = Ecological study
4 – 6 = Cross-sectional study
7 – 9 = Longitudinal study
Instructions: Use your UCL student ID number to select two study designs to answer 6a. Using the ID number 18020155 as a motivating example: the fourth and sixth digits should each fall in one of the ranges defined above for the different study design types. For instance, the fourth digit in the above ID is ‘2’, so select Ecological study. The sixth digit is ‘1’, so select Pilot study.
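The digit-to-design lookup can be sketched in R. The function name design_for_digit is hypothetical; the ranges and the example ID 18020155 are taken from the instructions above.

```r
# Hypothetical helper: map a single digit (0-9) to a study design
# using the ranges 0-1, 2-3, 4-6 and 7-9.
design_for_digit <- function(digit) {
  stopifnot(length(digit) == 1, digit %in% 0:9)
  designs <- c("Pilot", "Ecological", "Cross-sectional", "Longitudinal")
  # findInterval() returns 1 for 0-1, 2 for 2-3, 3 for 4-6 and 4 for 7-9.
  designs[findInterval(digit, c(0, 2, 4, 7))]
}

# Example ID from the instructions, split into individual digits.
ucl_id <- "18020155"
digits <- as.integer(strsplit(ucl_id, "")[[1]])

design_for_digit(digits[4])  # fourth digit is 2: "Ecological"
design_for_digit(digits[6])  # sixth digit is 1: "Pilot"
```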
a. Use the fourth and sixth digits of your UCL ID number to select two study design types and discuss five differences between them (if both digits give the same study design, move to the next digit). Construct a table to contrast the selected study design types. [5]
b. Use the seventh digit of your UCL ID number to select a study design. Write a short proposal of 250 words outlining a quantitative study that explores the following topic:
“Impact of surface water floods and risk of physical injuries in rural communities near large water bodies (i.e., rivers and lakes).” [10]
c. Discuss five problems that can arise from the type of study selected in question 6b. [10]
25 marks
2023-09-04