
FAQs

1. What do I need to submit?

 You must submit two documents: (1) Part 1: the full report in PDF format (.pdf) containing the analysis code, all written answers with their corresponding R console outputs (as screenshots) and plots; (2) Part 2: the R script file (.R) with the code used to answer the questions and generate all outputs.

2. When asked to provide code for certain questions, must I write the full function or just the test code we are supposed to use?

 You must provide the code in full, including the optional arguments.

3. Is there a maximum word limit, or a maximum number of words per question?

 There is no word limit. For the statistical questions, keep your written answers (especially the null and alternative hypotheses and the interpretations) as concise as possible and avoid waffling.

4. Should I treat all questions as separate entities?

 Yes, all major questions are self-contained and must therefore be treated separately. However, the sub-questions within a major question (e.g., 1a, 1b and 1c within Question 1) must be answered carefully, because the answer to one sub-question can feed into the next. For instance, make sure to provide the correct answer to Question 2 (b), because an incorrect answer there can lead to a follow-up error in 2 (c).

Question 1

The main trigger for food shortages and low yields for farmers in the Southern region of Africa is drought (i.e., prolonged periods of aridity or low rainfall). You are tasked with assessing the burden of food shortages among farmers and its relationship to malnutrition in these areas.

Instructions: Your full UCL student ID number represents the total number of produce farmers specialised in the fruit, vegetable and animal husbandry markets across the Southern region of Africa, and the last 4 digits of your UCL ID represent the number of produce farmers experiencing very low yield. Use this information to answer questions 1a and 1b.

Note: Your student ID number contains eight digits, and it should look something akin to these examples: 18020105 or 19012500. Using 19012500 as a motivating example to explain the above instruction: 19012500 (full ID) will represent the total number of produce farmers, and 2500 (last four digits) is the number of produce farmers with very low yield.

If the last four digits of your ID begin with a zero (for instance, 0105 from 18020105), you can use the last three digits (105) or the last five digits (20105) instead, so that the number does not start with 0.
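As a quick illustration of the digit rule above, integer remainders in R pull out the trailing digits of an ID. This is only a sketch using the two example IDs from the note; the variable names are made up for illustration.

```r
# Example IDs from the note above (not real student IDs)
id_a <- 19012500
id_b <- 18020105

# The remainder after dividing by 10^k keeps the last k digits as a number
last4_a <- id_a %% 10000    # 2500
last4_b <- id_b %% 10000    # 105: "0105" read as a number, same as the last three digits
last5_b <- id_b %% 100000   # 20105: the five-digit alternative

print(c(last4_a, last4_b, last5_b))
```

Because a leading zero disappears when the digits are treated as a number, the remainder for 18020105 already equals the three-digit option.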

a. What is the prevalence of food shortage in the Southern region of Africa (expressed as %)? [1]

b. Develop a new function in R (you may choose its name) that calculates the prevalence of food shortage. The function must express the result as a percentage. [2]

Fifteen randomly sampled farms were studied to assess the quantity of produce from low-yield farmers. Obtain each farm's quantity by multiplying its factor value by the last three digits of your UCL ID number.

Farm ID   Factor   Quantity of food (in million tonnes)
1         0.07
2         0.41
3         0.73
4         0.28
5         0.25
6         0.34
7         0.39
8         0.26
9         0.16
10        0.33
11        0.30
12        0.66
13        0.56
14        0.17
15        0.48

c. Calculate the summary statistics for the quantity of food produced by these farmers (in million tonnes) and provide the interpretation of these descriptive measures. [3]

d. What are the best approaches for visualising the above data? Justify your answer and write R code to generate the appropriate plots. [4]

 10 marks

Question 2

A geographical study design was used to compare the overall quantity of food production (in million tonnes) from 30, 46 and 32 farmers selected from Zambia, Zimbabwe and Malawi, respectively.

a. Use “Question_2.csv” to create the variable needed to answer questions 2b, 2c, 2d and 2e.

Use your full UCL ID number in the set.seed() function, then create a personalised column representing the quantity of food, derived from the ‘sample’ variable in the Question_2.csv dataset. Generate random values from a uniform distribution using runif() with the parameters n = 111, min = 1 and max = 5, then subtract the generated values from the variable called “Sample” to create personalised values for the quantity of food. [1]
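The mechanics of this step might look like the sketch below. It runs on a mock data frame rather than the real file, so the values, the `Country` column and the name `Quantity` for the new column are assumptions made purely for illustration; 19012500 is the example ID used earlier.

```r
# Mock stand-in for Question_2.csv: the real file's columns and values may differ
q2 <- data.frame(
  Country = rep(c("Zambia", "Zimbabwe", "Malawi"), length.out = 111),
  Sample  = seq(10, 65, length.out = 111)    # placeholder values only
)

set.seed(19012500)                           # example ID; use your own
offsets <- runif(n = 111, min = 1, max = 5)  # one uniform draw per row

# Subtract the generated values from Sample to get the personalised quantity
q2$Quantity <- q2$Sample - offsets

head(q2)
```

With the real data, the mock data frame would be replaced by something like `read.csv("Question_2.csv")`, checking the exact spelling of the column name, since R is case sensitive.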

b. State the appropriate hypothesis for comparing differences in the distributions of food quantity across the three countries. [1]

c. What is the best methodology for testing this hypothesis? State the correct statistical test and provide a justification for choosing it. [4]

d. Write down the correct R code to compute the statistical test and p-value. [2]

e. Are there any differences across the three countries, and what conclusion can you draw from this analysis? [2]

 10 marks

Question 3

One hundred patients from villages in the Southwest were admitted to a hospital in Plymouth with arsenic (heavy metal) poisoning associated with prolonged exposure to low environmental levels of arsenic in soils and drinking water.

Upon admission, their condition was critical because abnormal levels of arsenic (mg/ml) were detected in their blood samples. The patients were cared for on the spot and monitored round the clock until their condition stabilised after a week. Blood samples were taken every 3 hours to track whether the toxic load was falling, as a sign of recovery.

The lab readings from the toxicological analysis are stored in “Question_3.csv”. Multiplying them by the last two digits of your UCL ID standardises the values.

On a patient level, you want to assess whether these patients are recovering well.

a. What is the hypothesis for determining whether patients are making a recovery? [2]

b. Write the code for personalising the dataset, briefly discuss some of the issues with the records in “Question_3.csv” and suggest what can be done to mitigate them. Apply the appropriate data cleaning to derive the format needed for answering 3c. [10]

c. Use the most appropriate method for testing the hypothesis in 3a and justify your choice. Write out the full R script for analysing the data and performing the statistical test. [5]

d. What conclusions can you draw about this cohort of patients? Provide a full interpretation. [3]

 20 marks

Question 4

What were the impacts of the lockdown tier system and broader levels of deprivation on the employment index for 7,201 Middle Super Output Areas (MSOA) in England during the COVID-19 pandemic?

The dataset ‘Question_4.csv’ contains the following independent variables: Tiers (categorical with 1 = “low risk” and 2 = “high risk”) and Deprivation (continuous). The dependent variable, employment, is an estimate that must be corrected before answering the question.

To apply this correction, use the following steps:

● Use your full UCL ID number in the set.seed() function to ensure your data are reproducible and personalised

● Create a personalised column using a normal distribution with n = 7,201, mean = 0 and standard deviation = 1.5 using the rnorm() function

● Replace the “estimate_employment” variable with the sum of the personalised normal column and the original “estimate_employment” variable
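The three steps above can be sketched as follows. The data frame here is a mock with placeholder values, not the contents of Question_4.csv, and the names `q4`, `original` and `personal` are assumptions for illustration; 19012500 is the example ID used earlier.

```r
n_msoa <- 7201

# Mock stand-in for Question_4.csv; only the column being corrected is mocked
q4 <- data.frame(estimate_employment = rep(50, n_msoa))  # placeholder values
original <- q4$estimate_employment

set.seed(19012500)                                 # example ID; use your own
personal <- rnorm(n = n_msoa, mean = 0, sd = 1.5)  # personalised normal column

# Replace the estimate with original + personalised values
q4$estimate_employment <- original + personal

summary(q4$estimate_employment)
```

Calling set.seed() immediately before rnorm() matters: any other random draw made in between would change the values generated.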

a. Personalise the dataset based on the instructions above, then write the code to fit a multivariable linear regression model in R with the employment index as the dependent variable and deprivation and tier as the independent variables.

Show the FULL model output, including the 95% confidence intervals. Provide a screenshot of the output. [6]

b. Provide a FULL interpretation of the regression coefficients for the deprivation and tier variables, including the 95% confidence intervals and whether each relationship is statistically significant. [10]

c. Construct the multivariable linear regression model. What is the predicted employment in a high-risk tier area with an average deprivation score of 30? [4]

d. In your opinion, is this a good, poor or invalid model? Justify your answer. [5]

 25 marks

Question 5

Councils in the East Midlands released high-resolution postcode data on the deprivation index and employment estimates for 506 postcodes. Use the data in “Question_5.csv” to assess the impact of deprivation on employment.

To personalize your data:

● Use your full UCL ID number in the set.seed() function

● Create a personalised column from a normal distribution with n = 506, mean = 0 and standard deviation = 0.1053 using the rnorm() function, then add these values to the existing “Estimated_Employment” column in “Question_5.csv” to obtain the personalised column.
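To see why the seed makes the data both reproducible and personal, the sketch below draws the noise twice with the same seed and once with a different one. The helper `draw_noise()` and the mock data frame are hypothetical, and both seeds are the example IDs quoted earlier in this brief.

```r
# Hypothetical helper: reseed, then draw the noise described in the steps above
draw_noise <- function(seed, n = 506, sd = 0.1053) {
  set.seed(seed)
  rnorm(n, mean = 0, sd = sd)
}

first  <- draw_noise(19012500)   # example ID
second <- draw_noise(19012500)   # same seed again
other  <- draw_noise(18020105)   # a different example ID

same_seed_matches <- identical(first, second)  # same seed, same draws
diff_seed_matches <- identical(first, other)   # different seed, different draws

# Mock stand-in for the "Estimated_Employment" column (placeholder values)
q5 <- data.frame(Estimated_Employment = rep(0.5, 506))
q5$Estimated_Employment <- q5$Estimated_Employment + first
```

Rerunning the script with the same ID therefore regenerates exactly the same personalised column, which is what allows the outputs in the report to be checked against the submitted script.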

a. Create the personalised column and describe its overall relationship with deprivation. Is there anything peculiar about these two variables? [5]

b. Use a univariable regression model to assess the relationship between employment and deprivation (Hint: consider whether you need a data transformation and justify your choice). Also, use the appropriate parameters to construct the regression model. [15]

c. Provide the appropriate interpretation of the regression parameter for deprivation. [5]

d. Use a non-linear regression model that includes a quadratic term and compare its performance with the model in 5b.

In your opinion, which model performed better? Justify your answer. [10]

 35 marks

Question 6

Select the study design accordingly to answer this question. There are broadly four study design types: Pilot, Ecological, Cross-sectional and Longitudinal.

0 – 1 = Pilot

2 – 3 = Ecological study

4 – 6 = Cross-sectional study

7 – 9 = Longitudinal study

Instructions: Use your UCL student ID number to select two study designs to answer 6a. Taking the ID number 18020155 as a motivating example, the fourth and sixth digits each fall in one of the ranges defined above. The fourth digit is ‘2’, so select Ecological study; the sixth digit is ‘1’, so select Pilot study.
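The digit-to-design lookup can be expressed as a small R function, shown here as a sketch only. The helper name `design_for_digit()` is made up for illustration, and the ID is the motivating example from the instructions.

```r
# Hypothetical helper mapping one ID digit (0-9) to a study design
design_for_digit <- function(d) {
  if (d <= 1) {
    "Pilot"
  } else if (d <= 3) {
    "Ecological"
  } else if (d <= 6) {
    "Cross-sectional"
  } else {
    "Longitudinal"
  }
}

# Split the example ID into its individual digits
id_digits <- as.integer(strsplit("18020155", "")[[1]])

fourth <- design_for_digit(id_digits[4])  # digit 2
sixth  <- design_for_digit(id_digits[6])  # digit 1
print(c(fourth, sixth))
```

The sketch does not handle the case where both digits map to the same design; that rule would still need to be applied by hand.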

a. Use the fourth and sixth digits of your UCL ID number to select two study design types and discuss five differences between them (if both digits give the same study design, move to the next digit). Construct a table to contrast the selected study design types. [5]

b. Use the seventh digit of your UCL ID number to select a study design. Write a short proposal (250 words) outlining a quantitative study that explores the following topic:

 “Impact of low-level earth tremors on incident damages to household structures in rural communities in earthquake-prone zones in India.” [10]

c. Discuss five problems that can arise from the type of study chosen in 6b. [10]

 25 marks