
Institute of Epidemiology and Health Care

IEHC0080 – Data analysis for Population Health 2023-24 Assessment

This assessment covers the following learning outcomes:

Knowledge: selecting and using correct statistical methods; analysis of simple datasets using both categorical and continuous variables.

Skills: data handling and use of statistical software (Stata); using appropriate tests, figures and summaries of results; presenting the findings in a clear way; interpreting statistical results; linking findings to specific research questions.

Before starting the task, please read all the information carefully.

Instructions for completing and submitting the IEHC0080 assessment

This component accounts for 90% of your final module mark. Read the guidelines below to avoid losing marks unnecessarily.

§ The assessment is due by midday on 12 January 2024. Please follow the submission guidelines, which are available on the Moodle page for this module.

§ The word limit is 2,000 words in total, excluding the front cover and the questions, but including tables. Part E (see below) does not count towards the word limit.

§ To enable anonymous marking, please put your candidate number (NOT your name or your Student ID) on the front cover of the essay and use it as the name of the file.

§ This is an assessed piece of coursework for the IEHC0080 module; collaboration and/or discussion of any section of this assessment with anyone else is strictly prohibited.

§ Academic integrity (including plagiarism and collusion) is taken extremely seriously and can disqualify you from the module or course (for details of what constitutes plagiarism see http://www.ucl.ac.uk/current-students/guidelines/plagiarism).  If you are in doubt about any of this, please ask the tutors.

§ Requests for clarification of the questions should be posted on the Moodle Forum. As this is an assessed piece of work, you should not email or ask the module tutors or personal tutors how to answer the questions.

Guidelines for completing and submitting the IEHC0080 assessment

§ The assessment comprises four sections (Parts A-D) and one appendix (Part E). Complete all parts of each section.

§ You need to submit: (1) a written report (Parts A-D) including tables and figures, and (2) an appended (e.g. copied and pasted) do-file (Part E) that shows all syntax needed to answer the questions in Parts A-D. Your do-file accounts for 15% of the mark. Ensure your do-file is error-free.

§ Where appropriate, answers should be written in complete sentences, describing the methodology, analytical approach, results and conclusions.   

§ You do not need to cite or discuss any relevant literature, but you need to name the statistical methods and concepts used in the analysis, such as “t-test” or “Pearson’s correlation”.

§ You should discuss the interpretation of your results and how they relate to the questions you are asked. You do not need to give an exact definition of each statistical analysis; however, you need to explain why you used a particular method for a particular part of this assessment.

§ In your essay, you should include up to 3 tables and up to 3 graphs alongside your written answers. You should NOT present Stata code in the main body of your report. Stata code should appear only in Part E of your report.

DOs and DON’Ts

· DON’T include raw variable names in the text or tables

· DON’T include unedited log files or screenshots of the output in this report. You will lose marks if you do.

· DON’T report p = 0.000, even though your Stata output may show this value. Report exact p-values where appropriate.

· DO structure your essay in the same order as the assessment questions

· DO use Stata to answer the questions.

· DO use 2 decimal places when reporting odds ratios or risk ratios, a maximum of 3 decimal places when reporting regression coefficients, and 3 decimal places when reporting p-values

· DO make sure tables and figures have titles and are referred to in the text

· DO make sure your tables and figures are self-explanatory. Include a clear title and add notes if needed.

· DO make sure the numbers in your tables are reproducible and internally consistent, for example the sum of the sub-group sizes is the same as the study sample size.
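The decimal-place and p-value guidance above can be illustrated with a minimal sketch. It assumes the stand-in `auto` dataset that ships with Stata (not the assessment data) and an arbitrary linear regression chosen only to produce a very small p-value; the scalar names are hypothetical.

```stata
* Stand-in data shipped with Stata; NOT the assessment dataset
sysuse auto, clear
regress price weight

* Regression coefficient shown with 3 decimal places
display "b = " %6.3f _b[weight]

* The output table rounds small p-values to 0.000, so recompute the
* two-sided p-value from the t statistic to see its actual value
scalar t_weight = _b[weight] / _se[weight]
scalar p_weight = 2 * ttail(e(df_r), abs(t_weight))
display "p = " %10.3e p_weight
```

The `ttail()` function returns the upper-tail probability of the t distribution, so doubling it gives the two-sided p-value that the regression table rounds.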

Dataset:

For this data analysis exercise, you are asked to analyse data from a recent study of physical health in a sample of middle-aged and older men and women. Investigators collected data from a population sample, a random subset of which is available to you for this assessment. Respondents were asked a range of questions and took part in a short examination. You are given a small extract of data from the survey:

§ A codebook for the variables used in this assessment is provided below:

Variable name    Variable label and description
sbp              Systolic blood pressure in mmHg
bmi              Body mass index in kg/m²
age              Age of participant (collapsed at 90+)
sex              Sex of the participant (1=men, 2=women)
chol             Total cholesterol in mmol/l
smok             Whether ever smoked cigarettes
physact          Physical activity
sclass           Social class
srh              Self-reported general health
depres           Self-reported occurrence of depressive symptoms

Report brief:

You have been tasked with exploring the association between body mass, measured by BMI, and systolic blood pressure among the study participants.

To examine the association, you are expected to use the BMI measure (bmi, independent variable), which you will first categorise into a binary variable, and systolic blood pressure (sbp, dependent variable) in regression analyses. Age and sex must be part of your final model as a priori selected covariates, and you should also consider the other participant characteristics available in this dataset.

To demonstrate your knowledge of basic statistical methods and model building, we have also provided six other variables (social class, total cholesterol, physical activity, smoking status, self-reported general health, and depressive symptoms) for you to consider adding to your model.

You should start your report with an introduction describing the aim of the work when you start answering Part A. Your report should end with a concluding remark summarising the main findings. This summary should be no more than 100 words and forms part of the response required for Part D of this assessment.

To answer the questions, you need to recode the body mass variable (bmi) and other variables according to the instructions below:

§ Body mass: generate a bmi_bin variable, created only from valid values of the original variable bmi, with two groups:

those who are underweight or normal weight (BMI 24.9 or less),

and those who are classified as overweight, obese or severely obese (BMI 25.0 or more).

· All variables: You must check how missing values are coded. All missing values should be recoded as ‘.’

· Your analysis in Parts B-D should be based on a complete-case dataset, i.e. an analytical sample of individuals with no missing data for any variable.
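The recoding rules above can be sketched on a small made-up dataset. Everything here is an assumption for illustration: the data values, the `id` variable, the missing-value code -9 (you must check how missing values are actually coded in the assessment data), the label name, and the use of 25 as the cut-point separating the two groups.

```stata
* Toy data for illustration only; -9 is an ASSUMED missing-value code
clear
input id bmi sbp
1 22.4 118
2 27.9 135
3 -9   141
4 31.2 -9
5 24.9 122
end

* Recode the assumed numeric missing code to Stata's system missing (.)
mvdecode bmi sbp, mv(-9)

* Binary BMI from valid values only: 0 = BMI below 25, 1 = BMI 25 or more
generate byte bmi_bin = (bmi >= 25) if !missing(bmi)
label define bmi_bin_lbl 0 "Under/normal weight" 1 "Overweight or above"
label values bmi_bin bmi_bin_lbl

* Complete-case sample: keep rows with no missing value on any listed variable
egen nmiss = rowmiss(bmi sbp)
keep if nmiss == 0
count
```

The `if !missing(bmi)` qualifier matters: without it, Stata treats a missing BMI as larger than any number and would place those participants in the higher category.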

· Part A: The aim and description of the dataset (15 points)

Start your report by stating the main aim of the analysis and this report.

Describe the data types and report distribution patterns of the relevant variables in the study sample, e.g. frequency distributions, measures of central tendency and the number of missing values. Present the frequency distributions of the categorical variables in your analytical sample (i.e. complete cases) in a table. You need to describe the distribution patterns of continuous variables by assessing empirical measures such as skewness, kurtosis, median and mean.

You need to explain how you recategorised or cleaned specific variables where needed. Please create the binary body mass index variable (bmi_bin) as described above; it will be used in the final section of the analysis.

After you describe missingness in your dataset, you should also state the size of the analytical sample prepared for complete-case analysis.
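A minimal sketch of the kind of descriptive commands Part A refers to is shown here. It assumes the stand-in `auto` dataset that ships with Stata rather than the assessment data, so the variable names (price, rep78, foreign) are placeholders only.

```stata
* Stand-in data shipped with Stata; NOT the assessment dataset
sysuse auto, clear

* Data types and missingness
describe price rep78 foreign
misstable summarize price rep78 foreign

* Continuous variable: mean, median, skewness and kurtosis
summarize price, detail
display "mean = " %9.2f r(mean) "  median = " %9.2f r(p50)
display "skewness = " %5.2f r(skewness) "  kurtosis = " %5.2f r(kurtosis)

* Categorical variable: frequency distribution, showing missing values
tabulate rep78, missing

* Size of the complete-case sample across the listed variables
count if !missing(price, rep78, foreign)
```

The `missing()` function with several arguments is true when any of them is missing, so negating it counts only complete cases.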

· Part B: Presenting a statistical hypothesis (5 points)

Taking the original body mass variable (bmi) as the exposure (independent variable) and the level of systolic blood pressure (sbp) as the outcome (dependent variable), formulate a statistical hypothesis which you will later test in Part C. Answers should include both the null and the alternative hypothesis.

Your hypotheses will be used later in Part D, where you will state whether your results support or reject your null hypothesis.

Note: Your answer should not exceed 50 words.

· Part C: Testing associations (35 points in total):

(a) (15 points)

Using appropriate tests (t-test, ANOVA, Pearson’s r, or Chi-squared test according to the data type), explore the associations:

(1) between covariates (age, sex, social class, total cholesterol, physical activity, smoking, self-rated general health and depressive symptoms) and the independent variable (body mass index, use bmi as continuous variable),

and

(2) between covariates (age, sex, social class, total cholesterol, physical activity, smoking, self-rated general health and depressive symptoms)  and main dependent variable (level of systolic blood pressure, sbp)  

to establish which covariates may act as confounding variables or effect modifiers.

Explain your choice of the statistical methods used in your analysis.  

Note: Variables age and sex will be present in later models regardless of results from Part C (a).

Please describe your decisions for inclusion of variables in final analytical model in the form of a table and accompanying text.  

You are expected to examine the statistical significance of the associations specified above. Using the above-mentioned tests, you should decide whether there are any potential confounding variables in the data, and whether each variable will be included in the multivariable analytical model.

The table you are expected to include should show appropriate test statistics (such as group mean values from ANOVA or t-tests) with their levels of significance.
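The four tests named in Part C(a) map onto Stata commands roughly as sketched here. The sketch assumes the stand-in `auto` dataset that ships with Stata, so the variable pairings are placeholders chosen only by data type; which test suits which assessment variable remains your decision.

```stata
* Stand-in data shipped with Stata; NOT the assessment dataset
sysuse auto, clear

* Continuous variable vs binary variable: two-sample t-test
ttest price, by(foreign)

* Continuous variable vs categorical variable with 3+ groups: one-way ANOVA
oneway price rep78, tabulate

* Two continuous variables: Pearson's correlation with its p-value
pwcorr price weight, sig obs

* Two categorical variables: Chi-squared test
tabulate foreign rep78, chi2
```

The `tabulate` option on `oneway` prints the group means alongside the ANOVA table, which is the kind of summary a results table could draw on.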

(b) (20 points)

Test your hypothesis through examining your analytical model in steps (such as crude association or full model). This analysis should include the variables identified as potential confounders or effect modifiers in Part C(a), applying appropriate statistical tests.

For this part of the analysis, use the dichotomised body mass index variable bmi_bin.

Note: Variables age and sex should be included in the model(s) regardless of results from Part C (a).

Explain your choice of statistical method(s), including how you tested the essential assumption(s) of the chosen analysis, for example a linear association.

Based on the estimates in the Stata output, interpret the results and the model fit. Present the results in a table. You are expected to present the results (estimates, 95% CI, p-value) from appropriate regression models, including the crude association as the starting point of your analysis.
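Building a model in steps, from a crude to an adjusted association, might look like the sketch here. It assumes a linear regression and the stand-in `auto` dataset that ships with Stata; the choice of model, covariates and assumption checks for the assessment is yours to justify, and the stored-estimate names are hypothetical.

```stata
* Stand-in data shipped with Stata; NOT the assessment dataset
sysuse auto, clear

* Step 1: crude association between a binary exposure and a continuous outcome
regress price i.foreign
estimates store crude

* Step 2: the same association adjusted for further covariates
regress price i.foreign weight length
estimates store adjusted

* Side-by-side comparison of the two models
estimates table crude adjusted, b(%9.3f) se(%9.3f) stats(N r2_a)

* Example assumption checks for the adjusted model
rvfplot, yline(0)     // residuals vs fitted values
estat hettest         // Breusch-Pagan test for heteroskedasticity
```

Comparing the exposure coefficient between the crude and adjusted models shows how much the added covariates change the estimate.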

· Part D: Reporting interpretation of the results and research implications (20 points)

Comment on the associations in terms of direction, magnitude of the effect and strength of the association, and report whether they support or reject your hypothesis. Also comment on the broader limitation(s) of your model. Summarise and conclude your findings.

· Part E: Do-File (15 points)

Submit an appended do-file which can be run to generate the findings in your report. All syntax should run without error. The command to start a log file must be placed before any analyses, and the command to close the log file must be placed at the end. Your do-file should also contain comments describing the syntax.

What should be included:

-comment each section appropriately;

-explore the dataset;

-summarize and describe the dataset;

-show recoding of bmi;

-show how to deal with missing values;

-show appropriate testing for confounders/effect modifiers;

-show the use of appropriate regression for crude association;

-show the commands for appropriate graphical presentation of the associations between characteristics (when relevant);

-show syntax for appropriate model building in the regression stage of the analysis and for the final model.
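The log-file and commenting requirements for Part E can be sketched as a do-file skeleton. The log file name, the section headings and the stand-in `auto` dataset are all assumptions for illustration; a real do-file would load the assessment data and contain the full analysis.

```stata
* ----- 0. Set-up: open the log before any analysis -----
capture log close                        // close any log left open earlier
log using "part_e_log.log", replace text // hypothetical file name

* ----- 1. Explore the dataset -----
sysuse auto, clear                       // stand-in data, NOT the assessment data
describe
codebook, compact

* ----- 2. Summarise and describe -----
summarize price, detail

* ----- 3. Recoding, tests and regression models would follow here -----

* ----- 4. Close the log at the very end -----
log close
```

Starting with `capture log close` lets the do-file be rerun from the top without stopping on an "already open" log error.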

Presentation (10 points)

You will be awarded points for the clarity of your answers, the formatting and clarity of your tables, and the quality and clarity of your graphs.

8-10 points: correct answers with outputs shown in a concise and clear format

5-7 points: correct answers with outputs that can be difficult to understand

0-4 points: unclear answers, text difficult to understand, with unclear tables and graphs.