Homework 1

Biostatistics I

Due 11:59pm on September 19, 2022

General Instructions

The first step in any analysis is understanding your data.  In this assignment you will practice loading a dataset, creating a variety of figures, and producing numerical summaries.

Use the texts and Google as references.

Setup R

• Begin with a new folder that will be used only for this assignment, including raw data, R Markdown, and all output. Create an appropriately named R Markdown file. A template will be provided as an example. Make sure you include your name and the date you knitted the document. For R scripts within the R Markdown file, add comments or a short description to the code so that it is easy to follow.

• This system is crucial to keep your analyses organized - you may be using the same dataset and similar analyses for different projects, and it is easy to accidentally save them in the wrong folder!

• The next thing in your file should be the packages you load for this session. As in class, we will use tidyverse to load a collection of packages that will be useful. If at a later point in the assignment you find yourself in need of other packages, be sure to load them here, and not buried in the middle of the code.

# library(tidyverse)

• The next step is to load the data used for the assignments.

## Load data

NHANES glycohemoglobin data

For this assignment you will be using NHANES data. You will use techniques learned this week to explore and describe the data. Download the “NHANES glycohemoglobin data” dataset from https://hbiostat.org/data/. Read the description of the variables provided on the website. Note that the tx and dx variables are coded as 0 and 1: a 0 means the patient does not have the condition and a 1 means the patient does have it. Save the dataset in your current directory (folder), and use the appropriate R function to load the data into your session. If you use RStudio’s dropdown menu to import, be sure to keep the code so you can recreate the analysis later.
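A minimal sketch of the loading step follows. The file name nhgh.tsv and the tab-separated format are assumptions about what was downloaded; change the name and the reader function to match your actual file.

```r
library(tidyverse)

# Assumption: the dataset was saved as "nhgh.tsv" (tab-separated)
# in the same folder as the R Markdown file.
data_file <- "nhgh.tsv"

if (file.exists(data_file)) {
  nhgh <- read_tsv(data_file)  # read_tsv() comes from readr, loaded by tidyverse
  glimpse(nhgh)                # quick look at column names and types
} else {
  message("File not found: ", data_file)
}
```

Using a path relative to the assignment folder keeps the document reproducible when the whole folder is moved or shared.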

Take a moment to familiarize yourself with the data and variables available, and recode categorical variables using the labels instead of the numbers.
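One way to recode a 0/1 variable is with factor(), sketched here on a toy data frame. Only the 0/1 coding of tx and dx comes from the assignment; the toy values and the "No"/"Yes" label wording are placeholders, so take the actual wording from the variable descriptions.

```r
library(tidyverse)

# Toy stand-in for the real data (values are made up).
toy <- tibble(tx = c(0, 1, 0, 1), dx = c(0, 0, 1, 1))

# Replace the numeric codes with labeled factor levels.
# "No"/"Yes" are placeholder labels.
toy_labeled <- toy %>%
  mutate(
    tx = factor(tx, levels = c(0, 1), labels = c("No", "Yes")),
    dx = factor(dx, levels = c(0, 1), labels = c("No", "Yes"))
  )

count(toy_labeled, dx)
```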

In all figures, be sure to have all components of an effective figure, including title and labels, and color when appropriate. Please note that you can control the rendered figure size by setting fig.width and fig.height using knitr chunk options, either in the individual code chunk or in the YAML header of the R Markdown document.
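As a rough illustration of the figure-size options, defaults for all later chunks can be set once in a setup chunk. The values 7 and 5 (inches) are arbitrary examples.

```r
# Set the default figure size (in inches) for every chunk that follows.
# 7 and 5 are example values, not a requirement of the assignment.
knitr::opts_chunk$set(fig.width = 7, fig.height = 5)
```

To change the size of a single figure instead, the same options can go in that chunk's header, for example {r, fig.width = 7, fig.height = 5}.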

Question 1. BMI

Question 1a. Histogram of BMI

Make a histogram of the BMI values of individuals in the dataset.

Question 1b. Overlaid histogram of BMI by diabetes diagnosis

Make an overlaid histogram that colors the BMI of diabetic or pre-diabetic individuals (dx) differently from the BMI of individuals that are not diabetic/pre-diabetic. This essentially yields two histograms on the same plot.

You will need to use the argument position="identity". If you do not use this parameter, the histograms will be stacked.
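The sketch below shows where the position argument goes, using simulated values rather than the NHANES data. The names sim, value, and group are made up for illustration, and the alpha and bins settings are arbitrary choices.

```r
library(ggplot2)

set.seed(1)
# Simulated values, not NHANES data; column names are hypothetical.
sim <- data.frame(
  value = c(rnorm(200, mean = 25, sd = 4), rnorm(100, mean = 30, sd = 5)),
  group = rep(c("A", "B"), times = c(200, 100))
)

# position = "identity" draws each group's bars from the x-axis,
# so the two histograms overlap; alpha keeps both visible.
p_overlaid <- ggplot(sim, aes(x = value, fill = group)) +
  geom_histogram(position = "identity", alpha = 0.5, bins = 30) +
  labs(title = "Overlaid histograms of simulated values",
       x = "Value", y = "Count", fill = "Group")

p_overlaid
```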

Question 1c. Stacked histogram of BMI and diabetes diagnosis

Make a stacked histogram that colors the BMI of diabetic or pre-diabetic individuals (dx) differently from the BMI of individuals that are not diabetic/pre-diabetic.

Now change the argument to position="stack".

Question 1d. Understanding graphing methods by comparing plots

Describe the difference between the two plots in 1b) and 1c).

Question 1e. Describing an observed distribution

Write a few sentences that describe the distribution of BMI.

Question 1f. Density plot of BMI by diabetes diagnosis

Make a density plot (not stacked) that colors the BMI values of diabetic/pre-diabetic individuals differently from the BMI of individuals who are not diabetic or pre-diabetic.

Question 1g. Understanding histograms and density plots

List two advantages of using superimposed histograms (question 1b) and two advantages of using superimposed density plots (question 1f).

Question 2. BMI by sex

Question 2a. Density plot of BMI

Make a density plot of the BMI data.

Question 2b. Density plots of BMI by sex

Make a density plot of the BMI data with a different curve for females and males.

For this question, you should use different colors for the different densities and fill in the plot. Also control the transparency so that you can see both plots.
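A sketch of filled, semi-transparent density curves on simulated data follows. The data, the column names, and the alpha value of 0.4 are illustrative assumptions; with the real data you would map your own variables.

```r
library(ggplot2)

set.seed(1)
# Simulated values, not NHANES data; column names are hypothetical.
sim <- data.frame(
  value = c(rnorm(200, mean = 26, sd = 4), rnorm(200, mean = 28, sd = 6)),
  group = rep(c("A", "B"), each = 200)
)

# fill colors the area under each curve; alpha < 1 makes the
# fill transparent so overlapping curves both remain visible.
p_density <- ggplot(sim, aes(x = value, fill = group, color = group)) +
  geom_density(alpha = 0.4) +
  labs(title = "Density of simulated values by group",
       x = "Value", y = "Density", fill = "Group", color = "Group")

p_density
```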

Question 2c. Describing observed distributions

Describe the distributions of BMI for males and females.

Question 3. BMI categories by race/ethnicity

Although BMI is a continuous variable, it often is analyzed by categories defined as the following: normal when BMI < 25, overweight when 25 ≤ BMI < 30, and obese when BMI ≥ 30. For this question, you are asked to analyze BMI as this categorical variable (you will need to create a new variable).
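One possible way to build such a variable is cut(), sketched here on a few made-up BMI values. Setting right = FALSE makes each interval include its lower bound and exclude its upper bound, which matches the cut points above (25 falls in overweight, 30 in obese). The label strings are placeholders.

```r
# Made-up BMI values chosen to sit on and around the cut points.
bmi <- c(18.2, 24.9, 25, 29.9, 30, 41.3)

# Intervals: [-Inf, 25), [25, 30), [30, Inf)
bmicat <- cut(
  bmi,
  breaks = c(-Inf, 25, 30, Inf),
  right  = FALSE,
  labels = c("Normal", "Overweight", "Obese")
)

table(bmicat)
```

Inside a data frame the same call can be wrapped in mutate(), for example mutate(bmicat = cut(bmi, ...)).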

Question 3a. Side-by-side bar plot

Create a side-by-side bar plot showing the proportion of individuals in each of the BMI categories by race/ethnicity.

• For this, create a categorical BMI variable (called bmicat) and use ggplot(mapping = aes(str_wrap(re, 10), fill = bmicat)).

• Make each category a different color to help distinguish them, and ensure labels are clearly legible.

• All bars should be side-by-side and not stacked.
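The sketch below is one of several ways to get side-by-side bars of proportions: it computes within-group proportions first and then plots them with geom_col(). It is not necessarily the approach the hint above has in mind. The toy data and group names are made up; only the names re and bmicat come from the assignment.

```r
library(tidyverse)

# Toy data (made up); re and bmicat mirror the assignment's variable names.
toy <- tibble(
  re = c(rep("Group one with a long name", 6), rep("Group two", 4)),
  bmicat = factor(
    c("Normal", "Normal", "Normal", "Overweight", "Overweight", "Obese",
      "Normal", "Overweight", "Obese", "Obese"),
    levels = c("Normal", "Overweight", "Obese")
  )
)

# Proportion of each BMI category within each re group.
props <- toy %>%
  count(re, bmicat) %>%
  group_by(re) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# position = "dodge" places the bars next to each other instead of stacking;
# str_wrap() breaks long axis labels across lines.
p_bar <- ggplot(props, aes(x = str_wrap(re, 10), y = prop, fill = bmicat)) +
  geom_col(position = "dodge") +
  labs(title = "BMI category proportions by group (toy data)",
       x = "Group", y = "Proportion", fill = "BMI category")

p_bar
```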

Question 3b. Describing data based on a plot

Summarize what you see in the bar plot.

Question 4. Glycohemoglobin by BMI categories

Question 4a. Boxplots

Make side-by-side boxplots of glycohemoglobin by BMI category.
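As a sketch of the general pattern, mapping a categorical variable to x and a numeric variable to y gives one box per category. The data are simulated and the names category and measurement are hypothetical.

```r
library(ggplot2)

set.seed(2)
# Simulated data, not NHANES; names and values are made up.
sim <- data.frame(
  category = factor(rep(c("Low", "Middle", "High"), each = 50),
                    levels = c("Low", "Middle", "High")),
  measurement = c(rnorm(50, 5.2, 0.4), rnorm(50, 5.5, 0.6), rnorm(50, 5.9, 0.9))
)

# One box per category; the legend is dropped because the
# x-axis already identifies each box.
p_box <- ggplot(sim, aes(x = category, y = measurement, fill = category)) +
  geom_boxplot() +
  labs(title = "Simulated measurement by category",
       x = "Category", y = "Measurement") +
  theme(legend.position = "none")

p_box
```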


Question 4b. Numeric summaries

Create a table that contains each of the following values of the glycohemoglobin for each BMI category: mean, median, min, max, standard deviation, and IQR.
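A grouped summary of this kind can be sketched with group_by() and summarise(). The toy values below are made up, and gh is assumed here as the name of the glycohemoglobin column; check the variable descriptions for the real name. If the real columns contain missing values, na.rm = TRUE would need to be added to each summary function.

```r
library(tidyverse)

# Toy data (made up); gh is an assumed column name.
toy <- tibble(
  bmicat = factor(rep(c("Normal", "Obese"), each = 3),
                  levels = c("Normal", "Obese")),
  gh = c(5.0, 5.2, 5.4, 5.6, 6.0, 7.0)
)

# One row per category, one column per statistic.
gh_summary <- toy %>%
  group_by(bmicat) %>%
  summarise(
    mean   = mean(gh),
    median = median(gh),
    min    = min(gh),
    max    = max(gh),
    sd     = sd(gh),
    iqr    = IQR(gh),
    .groups = "drop"
  )

gh_summary
```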



Question 4c. Describing data

Describe how glycohemoglobin differs among the different BMI categories based on both graphical and numeric summaries of the data.