STAT 380 Data Science Through Statistical Reasoning and Computation

Spring 2023

Exam 1 - Take-home

Acknowledgements: By uploading your work for the take-home in Canvas, you are acknowledging the following:

•   I acknowledge that giving help to any other person during the exam, receiving help from any other person during the exam, or working with any other person on exam problems while taking the exam is grounds for an academic integrity violation. "Any other person" refers to anyone, NOT just a classmate.

•   I acknowledge that posting any content from this exam or posting any questions for help on this exam to any online site (e.g.,  StatExchange, CourseHero, Chegg, GroupMe, ChatGPT, etc.) is grounds for an academic integrity violation.

•   I acknowledge that copying answers from any source is grounds for an academic integrity violation.

•   I acknowledge that sharing any or all parts of this exam with any other person, classmate or otherwise, is grounds for an academic integrity violation.

Set up: (5 Points)

•   Create a new R Markdown by going to FILE >> NEW FILE >> R Markdown.

•    Modify the YAML header so that

o the title is STAT 380 Exam 1 Take-home

o there is a line for the author with your name specified

o there is a date

•    Delete the default text.

•   Add a Front Matter section. Use a level 2 heading and add libraries as needed. All library commands should be in the Front Matter section. Also, include any commands for reading in the dataset in the Front Matter section.

•   Name the file LastnameFirstinitial_STAT380_Exam1TH.

•   For each problem, you should add a level 2 heading with the task number (e.g., Problem 1 Part a, Problem 1 Part b, etc.). Then, write code and answer the questions as necessary.

Data: The problems on this exam use the file LifeExpectancyData.csv. The Global Health Observatory (GHO) data repository under the World Health Organization (WHO) keeps track of health status as well as many other related factors for all countries. LifeExpectancyData.csv combines the data from the WHO with economic data from the United Nations website.

The dataset contains the following variables:

•   Country

•   Year

•    Status Indicates whether the country is a developed or a developing country

•    Life.expectancy The country’s life expectancy in years

•    Adult.mortality Number of adult deaths per 1000 population (adult = ages 15-60)

•    Infant.deaths Number of infant deaths per 1000 population

•    Alcohol Alcohol consumption per person (in liters)

•   Percentage.expenditure Expenditure on health as a percentage of Gross Domestic Product per capita (%)

•    Hepatitis.B Percentage of 1-year-olds immunized for Hepatitis B

•    Measles Number of reported cases per 1000 people

•    BMI Average Body Mass Index of entire population

•    Under.five.deaths Number of child (less than 5 years old) deaths per 1000 population

•    Polio Percentage of 1-year-olds immunized for Polio

•    Total.expenditure Percentage of total government expenditure that goes to health

•   Diphtheria Percentage of 1-year-olds immunized for Diphtheria tetanus toxoid and pertussis (DTP3)

•    HIV.AIDS Deaths per 1000 live births from HIV/AIDS (0-4 years)

•    GDP Gross Domestic Product per capita (in USD)

•    Population Population of the country

•   Thinness.10.19.years Percentage of children and adolescents (10-19 years) exhibiting thinness

•    Thinness.5.9.years Percentage of children exhibiting thinness

•   Income.composition.of.resources Human Development Index in terms of income composition of resources (index ranging from 0 to 1)

•    Schooling Average number of years of schooling

NOTE: When reading in this dataset, make sure the variable names are Country, Year, Status, etc. instead of V1, V2, V3, …

a.   To solve this issue, click the box for Yes for Heading in the Import Dataset interface.

b.   The image below, although unrelated to this dataset, demonstrates where to change the Heading option.
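
The same effect can be obtained in code rather than through the Import Dataset interface. The following is a minimal sketch, assuming the file is read with base R's read.csv(); the temporary file and its three made-up rows are purely illustrative and are not part of LifeExpectancyData.csv.

```r
# Write a tiny CSV with a header row to a temporary file (made-up values).
tmp <- tempfile(fileext = ".csv")
writeLines(c("Country,Year,Status",
             "A,2014,Developing",
             "B,2015,Developed"), tmp)

# header = TRUE (the read.csv default) uses the first row as variable names.
with_header <- read.csv(tmp, header = TRUE)
names(with_header)

# header = FALSE treats the first row as data and names the columns V1, V2, ...
without_header <- read.csv(tmp, header = FALSE)
names(without_header)
```

If names() returns V1, V2, V3, the header row was read as data and the file should be re-read with the heading option turned on.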


Problems:

1. (8 Points) Exploration. Suppose we are interested in understanding factors related to the life expectancy variable.

a. Read the variable descriptions and identify 3 variables that you believe will be associated with life expectancy. (Do not use Country or Year.) There is no right or wrong answer to this question; however, to answer this question, write a paragraph that identifies the 3 variables you have picked and explains your reasoning as to why you think the variables will be related to life expectancy.

b. For each of the variables you have chosen, create an appropriate (and properly labeled) plot that shows the relationship between life expectancy and the variable. Then, write a sentence or two explaining what you have learned about the relationship based on the plot. NOTE: For this question, you should have 3 plots and 3 explanations.

c. Examine the data for Australia. (You can write code to explore or simply look through all of the columns.) Write a paragraph discussing concerns you have about the correctness of the data. Provide at least two specific examples of a value/variable that you think is incorrect and explain your reasoning.

2. (12 Points) Prediction. Suppose we are interested in predicting the life expectancy as accurately as possible using Adult.mortality, Infant.deaths, GDP, Total.expenditure, BMI, and Status. The goal of this problem is to compare the predictive ability of a multiple linear regression model to that of a kNN regression based on test set performance. Both models should be based on the same variables.

a. Subset the data to only include the seven variables mentioned above and remove any observations with missing data (NA’s) in the seven variables.

b. Write a paragraph explaining the steps you will take to perform the analysis. This should include a description of the data processing that you plan to do, the decisions you will make regarding the training/testing split, how you plan to pick “k” in kNN, and the metric you will use to compare the predictive ability of the methods on the test set. (Unlike in class, where I tell you step-by-step what to do and what decisions to make, this question is trying to get you to articulate the steps/decisions.)

c. Implement the analysis described in Part b. Be sure to include a plot that can be used to justify your choice of “k” for kNN.

d. Based on your implementation, which model had better predictive ability based on the test set? Explain your reasoning. (Be sure to mention the value of “k” that you have chosen and include values for test set performance in this explanation.)
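
For orientation only, the general shape of such a comparison can be sketched on simulated data. Everything below is an assumption made for illustration: the variables x1, x2, and y, the 80/20 split, the validation split used to pick k, the RMSE metric, and the hand-rolled helper knn_predict() (written in base R so the sketch needs no packages). It is not the required analysis and does not use the exam dataset.

```r
set.seed(380)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in data: two numeric predictors, one response, a few NAs.
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = runif(n), extra = rnorm(n))
sim$y <- 60 + 3 * sim$x1 - 5 * sim$x2^2 + rnorm(n)
sim$x1[c(5, 50)] <- NA

# Keep only the modelling variables, then drop rows with any NA in them.
dat <- na.omit(sim[, c("y", "x1", "x2")])

# 80/20 training/testing split (the proportion is an assumption).
train_id <- sample(nrow(dat), size = floor(0.8 * nrow(dat)))
train <- dat[train_id, ]
test  <- dat[-train_id, ]

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Multiple linear regression, evaluated on the test set.
lm_fit  <- lm(y ~ x1 + x2, data = train)
lm_rmse <- rmse(test$y, predict(lm_fit, newdata = test))

# Hypothetical helper: predict each new point as the mean response of its
# k nearest training points (Euclidean distance).
knn_predict <- function(train_x, train_y, new_x, k) {
  apply(new_x, 1, function(pt) {
    d <- sqrt(colSums((t(train_x) - pt)^2))
    mean(train_y[order(d)[seq_len(k)]])
  })
}

# Scale predictors using training means and SDs only.
ctr <- colMeans(train[, c("x1", "x2")])
sds <- apply(train[, c("x1", "x2")], 2, sd)
train_x <- scale(train[, c("x1", "x2")], center = ctr, scale = sds)
test_x  <- scale(test[, c("x1", "x2")],  center = ctr, scale = sds)

# Pick k on a validation subset of the training data, not on the test set.
val_id <- sample(nrow(train_x), size = floor(0.25 * nrow(train_x)))
k_grid <- 1:25
val_rmse <- sapply(k_grid, function(k) {
  pred <- knn_predict(train_x[-val_id, , drop = FALSE], train$y[-val_id],
                      train_x[val_id, , drop = FALSE], k)
  rmse(train$y[val_id], pred)
})
best_k <- k_grid[which.min(val_rmse)]
plot(k_grid, val_rmse, type = "b", xlab = "k", ylab = "Validation RMSE")

# Compare both models on the same held-out test set.
knn_rmse <- rmse(test$y, knn_predict(train_x, train$y, test_x, best_k))
c(lm = lm_rmse, knn = knn_rmse)
```

The sketch scales the predictors because kNN distances depend on units, and it keeps the test set out of the choice of k so that the final comparison is not optimistic. Other reasonable choices (cross-validation, a different metric, a packaged kNN routine) would serve equally well.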

3. (10 Points) Inference. Suppose we are interested in understanding the association between adult mortality, the country’s status (developing or developed), and life expectancy.

a. Starting with the full dataset, subset the data to only include the three variables mentioned above and remove any observations with missing data (NA’s) in the three variables.

b. Build a multiple linear regression model for life expectancy that uses adult mortality and status. Run the summary() function on the model, report the estimated equation, and interpret the coefficients (partial slopes) associated with adult mortality and status. Note, you do not have to perform a training/testing split for this problem.

c. Instead of reporting a point estimate (single number) for the regression coefficient of Adult.mortality, we often use confidence intervals. In this problem, we will explore a computational procedure for estimating a 95% confidence interval for the coefficient of Adult.mortality. Remember that you can think of the confidence interval as a range of plausible values for the estimated quantity. Here is the logic:

i.      Generate a bootstrap sample of the data. To do this, generate a random sample of n rows from a dataset of n rows by sampling with replacement.

ii.      Build the model on the bootstrap sample (Train_boot) and store the estimated coefficient. The code below is helpful for extracting coefficients. The given code extracts the first estimated coefficient (the y-intercept). Changing the [1] to [2] extracts the second estimated coefficient (i.e., the first partial slope). And so on.

 

iii.      Repeat this procedure a large number of times, say 1000, so that you have 1000 estimates of the coefficient.

iv.      Find the values that cut off the middle 95% of the values. The code below shows how you would find the values that cut off the middle 95% of the values in the vector `beta_vec`.

 

Instruction for Part c: Find a 95% confidence interval for the regression coefficient of Adult.mortality by implementing the procedure described above. Write a loop that repeats steps i. and ii. 1000 times. Set a seed (of your choice) before the loop, not inside the loop. You can place set.seed(NULL) after the loop (not inside). Once the loop is done, plot a histogram of the estimated values and perform step iv.
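
The logic of steps i. through iv. can be sketched on simulated data as follows. This is an illustration under stated assumptions, not the handout's own code: the data frame dat, its columns x, g, and y, the seed, and the use of coef() and quantile() are choices made for this sketch. Only the names Train_boot and beta_vec come from the description above.

```r
set.seed(2023)  # seed set once, before the loop

# Simulated stand-in data (not the exam dataset).
n <- 200
dat <- data.frame(x = rnorm(n, mean = 150, sd = 40),
                  g = factor(sample(c("A", "B"), n, replace = TRUE)))
dat$y <- 80 - 0.05 * dat$x + 2 * (dat$g == "B") + rnorm(n, sd = 2)

B <- 1000
beta_vec <- numeric(B)  # storage for the bootstrap estimates

for (b in seq_len(B)) {
  # Step i: sample n row indices with replacement.
  boot_rows  <- sample(nrow(dat), size = nrow(dat), replace = TRUE)
  Train_boot <- dat[boot_rows, ]

  # Step ii: refit on the bootstrap sample; [2] is the first partial slope (x).
  fit_boot    <- lm(y ~ x + g, data = Train_boot)
  beta_vec[b] <- coef(fit_boot)[2]
}

set.seed(NULL)  # re-initialise the random number generator after the loop

hist(beta_vec, main = "Bootstrap estimates", xlab = "Coefficient of x")

# Step iv: the 2.5th and 97.5th percentiles cut off the middle 95%.
ci <- quantile(beta_vec, probs = c(0.025, 0.975))
ci
```

Setting the seed inside the loop would make every bootstrap sample identical, which is why it belongs before the loop.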