ASSIGNMENT 1

(Each part carries equal marks. Within a part, each sub-question carries equal marks.)

The National Health Interview Survey (NHIS) is an annual survey conducted by the Department of Health and Human Services to measure the state of health in the country. The National Death Index is a master database of everyone who has died in the US, including their Social Security number. Starting in the 1980s, respondents to the NHIS were matched to the National Death Index using detailed identifying information, and the resulting file includes a variable that indicates whether and when a person died within a follow-up period. The data also include the cause of death. This merged data is called the NHIS Multiple Cause of Death data (MCOD). I have constructed a data set that has respondents aged 25-64 from the NHIS/MCOD data. Below is a table that describes the variables in the data set.

Variable    Description
diedin5     Dummy variable, =1 if the respondent died within 5 years of the survey, =0 otherwise
Male        Dummy variable, =1 if the respondent is male, =0 otherwise
Age         Age in years
Married     Dummy variable, =1 if the respondent is married, =0 otherwise
Race        Categorical variable for race and ethnicity. =1 if respondent is white, non-Hispanic, =2 if black, non-Hispanic, =3 if other race, non-Hispanic, =4 if Hispanic
Educ        Categorical variable for educational level. =1 if respondent has less than a high school degree, =2 if a high school degree, =3 if some college, =4 if a bachelor's degree or more
Incomeg     Categorical variable for family income. =1 if family income is ≤$10K, =2 if >$10K and ≤$20K, =3 if >$20K and ≤$30K, =4 if >$30K and ≤$40K, =5 if >$40K and ≤$50K, =6 if >$50K
Bmi         Body mass index, weight in kg divided by the square of height in m
Srhealth    Categorical variable for self-reported health status. =1 if excellent health, =2 if very good, =3 if good, =4 if fair, =5 if poor

1) Start with some data description

Create a unique dataset. Drop the first N observations, where N is the last four digits of your student number (e.g., if your student ID is 40271234, drop the first 1234 observations; if your student ID is 40003999, drop the first 3999 observations, etc.).
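The dropping rule above can be sketched in a few lines. This is only an illustration in Python, assuming the data set has been loaded as an ordered list of observations; the function name `drop_leading_observations` and the stand-in list are hypothetical and not part of the assignment.

```python
def drop_leading_observations(rows, student_id):
    """Return rows with the first N removed, where N is the last four digits of student_id."""
    n_drop = int(str(student_id)[-4:])  # e.g. 40271234 -> 1234
    return rows[n_drop:]

# Hypothetical stand-in for the data set: 10,000 numbered observations.
sample = list(range(1, 10001))
kept = drop_leading_observations(sample, 40271234)
print(len(kept), kept[0])
```

Leading zeros in the last four digits are handled by the integer conversion, so an ID ending in 0042 would drop 42 observations.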

i.      Browse the dataset. What is the data type?

ii.     Summarize the data and comment on the descriptive statistics.

iii.    Check the distributions of the variables. Are they normally distributed? Why or why not? Are there signs of possible outliers?

iv.    Some of the variables are categorical. Use the “tab” command to find what fraction of the sample died in the 5 years after the survey. Comment on the distribution of diedin5 (one/two lines).

v.     Use the “tab” command again to calculate the fraction of “diedin5” by education (“educ”) first and then by income (“incomeg”) groups (this is a cross-tabulation). Provide some intuition (one/two lines) on the distribution of the mortality rate by education and income (i.e., how does mortality change with income and education?).

vi.    Compare mortality rates across racial and ethnic groups: compare the diedin5 rate for respondents who are white, black, and Hispanic. Comment on the result (one/two lines).

vii.   Use the “tab” command together with the “sum” command to get the means of “diedin5” jointly by educ and incomeg. Use these results to fill in Table 1.

viii.  Provide an example (one/two lines) of what the numbers in Table 1 mean (for example, what does the number in row 1, column 1 mean?).

Table 1. Means of diedin5 jointly by education and income

                 Income
Education   1    2    3    4    5    6
1
2
3
4
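Each cell of Table 1 is the mean of diedin5 among respondents in one education/income combination. A minimal sketch of that computation in Python follows; the function `cell_means` and the five toy records are made up for illustration and are not drawn from the NHIS/MCOD data.

```python
from collections import defaultdict

def cell_means(records, row_key="educ", col_key="incomeg", value_key="diedin5"):
    """Mean of value_key within each (row_key, col_key) cell."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rec in records:
        cell = (rec[row_key], rec[col_key])
        sums[cell] += rec[value_key]
        counts[cell] += 1
    return {cell: sums[cell] / counts[cell] for cell in counts}

# Toy records (invented values, for illustration only).
toy = [
    {"educ": 1, "incomeg": 1, "diedin5": 1},
    {"educ": 1, "incomeg": 1, "diedin5": 0},
    {"educ": 1, "incomeg": 1, "diedin5": 0},
    {"educ": 1, "incomeg": 1, "diedin5": 0},
    {"educ": 4, "incomeg": 6, "diedin5": 0},
]
means = cell_means(toy)
print(means[(1, 1)])  # share of the toy educ=1, incomeg=1 cell with diedin5 = 1
```

Because diedin5 is a 0/1 variable, each cell mean is the fraction of that cell's respondents who died within five years.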

2) Regression I

Race, income, and education are categorical variables. Construct dummy variables for race groups 2-4 (i.e., race2, race3 and race4), income groups 2-6 (i.e., income2, income3, income4, income5 and income6) and education groups 2-4 (i.e., education2, education3 and education4). Remember that this can be done with a single command.
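The dummy construction described above can be sketched as follows. This is an illustrative Python version, not the single command the assignment refers to; the helper `add_dummies` is hypothetical, and the lowercase keys `race`, `incomeg` and `educ` are assumed names for the categorical variables.

```python
def add_dummies(record, var, prefix, levels):
    """Return a copy of record with one 0/1 indicator per listed level of var."""
    out = dict(record)
    for level in levels:
        out[f"{prefix}{level}"] = 1 if record[var] == level else 0
    return out

# One invented respondent: other race non-Hispanic, lowest income group, high school degree.
row = {"race": 3, "incomeg": 1, "educ": 2}
row = add_dummies(row, "race", "race", [2, 3, 4])
row = add_dummies(row, "incomeg", "income", [2, 3, 4, 5, 6])
row = add_dummies(row, "educ", "education", [2, 3, 4])
print(row)
```

Group 1 of each variable gets no indicator, so a respondent in group 1 has zeros on all of that variable's dummies.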

Next, run a regression of diedin5 on age, male, married, plus the constructed dummies for race, income, and education.
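Written out, the regression described above might take the following form. The coefficient symbols and the error term $u_i$ are notation assumed here for illustration; the assignment itself does not fix any notation.

```latex
\begin{aligned}
\text{diedin5}_i ={}& \beta_0 + \beta_1\,\text{age}_i + \beta_2\,\text{male}_i + \beta_3\,\text{married}_i \\
&+ \sum_{j=2}^{4} \gamma_j\,\text{race}j_i
 + \sum_{k=2}^{6} \delta_k\,\text{income}k_i
 + \sum_{m=2}^{4} \theta_m\,\text{education}m_i + u_i
\end{aligned}
```

Here $\text{race}j_i$, $\text{income}k_i$ and $\text{education}m_i$ denote the dummies race2-race4, income2-income6 and education2-education4, with group 1 of each variable serving as the reference category.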

i.      Interpret the coefficient on age. As a person ages 10 years, what happens to the five-year mortality rate?

ii.     Generate the log of age. Run again the regression above but replacing age with the log of age. Interpret again the effect of age.

iii.    Interpret the coefficient on married. Does this coefficient make sense to you? Why?

iv.    Look at the coefficients on the income dummy variables. Interpret the coefficient on the dummy for income group 6 (i.e., what does the coefficient mean, and how large is the effect?).

v.     Why is it a good idea to replace categorical variables like race, education, and income with dummies?

vi.    At the 5% significance level, can you reject the null hypothesis that the coefficient on married is equal to zero? Why? How is the t-statistic obtained?

3) Regression II

Construct a set of dummy variables for the following four BMI categories:

•     Underweight, if BMI<=19

•     Overweight, if BMI>25 and BMI<=30

•     Obese, if BMI>30 and BMI<=35

•     Severely obese, if BMI>35.

Add the four dummy variables above to the previous regression you used in Part 2 (ii).
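The four category definitions can be sketched as below. This is an illustrative Python version; the function `bmi_dummies` and the key names are hypothetical, and it assumes BMI is non-missing for the respondent.

```python
def bmi_dummies(bmi):
    """Map a BMI value to the four 0/1 category indicators defined above."""
    return {
        "underweight": int(bmi <= 19),
        "overweight": int(25 < bmi <= 30),
        "obese": int(30 < bmi <= 35),
        "severely_obese": int(bmi > 35),
    }

# Boundary values fall in the lower category because the upper bounds are inclusive.
for value in (19, 22, 30, 35.1):
    print(value, bmi_dummies(value))
```

A respondent whose BMI lies above 19 and at or below 25 receives zeros on all four indicators.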

i.      Why have you not included a dummy for 19<BMI<=25?

ii.     Interpret the coefficient on overweight. Does this coefficient make sense to you? Why?

iii.    Why is the coefficient on underweight such a large positive number?

iv.    Test whether the BMI dummies generated above are jointly significant. What is the null? Do you reject the null? Why?

v.     Run a test of linear restrictions and test that the effect of married is equal to the effect of male. What is the null? Do you reject the null? Why?

4) Test for heteroscedasticity

i.      Use the model estimated in Part 3 to test for heteroscedasticity. What is the null hypothesis? Do you reject the null? What are the issues related to heteroscedasticity? What are the possible tests you can use?

ii.     Run the regression above again, controlling for heteroscedasticity. Can you see any difference? Why?
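For reference, one common way to control for heteroscedasticity is to keep the OLS coefficients and replace the usual variance estimate with White's heteroscedasticity-robust form. The expression below is that standard estimator in assumed notation ($x_i$ is the regressor vector of observation $i$, $\hat{u}_i$ the OLS residual); it is offered as a sketch, not necessarily the approach intended by the assignment, and many software packages additionally apply a finite-sample adjustment.

```latex
\widehat{\operatorname{Var}}\big(\hat{\beta}\big)
  = (X'X)^{-1} \left( \sum_{i=1}^{n} \hat{u}_i^{2}\, x_i x_i' \right) (X'X)^{-1}
```

Under this approach the coefficient estimates are unchanged relative to OLS; only the standard errors, and therefore the t-statistics and p-values, differ.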

5) The Gauss-Markov Theorem

i.      Illustrate the Gauss-Markov Theorem. What assumptions are made? What desirable properties will the OLS estimator then have? What does each property mean?

ii.    Do the assumptions always hold? If they are violated, what issues may occur? How would you handle these violations?