ECO3015 Assignment 1
(Each part carries equal marks. Within each part, each sub-question carries equal marks.)
The National Health Interview Survey (NHIS) is an annual survey conducted by the Department of Health and Human Services to measure the state of health in the country. The National Death Index is a master database of everyone who has died in the US, including their Social Security number. Starting in the 1980s, respondents to the NHIS were matched to the National Death Index using detailed identifying information, and the resulting file includes a variable that identifies whether and when a person died within a follow-up period. The data also include the cause of death. The merged data set is called the NHIS Multiple Cause of Death (MCOD) data. I have constructed a data set of respondents aged 25-64 from the NHIS/MCOD data. The table below describes the variables in the data set.
| Variable | Description |
|---|---|
| diedin5 | Dummy variable, =1 if the respondent died within 5 years of the survey, =0 otherwise |
| Male | Dummy variable, =1 if the respondent is male, =0 otherwise |
| Age | Age in years |
| Married | Dummy variable, =1 if the respondent is married, =0 otherwise |
| Race | Categorical variable for race and ethnicity. =1 if respondent is white, non-Hispanic, =2 if black, non-Hispanic, =3 if other race, non-Hispanic, =4 if Hispanic |
| Educ | Categorical variable for educational level. =1 if respondent has less than a high school degree, =2 if a high school degree, =3 if some college, =4 if a bachelor’s degree or more |
| Incomeg | Categorical variable for family income. =1 if family income is ≤$10K, =2 if >$10K and ≤$20K, =3 if >$20K and ≤$30K, =4 if >$30K and ≤$40K, =5 if >$40K and ≤$50K, =6 if >$50K |
| Bmi | Body mass index, weight in kg divided by height in m squared |
| Srhealth | Categorical variable for self-reported health status. =1 if excellent health, =2 if very good, =3 if good, =4 if fair, =5 if poor |
1) Start with some data description
Create a unique dataset. Drop the first N observations, where N is given by the last four digits of your student number (e.g., if your student ID is 40271234, drop the first 1234 observations; if your student ID is 40003999, drop the first 3999 observations, etc.).
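The dropping rule above can be sketched as follows. This is a minimal illustration in Python rather than the statistical package used for the assignment; the function name `drop_by_student_id` and the toy list of 10,000 numbered observations are assumptions made here, not part of the data set.

```python
def drop_by_student_id(rows, student_id):
    """Drop the first N rows, where N is the last four digits of the ID."""
    # Treat the ID as text so that leading zeros in the last four
    # digits (e.g. "0012") are handled correctly.
    n_drop = int(str(student_id)[-4:])
    return rows[n_drop:]


# Hypothetical stand-in for the data: observations numbered 1..10000.
observations = list(range(1, 10001))

kept = drop_by_student_id(observations, "40271234")
print(len(kept), kept[0])
```

With the first example ID from the text, the first 1234 observations are removed and the sample starts at observation 1235.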
i. Browse the dataset. What is the data type?
ii. “Summarize” the data and comment on the descriptive statistics.
iii. Check the distributions of variables. Are they normal distributions? Why? Are there signs of possible outliers?
iv. Some of the variables are categorical. Use the “tab” command to find what fraction of the sample died in the 5 years after the survey. Comment on the distribution of diedin5 (one/two lines).
v. Use the “tab” command again to calculate the fraction of “diedin5” first by education (“educ”) and then by income (“incomeg”) groups (this is a cross-tabulation). Provide some intuition (one/two lines) on the distribution of the mortality rate by education and income (i.e., how does mortality change with income and education?).
vi. Compare mortality rates across racial and ethnic groups. Compare the diedin5 rate for white, black, and Hispanic respondents. Comment on the result of the comparison (one/two lines).
vii. Use the “tab” command together with the “sum” command to get the means of “diedin5” jointly by educ and incomeg. Use these results to fill in Table 1.
viii. Provide an example (one/two lines) of what the numbers in Table 1 mean (for example, what does the number in row 1, column 1 mean?).
Table 1. Means of diedin5 jointly by education and income

| Education \ Income | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 1 | | | | | | |
| 2 | | | | | | |
| 3 | | | | | | |
| 4 | | | | | | |
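Each cell of Table 1 is the mean of a 0/1 variable within one education-by-income group, which is the share of that group that died within five years. The sketch below shows the calculation in Python on a handful of made-up records; the helper `cell_means` and the toy values are assumptions for illustration and say nothing about the real data.

```python
from collections import defaultdict


def cell_means(records, row_key, col_key, value_key):
    """Mean of value_key within each (row_key, col_key) cell."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        cell = (record[row_key], record[col_key])
        totals[cell] += record[value_key]
        counts[cell] += 1
    return {cell: totals[cell] / counts[cell] for cell in counts}


# Made-up records, not taken from the NHIS/MCOD data.
toy = [
    {"educ": 1, "incomeg": 1, "diedin5": 1},
    {"educ": 1, "incomeg": 1, "diedin5": 0},
    {"educ": 1, "incomeg": 1, "diedin5": 0},
    {"educ": 1, "incomeg": 1, "diedin5": 0},
    {"educ": 4, "incomeg": 6, "diedin5": 0},
    {"educ": 4, "incomeg": 6, "diedin5": 0},
]

means = cell_means(toy, "educ", "incomeg", "diedin5")
print(means[(1, 1)])
```

In this toy sample one of the four respondents with educ = 1 and incomeg = 1 died, so that cell is 0.25, read as a 25% five-year mortality rate for the group.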
2) Regression I
Race, income, and education are categorical variables. Construct dummy variables for race groups 2-4 (i.e. race2, race3 and race4), income groups 2-6 (i.e. income2, income3, income4, income5 and income6) and education groups 2-4 (i.e. education2, education3 and education4). Remember this can be done with a single command.
Next, run a regression of diedin5 on age, male, married, plus constructed dummies for race, income, and education.
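The dummy construction described above, with group 1 left out as the reference category, can be sketched as follows. This is an illustration in Python, not the single command the assignment refers to; the function `make_dummies` and the five toy race codes are assumptions.

```python
def make_dummies(values, prefix, base=1):
    """One 0/1 column per category level, omitting the base level."""
    levels = sorted(set(values))
    return {
        f"{prefix}{level}": [1 if v == level else 0 for v in values]
        for level in levels
        if level != base
    }


# Toy race codes for five hypothetical respondents.
race = [1, 2, 4, 3, 1]
dummies = make_dummies(race, "race")

for name, column in dummies.items():
    print(name, column)
```

Respondents in the base group have zeros in every dummy column, so each estimated coefficient is read relative to that group.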
i. Interpret the coefficient on age. As a person ages 10 years, what happens to the five-year mortality rate?
ii. Generate the log of age. Rerun the regression above, replacing age with the log of age. Interpret the effect of age again.
iii. Interpret the coefficient on married. Does this coefficient make sense to you? Why?
iv. Look at the coefficients on the income dummy variables. Interpret the coefficient on the dummy for income group 6 (i.e., what does the coefficient mean, and how large is the effect?).
v. Why is it a good idea to replace categorical variables like race, education, and income with dummies?
vi. At the 5% significance level, can you reject the null hypothesis that the coefficient on married is equal to zero? Why? How is the t-statistic obtained?
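The quantities behind questions i, ii, and vi can be written out in generic textbook notation. This is a sketch under the assumption that the regression is a linear probability model estimated by OLS; the symbols $\beta_{\text{age}}$, $\gamma_{\text{age}}$, and $\hat\beta_{\text{married}}$ are labels chosen here, not the assignment's.

```latex
% Level-level: the coefficient on age is a change in probability per year.
\[
\Pr(\text{diedin5}=1 \mid x) = \beta_0 + \beta_{\text{age}}\,\text{age} + \dots
\quad\Longrightarrow\quad
\Delta \Pr(\text{diedin5}=1) = 10\,\beta_{\text{age}}
\ \text{ when } \Delta\text{age} = 10 .
\]

% Level-log: the coefficient on log(age) relates to a percentage change in age.
\[
\Delta \Pr(\text{diedin5}=1) \approx \frac{\gamma_{\text{age}}}{100}\,
\bigl(\%\Delta\,\text{age}\bigr).
\]

% t statistic for H_0: \beta_{married} = 0.
\[
t = \frac{\hat\beta_{\text{married}} - 0}{\operatorname{se}(\hat\beta_{\text{married}})} .
\]
```

In a large sample, the two-sided test at the 5% level rejects the null when $|t|$ exceeds roughly 1.96, or equivalently when the reported p-value is below 0.05.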
3) Regression II
Construct a set of dummy variables for the following four BMI categories:
• Underweight, if BMI<=19
• Overweight, if BMI>25 and BMI<=30
• Obese, if BMI>30 and BMI<=35
• Severely obese, if BMI>35.
Add the four dummy variables above to the previous regression you used in Part 2 (ii).
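The four BMI categories can be sketched as 0/1 indicators as follows. This is an illustration in Python; the function name `bmi_dummies` and the dictionary keys are assumptions, and the cut-offs are the ones listed in this part.

```python
def bmi_dummies(bmi):
    """Return the four BMI indicators for a single BMI value."""
    return {
        "underweight": int(bmi <= 19),
        "overweight": int(25 < bmi <= 30),
        "obese": int(30 < bmi <= 35),
        "severely_obese": int(bmi > 35),
    }


for value in (18.5, 22, 27, 33, 40):
    print(value, bmi_dummies(value))
```

A BMI above 19 and at most 25 gives zeros in all four indicators, so that range acts as the reference group. When building the indicators on real data, check how missing BMI values are treated so that they are not silently assigned to a category.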
i. Why have you not included a dummy for 19<BMI<=25?
ii. Interpret the coefficient on overweight. Does this coefficient make sense to you? Why?
iii. Why is the coefficient on underweight such a large positive number?
iv. Run a test of joint significance on the BMI dummies generated above. What is the null hypothesis? Do you reject it? Why?
v. Run a test of linear restrictions to test whether the effect of married is equal to the effect of male. What is the null hypothesis? Do you reject it? Why?
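The two tests in questions iv and v can be written out as follows. This is a sketch in standard textbook notation, assuming homoscedastic errors; $SSR_r$ and $SSR_{ur}$ denote the residual sums of squares of the restricted and unrestricted regressions, $n$ is the sample size, $k$ is the number of regressors in the unrestricted model, and $q = 4$ is the number of BMI dummies. None of these symbols come from the assignment.

```latex
% Joint significance of the four BMI dummies.
\[
H_0:\ \beta_{\text{underweight}} = \beta_{\text{overweight}}
     = \beta_{\text{obese}} = \beta_{\text{severely obese}} = 0,
\qquad
F = \frac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/(n-k-1)} \sim F_{q,\;n-k-1}.
\]

% Single linear restriction: equal effects of married and male.
\[
H_0:\ \beta_{\text{married}} = \beta_{\text{male}},
\qquad
t = \frac{\hat\beta_{\text{married}} - \hat\beta_{\text{male}}}
         {\sqrt{\widehat{\operatorname{Var}}(\hat\beta_{\text{married}})
              + \widehat{\operatorname{Var}}(\hat\beta_{\text{male}})
              - 2\,\widehat{\operatorname{Cov}}(\hat\beta_{\text{married}},
                                                \hat\beta_{\text{male}})}} .
\]
```

In both cases the null is rejected at the 5% level when the reported p-value is below 0.05.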
4) Test for heteroscedasticity
i. Use the model estimated in Part 3 to test for heteroscedasticity. What is the null hypothesis? Do you reject the null? What are the issues related to heteroscedasticity? What tests could you use?
ii. Rerun the regression above, correcting for heteroscedasticity. Can you see any difference? Why?
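One common choice for question i is the Breusch-Pagan test, and one common correction for question ii is to use heteroscedasticity-robust (White) standard errors. The sketch below uses generic notation that is assumed here: $\hat u_i$ are the OLS residuals, $x_{i1},\dots,x_{ik}$ the regressors, and $x_i$ the regressor vector including the constant. Other tests, such as the White test, are equally valid choices.

```latex
% Breusch-Pagan: auxiliary regression of squared residuals on the regressors.
\[
\hat u_i^{2} = \delta_0 + \delta_1 x_{i1} + \dots + \delta_k x_{ik} + v_i,
\qquad
H_0:\ \delta_1 = \dots = \delta_k = 0 \ \ (\text{homoscedasticity}),
\]
\[
LM = n\,R^{2}_{\hat u^{2}} \ \overset{a}{\sim}\ \chi^{2}_{k}.
\]

% Heteroscedasticity-robust variance estimator for the OLS coefficients.
\[
\widehat{\operatorname{Var}}(\hat\beta)
  = (X'X)^{-1}\Bigl(\sum_{i=1}^{n} \hat u_i^{2}\, x_i x_i'\Bigr)(X'X)^{-1}.
\]
```

The robust correction leaves the coefficient estimates unchanged and alters only the standard errors, and therefore the t-statistics and p-values.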
5) The Gauss-Markov Theorem
i. Illustrate the Gauss-Markov Theorem. What assumptions does it make? What desirable properties will the OLS estimator then have? What does each property mean in practice?
ii. Do the assumptions always hold? If they are violated, what issues may occur? How would you handle these violations?
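For reference, the theorem can be stated compactly in matrix notation. This is the standard textbook formulation, not the assignment's own wording; $X$ is assumed here to be the $n \times (k+1)$ regressor matrix including a constant, and $\tilde\beta$ denotes any other linear unbiased estimator.

```latex
% Model and assumptions.
\[
y = X\beta + u, \qquad
\operatorname{rank}(X) = k+1, \qquad
E(u \mid X) = 0, \qquad
\operatorname{Var}(u \mid X) = \sigma^{2} I_n .
\]

% Implications for OLS.
\[
\hat\beta = (X'X)^{-1}X'y, \qquad
E(\hat\beta \mid X) = \beta, \qquad
\operatorname{Var}(\hat\beta \mid X) = \sigma^{2}(X'X)^{-1},
\]
\[
\operatorname{Var}(\tilde\beta \mid X) - \operatorname{Var}(\hat\beta \mid X)
\ \text{ is positive semidefinite for every linear unbiased } \tilde\beta .
\]
```

The last line is what "best" means in BLUE: among linear unbiased estimators, OLS has the smallest variance. Normality of the errors is not among the assumptions.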
2023-11-23