
Econ 4305: Assignment 1

Due on 1 September at 11.59pm

August 13, 2023

Collaboration is allowed and encouraged, but each student should independently write their own answers and code. Copy-pasting from someone else is NOT allowed. The same goes for AI tools such as ChatGPT: you can consult them and get them to help you, but you cannot copy-paste. The answers to the questions, as well as the code used to generate any numbers, tables, and figures in the answers, should be submitted via Canvas by 11.59pm on the due date. Submit the answers as a PDF together with the do-file used to generate any numbers, tables, or figures in the assignment submission folder. Please name your PDF and do-file with your student number only (e.g. A0318795M.pdf and A0318795M.do). Do not zip the files into one zip file.

In Sections 1 and 2 we will be using a few variables from the 1979 Cohort of the National Longitudinal Survey of Youth (NLSY). The NLSY79 Cohort is a longitudinal project that follows the lives of a sample of American youth born between 1957-64. The cohort originally included 12,686 respondents ages 14-22 when first interviewed in 1979; after two sub-samples were dropped, 9,964 respondents remain in the eligible samples.

The data that you can find in the Canvas folder has been cleaned and is ready for use. If you are interested, you can access the raw data, which contains many more variables, by going to:

https://www.nlsinfo.org/content/cohorts/nlsy79

In Section 3 we will be using survey data on hair length, height, and gender.

Style guide:

Organize your answers in the same order as they appear in the assignment. Label the answers "Answer 1", "Answer 2", ... etc. Label the Figures and Tables as well.

Justify your answers but keep them short and concise. No question needs an answer longer than 3-4 sentences. Most questions just need one sentence.

For any answer containing a value above 10 you don’t need to include any decimal points.  For values between 1 and 10 use one decimal point. For values below 1 use two decimal points (e.g. if the value is 0.001 you can write 0.00).

1 Explore the data

Start by creating a do-file. Load the data file "nlsy_5vars" using the "use" command. Use the describe command and the browse window to read the variable labels and familiarize yourself with the 5 variables.
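As a rough sketch, the first lines of the do-file might look like the following. It assumes that the data file is saved as nlsy_5vars.dta in Stata's current working directory; adjust the path if your file lives elsewhere.

```stata
* Sketch only: assumes nlsy_5vars.dta is in the current working directory
clear all
use "nlsy_5vars", clear

* List variable names, storage types, and labels
describe

* Open the data browser to inspect individual observations
browse
```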

Create a histogram of the variable for income in 2018 (Q13_5_TRUNC_2018) (you can use the command "hist Q13_5_TRUNC_2018").

Question 1: What stands out in this histogram?  Are there some values that are more common than others? What do you think is the reason for these values being common?

Create a histogram of the variable for education (HGC_EVER_XRND).

Question 2: What stands out in this histogram?  Are there some values that are more common than others? What do you think is the reason for these values being common?

(No need to paste the histograms into the answers, I’ll see how you created the histograms in your code.)

2    OLS Regressions: Income and Education

For each regression you run in Section 2, save the results using the eststo command shown in the lecture.

2.1 Univariate regression

Run the regression:

$$\text{Income}_i = \alpha + \beta \times \text{Education}_i + \varepsilon_i$$

(hint: eststo reg1: reg Q13_5_TRUNC_2018 HGC_EVER_XRND, robust)

2.1.1 Interpret the coefficient

Question 3: Interpret the coefficient beta in one sentence.

2.1.2 Extensive margin effect

Create a variable for having any income (=1 if income>0 and =0 if income==0). Run the regression:

$$\text{Any\_income}_i = \alpha + \beta \times \text{Education}_i + \varepsilon_i$$
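One possible way to build the indicator and run this regression is sketched below. It assumes the income variable is stored as Q13_5_TRUNC_2018 and that eststo is available through the estout package; the names any_income and reg2 are illustrative choices, not prescribed ones.

```stata
* Sketch only: variable and result names are assumptions
* Stata treats missing values as larger than any number, so (income > 0)
* alone would code missing incomes as 1; the if-condition leaves them missing
gen any_income = (Q13_5_TRUNC_2018 > 0) if !missing(Q13_5_TRUNC_2018)
label variable any_income "Has any income in 2018"

* Check the new variable before using it
tab any_income, missing

eststo reg2: reg any_income HGC_EVER_XRND, robust
```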

Question 4: Interpret the coefficient beta in one sentence.

2.1.3 Non-linear relationship

Create a variable for the natural logarithm of income (ln_income). Create a histogram for ln_income.

Question 5: What kind of distribution does the histogram remind you of?

Run the regression:

$$\text{ln\_income}_i = \alpha + \beta \times \text{Education}_i + \varepsilon_i$$
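A minimal sketch of these steps follows, again assuming the income variable is stored as Q13_5_TRUNC_2018; the result name reg3 is only an illustration. Note that ln(0) is missing in Stata, so respondents with zero income drop out of this regression.

```stata
* Sketch only: variable and result names are assumptions
* ln() returns missing for zero income, so those observations
* are excluded from the histogram and the regression below
gen ln_income = ln(Q13_5_TRUNC_2018)
label variable ln_income "Log income in 2018"

hist ln_income

* Compare the sample size here with the level regression
eststo reg3: reg ln_income HGC_EVER_XRND, robust
```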

Question 6: Interpret the beta coefficient in one sentence.

2.1.4 Causal interpretation

Question 7: Explain two potential reasons why β may not be the causal effect of education on income. (2-3 sentences, make sure to name the type of bias in your explanation)

Question 8: Would the biases you described be positive or negative (i.e. would they make β̂ larger or smaller than the causal effect)? Justify your answer in two sentences.

2.2 Multi-variate regression

Run the regression:

$$\text{ln\_income}_i = \alpha + \beta_1 \times \text{Education}_i + \beta_2 \times \text{Fathers\_Education}_i + \beta_3 \times \text{Mothers\_Education}_i + \varepsilon_i$$

Question 9: What kind of bias could adding these controls potentially solve? (explain in 1-2 sentences, feel free to refer to your answer to question 7)

2.3    Create table

Table 1: Create a table with all of the regressions above, using the esttab command used in class. Make it look nice by giving the variables labels that are easy to understand. Suppress output that is not interesting for the question about education's effect on income and add notes saying "Yes" or "No" indicating whether the parental education controls are used. Paste / input the table in your answers.
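The table could be assembled along the following lines. This is only a sketch and may differ from the version shown in class: it assumes the estout package is installed, that the four regressions were stored as reg1 to reg4, and it uses father_educ and mother_educ as placeholder names for the parental education variables (check describe for the actual names in the data).

```stata
* Sketch only. Requires the estout package (ssc install estout).
* reg1-reg4 are assumed names for the stored results from Sections 2.1 and 2.2;
* father_educ and mother_educ are placeholders for the real variable names.
label variable HGC_EVER_XRND "Highest grade completed"

* indicate() replaces the listed coefficients with a Yes/No row;
* drop(_cons) hides the intercept
esttab reg1 reg2 reg3 reg4, ///
    se label drop(_cons) ///
    indicate("Parental education controls = father_educ mother_educ") ///
    mtitles("Income" "Any income" "ln(Income)" "ln(Income)") ///
    star(* 0.10 ** 0.05 *** 0.01) ///
    title("Table 1: Income and education")
```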

2.4    Good or bad control?

Now imagine that we are interested in the causal effect of a person's father's education on income. We start by running the following regression (no need to store results or make a table):

$$\text{ln\_income}_i = \alpha + \beta_1 \times \text{Fathers\_Education}_i + \varepsilon_i$$

Then run the regression:

$$\text{ln\_income}_i = \alpha + \beta_1 \times \text{Fathers\_Education}_i + \beta_2 \times \text{Mothers\_Education}_i + \varepsilon_i$$

Question 10: How does including the mother's education in the regression change the coefficient on Fathers_Education_i? Do you think it gets us closer to or further away from the total causal effect of father's education on income? If you think it gets us closer to the causal effect, what type of bias is it reducing? If it gets us further away from the causal effect, why is it a "bad control"? (2-3 sentences)

Question 11: Now look back at your results from the regression in part 2.2 (where we control for education as well as mother's education). Is the coefficient on Fathers_Education_i smaller or larger than in the previous regression? Do you think controlling for Education_i gets us closer to or further away from the total causal effect of father's education on income? If you think it gets us closer to the causal effect, what type of bias is it reducing? If it gets us further away from the causal effect, why is it a "bad control"?

3    Scatter plots: Length of hair and height

Load the data file "height_hair_gender.csv" using the "import" command. Use the describe command to read the variable labels of the 4 variables.

3.1    Clean the data

Before we can use the data we have to clean it.  Begin by turning the howlongisyourhairincm variable into a numerical variable using the destring command.

Pro tip: before using destring you can remove the letters "cm" from all observations using a loop like the one below.

forval i = 1/100 {
    replace howlongisyourhairincm = "`i'" if howlongisyourhairincm == "`i'cm"
}
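An alternative to the loop, shown here only as a sketch, is to strip the text "cm" with the string function subinstr() and then convert the variable. Unlike the loop, this also handles non-integer entries such as "12.5cm". Entries that still contain other non-numeric characters would stop destring from converting the variable, so check the result.

```stata
* Sketch only: an alternative to the loop, not the method from the lecture
* Remove every occurrence of "cm" from the string, then trim stray spaces
replace howlongisyourhairincm = subinstr(howlongisyourhairincm, "cm", "", .)
replace howlongisyourhairincm = strtrim(howlongisyourhairincm)

* Convert the cleaned string variable to numeric in place
destring howlongisyourhairincm, replace

* Confirm that the variable is now numeric and inspect its range
summarize howlongisyourhairincm
```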

3.2 Find the association between height and hair length

Run the following regression:

$$\text{height}_i = \alpha + \beta_1 \times \text{hair length}_i + \varepsilon_i$$

(hint: is there a risk of heteroscedasticity? If yes, then use robust standard errors)

Question 12: Is the value of the β1 coefficient surprising to you? Explain your answer in one sentence.

3.3 Use a scatter plot to find an issue with the data

Figure 1: Create a scatter plot with a linear prediction line (similar to that in the lecture) with height on the y-axis and hair length on the x-axis. Include the plot in the answers. Fix any problem you observed in the data from the scatter plot.
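A scatter plot with a fitted line can be sketched as below. The variable name height is a placeholder, since the name created by the import is not given here; replace it with the name shown by describe.

```stata
* Sketch only: `height' is a placeholder for the actual height variable name
twoway (scatter height howlongisyourhairincm) ///
       (lfit height howlongisyourhairincm), ///
       ytitle("Height") xtitle("Hair length (cm)") ///
       legend(order(1 "Observations" 2 "Linear prediction"))
```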

Figure 2: Rerun the regression and create a new scatter plot.

Question 13: Is β1 statistically significantly different from zero (at the 5% level)? How likely is it that we would find a coefficient with an absolute value at least as large as that of β1 if there were no association between height and hair length?

Question 14: Is β1 (the slope of the predicted line) surprising to you now? Answer with one sentence.

3.4 Control for gender

Create an indicator variable called "female" that takes the value of 1 if the respondent is female and 0 otherwise.

(hint: gen female=(condition for indicator variable to be 1))

Run the following regression:

$$\text{height}_i = \alpha + \beta_1 \times \text{hair length}_i + \beta_2 \times \text{female}_i + \varepsilon_i$$

Question 15: Do you think β1 is a plausible estimate of the causal effect of hair length on height? Why / why not? (1-2 sentences)

3.5 Scatter plot with residualized outcome variable

Create a scatter plot with a linear prediction line, with the residual from the regression below on the y-axis and hair length on the x-axis (similar to what I showed in the lecture):

$$\text{height}_i = \alpha + \beta \times \text{female}_i + \varepsilon_i$$

Figure 3: Add the graph to the answers.
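The residualized plot can be sketched as follows, under the assumptions that the female indicator from Section 3.4 already exists and that height is a placeholder for the actual height variable name; height_resid is an illustrative name for the new variable.

```stata
* Sketch only: `height' and `height_resid' are placeholder names
* Step 1: regress height on the female indicator
reg height female, robust

* Step 2: store the part of height not explained by gender
predict height_resid, residuals

* Step 3: plot the residual against hair length with a fitted line
twoway (scatter height_resid howlongisyourhairincm) ///
       (lfit height_resid howlongisyourhairincm), ///
       ytitle("Height, residualized on female") xtitle("Hair length (cm)")
```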

3.6 Interpretation

Question 16: If you have no other data than hair length, is that a useful predictor of height? Justify your answer with one sentence.

Question 17: Do you think there is a causal relationship between hair length and height? Justify your answer with one sentence.