ECONOMICS 705

Econometrics II (First Half)

Department of Economics

Fall 2023

PROBLEM SET 2

Version of October 15, 2023

Due at the start of class on October 30, 2023

Writing up your answers to the empirical problems

Your write-up should consist of two portions. The first portion is just the answers to the questions, with whatever text is required to explain them. This portion must be typed!

The second portion, on separate pages, consists of a Stata log file that shows how you got the answers to the empirical questions. The log file must be clear and must include comments that will allow the reader to quickly see the command or commands leading to each answer. It should not include everything you tried – just the final set of commands employed to get the answers.
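As a rough illustration of the log-file layout described above, a do-file might be organized along the following lines. This is only a sketch: the log file name is a placeholder, and the comment text is an assumption about how one might label each answer.

```stata
* Minimal sketch of a commented log file (file name is a placeholder)
capture log close                      // close any log left open by an earlier run
log using "LastnameFirstnamePS2.log", replace text

* ----- Problem 1: drop observations with missing re78 -----
* (final commands for Problem 1 go here)

* ----- Problem 2: rescale the earnings variables -----
* (final commands for Problem 2 go here)

log close
```

Labeling each block with its problem number lets the reader find the commands behind each answer quickly.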

Submit completed problem sets on Canvas as a single PDF file with a filename of the form “LastnameFirstnamePS#.pdf” where “#” is 1, 2, or 3, depending on the problem set.

Problem sets not turned in on Canvas using the format just described will receive no credit.

Data for empirical problems

This problem set uses data from the National Supported Work Demonstration (NSW), one of the first major social experiments in the world.

A sequence of papers uses data on the male participants in the NSW to study the performance of alternative non-experimental identification strategies and estimators using the experimental impact estimates as a benchmark. These papers include LaLonde (1986), Heckman and Hotz (1989), Dehejia and Wahba (1999, 2002) and Smith and Todd (2005a,b).

LaLonde (1986) also studies the female participants in the NSW who were in the Aid to Families with Dependent Children (AFDC = “welfare”) target group, but his data on the AFDC women were lost.

Calónico and Smith (2017) recreate his analysis file for the AFDC women. It is their recreation that we use for this problem set. The Calónico and Smith (2017) paper is available on Canvas.

File name on Canvas: “Economics 705 Fall 2023 NSW Women Data.dta”

The dataset contains the following variables:

treated: 1 for the experimental treatment group and 0 for the experimental control group

service: 1/0 received services

age: age in years

educ: years of schooling

black: 1/0 black

hisp: 1/0 Hispanic

married: 1/0 married

re74: real earnings in “1974”

re75: real earnings in 1975

re78: real earnings in 1978

Random assignment took place in 1976 and 1977, so real earnings in “1974” and real earnings in 1975 are conditioning variables while real earnings in 1978 is the outcome variable. Smith and Todd (2005) explain the shock quotes on “1974”.

Note that I made up the service variable for this problem set. In fact, the NSW experiment had very few no-shows and essentially no control group substitution.

Problems

1. (0 points) Drop observations with missing values of real earnings in 1978 (i.e. re78). In real life, one would worry more about this, and so do more about this, than we are doing here. There are missing values of re78 because many individuals did not respond to the follow-up survey that provides the information used to construct it (“unit non-response”), or they did respond to the survey but did not respond to the specific question about earnings (“item non-response”).

2. (5 points) Divide re74, re75 and re78 by 1000.0 to make writing up the answers easier.
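One possible way to carry out the data preparation in Problems 1 and 2 is sketched below. It assumes the dataset named above is already loaded in memory; the final summarize line is just an optional check and is not required by the problem.

```stata
* Problem 1: keep only observations with non-missing 1978 earnings
drop if missing(re78)

* Problem 2: express the three earnings variables in thousands of dollars
foreach v of varlist re74 re75 re78 {
    replace `v' = `v' / 1000
}

* Optional check that the rescaling looks sensible
summarize re74 re75 re78
```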

3. (5 points) Estimate the conditional mean of earnings in 1978 (re78) given earnings in 1975 (re75) in three ways:

a) A linear function of re75

b) A quadratic function (i.e. a linear term and a squared term) in re75

c) A cubic function (i.e. a linear term, a squared term, and a cubed term) in re75.

Describe and remark on the resulting estimates; be sure in particular to give a substantive interpretation to the estimates from the linear specification.
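The three specifications can be sketched with Stata's factor-variable notation, which builds the squared and cubed terms without creating new variables. The stored-estimate names (linear, quadratic, cubic) are hypothetical labels chosen for this illustration, and the comparison table is only one of several reasonable ways to line up the results.

```stata
* a) linear in re75
regress re78 re75
estimates store linear

* b) quadratic in re75: c.re75##c.re75 expands to re75 and re75 squared
regress re78 c.re75##c.re75
estimates store quadratic

* c) cubic in re75: adds the cubed term
regress re78 c.re75##c.re75##c.re75
estimates store cubic

* Side-by-side coefficients, standard errors, and fit statistics
estimates table linear quadratic cubic, b(%9.3f) se stats(N r2 r2_a)
```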

4. (5 points) Interpret the positive coefficient on the square of re75 in the quadratic model in the preceding question.

5. (5 points) The R² values for the models in Problem 3 are quite low given that the right-hand side variable is a lagged version of the left-hand side variable. Is there a feature of the data that might account for this?

6. (5 points) Repeat the exercise in Problem 3 but including both re74 and re75 on the right-hand side (but without any interaction terms). How much do the R² values increase relative to those obtained in Problem 3? Does the extent of the increase seem large or small? Explain.

7. (5 points) Estimate the conditional mean of real earnings in 1978 using indicators for the following five categories of real earnings in 1975:

[0.0], (0.0, 1.0], (1.0, 3.0], (3.0, 5.0], (5.0, ∞).

Interpret the resulting estimates and compare them to the estimates from the preceding problem both in terms of what they reveal about the conditional mean function and in terms of fit.
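A minimal sketch of how the five categories might be coded follows. The variable name cat75 is hypothetical, and the code assumes re75 has already been divided by 1000 (Problem 2) and is never negative.

```stata
* cat75 is a hypothetical name for the five-category version of re75
generate byte cat75 = .
replace cat75 = 1 if re75 == 0
replace cat75 = 2 if re75 > 0 & re75 <= 1
replace cat75 = 3 if re75 > 1 & re75 <= 3
replace cat75 = 4 if re75 > 3 & re75 <= 5
replace cat75 = 5 if re75 > 5 & !missing(re75)

* Check the cell counts before estimating
tabulate cat75, missing

* Regression on category indicators; the zero-earnings group is the omitted base
regress re78 i.cat75
```

Keeping the constant and omitting one category leaves the reported R² defined the same way as in the polynomial models, which makes the comparison of fit more straightforward.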

8. (5 points) Is the estimator in the preceding problem non-parametric? Is it unbiased in finite samples?

9. (5 points) Create a graph in Stata with real earnings in 1975 on the horizontal axis and real earnings in 1978 on the vertical axis. Include the estimated regression line from the model with categories in Problem 7. The twoway command will be useful, specifically the scatter and line plot types.

10. (5 points) What happens to fit as measured by R² if you reduce the number of categories in Problem 7 by combining two of the existing categories? Does your answer depend on which two categories you combine? Does your answer change if the measure of fit is the adjusted R² rather than R²?

11. (5 points) Among the three specifications in Problem 3 (linear in re75, quadratic in re75, cubic in re75), the re75 category indicators in Problem 7, and the specification with cubics in both re74 and re75 in Problem 6, which do you prefer? Make an explicit case for your choice that includes at least two reasons to prefer your chosen estimator.

12. (5 points) Suppose that in your application what you really care about is predicting real earnings in 1978 for individuals with zero earnings in 1975. Based on the mean squared prediction error for these individuals, which model do you prefer? Explain.

13. (5 points) How does your answer to the preceding question change, if at all, if what you really care about is the mean squared prediction error for individuals with real earnings in 1975 greater than or equal to $10,000?

Estimate a conditional mean function (using npregress in Stata)

14. (5 points) Use npregress with the kernel option to estimate a nonparametric regression of real earnings in 1978 on real earnings in 1975. Use an Epanechnikov kernel, 50 bootstrap replications and set the random number seed (again) to “54321”. Be sure to use the “estimator(constant)” and “noderivative” options so that you get the “local constant” estimator we discussed in class. We will consider “local linear” kernel regression in the lecture on matching and weighting estimators.

[Hint: Keep it to five bootstrap replications until you are sure you have the Stata code that you want, then do a final run with 50 bootstrap replications. This will save you a lot of waiting time, as one feature of non-parametric estimation on datasets of reasonable size is that it is much more computationally intensive, and thus slower, than parametric estimation.]
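The workflow in the hint might look roughly like the following sketch. It assumes the option spellings documented for npregress kernel (estimator(), kernel(), noderivatives, reps(), seed()); check help npregress kernel in your Stata version before relying on it. The /// line continuations work in a do-file but not in the interactive command window.

```stata
* Trial run: five bootstrap replications while debugging the code
npregress kernel re78 re75, estimator(constant) kernel(epanechnikov) ///
    noderivatives reps(5) seed(54321)

* Final run: 50 bootstrap replications with the same seed
npregress kernel re78 re75, estimator(constant) kernel(epanechnikov) ///
    noderivatives reps(50) seed(54321)
```

Only the final run belongs in the log file that you submit.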

15. (5 points) In the output from the npregress command in the preceding problem, explain the substantive meaning of the estimate in the “Mean” row.

16. (5 points) Use the margins command to obtain predicted values of the conditional mean of re78 for re75 values in {0.0, 2.0, 4.0, 6.0, 8.0, 10.0}. Describe the predicted values and any interesting patterns they embody.
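A margins call of roughly the following form is one way to request predictions on that grid. It is a sketch that assumes the npregress kernel results from Problem 14 are still the active estimates and that bootstrap standard errors are wanted; the replication count and seed simply mirror Problem 14 and are not prescribed for this problem.

```stata
* Predicted conditional mean of re78 at re75 = 0, 2, 4, 6, 8, 10
* (0(2)10) is Stata numlist shorthand for "0 to 10 in steps of 2"
margins, at(re75=(0(2)10)) vce(bootstrap, reps(50) seed(54321))
```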

17. (5 points) Use a command like

marginsplot, legend(off) scale(1.1) addplot(scatter re78 re75 if re75 <= 10, msize(tiny))

to produce a scatterplot that includes a line formed by linking the predicted values obtained in Problem 16.

18. (5 points) Use the margins command to produce contrasts (i.e. estimated changes) between the predicted values of the conditional mean function obtained in Problem 16. An option like contrast(atcontrast(ar)) may prove useful. Describe and discuss the resulting contrasts.

19. (5 points) Use a command like

marginsplot, legend(off)

to plot the contrasts obtained in Problem 18.

20. (5 points) Repeat the exercises in Problems 14, 16 and 18, but fixing the bandwidth at 40 rather than having Stata choose the bandwidth via cross-validation. Use the option meanbwidth(40, copy) in the npregress command to accomplish this. Compare the results obtained using the two different bandwidths and account for any differences you find.

21. (5 points) Compare predicted values of the mean of re78 when re75 = 15 obtained using the linear, quadratic and cubic parametric models in Problem 3, and using the non-parametric regressions with bandwidths chosen by cross-validation and fixed at 40 estimated in Problems 14 and 20. Discuss any differences you find.