University of East Anglia: School of Economics

ECO-5006A: Introductory Econometrics

Autumn 2020

Stata Project


Please read the Project Assignment Brief and the instructions below very carefully before attempting any of the questions. The Assignment Brief is available in the ‘Project’ section of the module’s Blackboard site and provides some general information and further instructions.


• All the statistical analysis needs to be done using the econometrics software package Stata.

• The data set PROJECT 2020.dta contains information on 28,424 university graduates in full-time employment, based on the DLHE Survey of 2016/17. Please refer to the Assignment Brief for more information about this data set.

• The questions of the Project are based on the following main topic:


“Does studying Economics pay off, relative to other subjects in Social Sciences?

Evidence using data of recent UK graduates”


• In particular, is there evidence that Economics graduates ‘do better/worse’ in the graduate market, relative to graduates who studied the other subjects available in the data? And how much better/worse do they do? By ‘doing better/worse’, we mean whether:

– they earn more/less based on self-reported salaries of graduates (before any deductions)

– they are more/less likely to secure a managerial / professional position after graduation

Thus, in your analysis, you need to use both outcome variables ‘salary’ and ‘professional’.

• In questions that require you to use Stata commands to get your answer, make sure you clearly show these Stata commands within your answers.

• Presentation of your answers matters: all graphs, equations, results and discussions need to be well presented.

• For the main text, you need to use ‘calibri’ font of size 11, and allow 1.15 line spacing. For text within tables, you can use smaller font, up to size 9. You are also allowed to change the size of your graphs, as long as the graphs are still clear to read (i.e. clear legends, titles, etc.)

• Please make sure your answer to each part of the Project (i.e. part (a), part (b), etc.) starts on a new page.


QUESTIONS

(a) [12 Marks]

Investigate the main question of the Project by using descriptive statistics only; e.g. by appropriate use of means, medians, variances, graphs, etc. Don’t forget that you need to investigate this in terms of both outcome variables ‘salary’ and ‘professional’. There is no word limit for this question, but your answer needs to be presented within two A4 sides (so, all tables, graphs and discussions need to be presented within two A4 sides, i.e. one full page).
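As an illustration of the kind of descriptive commands part (a) has in mind, here is a minimal sketch. It assumes that the data file sits in the working directory and that subject is a numeric, value-labelled variable; neither is confirmed in this document, so check with the describe command first. It is one possible starting point, not a model answer.

```stata
* Minimal sketch. Assumption: subject is a numeric, value-labelled variable.
use "PROJECT 2020.dta", clear

* Mean, median and spread of salary for each subject
tabstat salary, by(subject) statistics(n mean median sd)

* Share of graduates in professional roles for each subject
* (the mean of a 0/1 variable is the proportion of ones)
tabstat professional, by(subject) statistics(n mean)

* Horizontal box plots of salary across subjects
graph hbox salary, over(subject)
```

The tables and graph would still need to be formatted and discussed within the page limit stated above.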


(b) [50 Marks]

In this part you need to investigate the main question of the Project by using regression analysis (i.e. appropriate MLR models). Don’t forget that you need to investigate this in terms of both outcome variables, ‘salary’ and ‘professional’. That is, you will need two separate MLR models, one using ‘salary’ as the dependent variable, and one using ‘professional’ as the dependent variable.

So, in this part, you need to investigate whether Economics graduates are expected to ‘earn more/less’ relative to each of the other subjects, and whether Economics graduates are more/less likely ‘to get into professional roles’, holding other variables fixed (i.e. if Economics graduates had the same tariff scores, the same socio-economic background, etc., as the graduates of the other subjects).

Here are some important instructions/notes. Please read these very carefully:

(1) The dependent variable salary must be used in logarithmic form (i.e. the natural log of salary). Note that the professional dependent variable is binary (taking value 0 for ‘non-professional’ and value 1 for ‘professional’). An example for an MLR model with a binary dependent variable (that is, the Linear Probability Model) has been covered in the live lecture of Week 11.

(2) Your main explanatory variable (i.e. subject) is categorical, so it needs to be added to the MLR model in the form of dummy variables. We have seen numerous examples of how to include categorical variables in the MLR model using dummy variables from Week 9 onwards (both in the Asynchronous lecture notes/videos and in the Synchronous sessions).

(3) In your discussion of the results of your regression models, you need to provide an appropriate interpretation of the coefficients of the subject-related dummy variables. You need to provide this interpretation:

(i) for models that don’t include any other explanatory variables. That is, in the models where you regress log(salary) or professional just on the subject-related dummy variables. Let’s call this the ‘empty’ model.

(ii) for models that include the rest of the explanatory variables. See point (6) for instructions related to what other explanatory variables to include in your models. Let’s call this the ‘full’ model.

(4) You also need to conduct hypothesis testing, to test:

(i) whether there is statistical evidence that the mean log of salary of Economics graduates differs from the mean log of salary of graduates of each of the other subjects (i.e. using the results of your ‘salary’ model), and whether Economics graduates are more/less likely to get into professional jobs (i.e. using the results of your ‘professional’ model), holding the other variables fixed. You can do this by commenting on the relevant p-values obtained in Stata, and you need to do this for both the empty and the full models.

(ii) whether there is evidence of joint significance of the subject dummy variables using F-tests (this was covered in Week 10), again for both the ‘salary’ and the ‘professional’ models, holding the other variables fixed. For the F-tests, please provide the obtained F-statistics as well as the p-values. Note also that F-testing needs to be done only on your full models.

(5) Following your interpretation of coefficients and hypothesis testing, also comment on how much your results have changed by including the additional explanatory variables (i.e. how much the effects of subject dummy variables on the outcome variables have changed, by ‘holding these characteristics/factors fixed’ across graduates), both in terms of magnitude and statistical significance.

(6) It is up to you to decide which other explanatory variables you add to your ‘salary’ and ‘professional’ models, and it is not necessary that both the ‘salary’ model and the ‘professional’ model include the same variables. Note that categorical variables (such as degree class or region) need to be added as dummy variables. For each explanatory variable that you add, you need to offer a short justification of why it is important for this variable to be included in the model (about 100-150 words for the justification of each variable). Note that you need to present a single justification covering both the ‘salary’ and ‘professional’ models, instead of a 100-150 word justification for ‘salary’ and then another 100-150 word justification for ‘professional’. For the variables that you decide not to add to your model, you also need to provide justification as to why these were not added (about 50-100 words for each variable not added). For an example, please see the uploaded ‘Example of variable justification for Part B - a hypothetical example’ document, available in the SAMPLE EXAMPLE folder.

(7) Note that variables tariff and age must be included in the full models.

– For tariff, you need to decide whether you use it in its linear form, or whether you include a quadratic term / replace it by the natural log of tariff. Your choice needs to be justified within your justifications above.

– For the variable representing the graduates’ age, it must be included in the model as a quadratic function (i.e. add both age and age²). You also need to provide two graphs, one for the predicted log of salary against age, and one for the predicted probability of getting into professional employment against age. Then, based on these graphs, you need to discuss the relationship between age and salary/professional (in about 200-250 words overall).

– Note that, for tariff and age, you don’t need to explain why these variables have been added to the model.

(8) Your ‘salary’ regression model needs to be tested for violation of MLR5 (i.e. whether there is a heteroskedasticity problem). Conduct this test only for your full model and present your test statistic as well as the p-value of this test. If there is statistical evidence of heteroskedasticity, then the standard errors presented in your regressions must be made ‘robust to heteroskedasticity’. Also, if your model is made robust to heteroskedasticity, then note that the hypothesis testing under point (4) needs to be done on the ‘robust’ model. Please note that testing for heteroskedasticity and correcting the standard errors is covered in the material of Week 12. Also note that in your ‘professional’ regression model, heteroskedasticity-robust standard errors must be used (you don’t need to test for heteroskedasticity in the ‘professional’ model).

(9) Note that all your regression results need to be presented in one or two tables. There are Stata commands that create such tables automatically, such as the ‘outreg2’ command. This was discussed in the Support Session of Week 10 (an extract of the video recording where I discuss this can also be found in the ‘Introduction to Stata Material’ section on the module’s BB). My suggestion is to have one table with 5 columns: a first column for the variables, two columns for the estimates of the ‘salary’ models (one for the empty and one for the full model), and two columns for the estimates of the ‘professional’ models (again, one for the empty and one for the full model).
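To show how points (1) to (9) fit together in practice, here is a minimal sketch of one possible workflow. Several things in it are assumptions rather than facts from this brief: that subject is a numeric, value-labelled variable in which Economics is coded 1 (check with the describe and label list commands), the new variable name lsalary, the output file name results.doc, the age grid in the margins commands, and the choice of the Breusch-Pagan test (use whichever test was taught in Week 12). The outreg2 command is user-written and has to be installed first. Only the compulsory variables tariff and age appear in the full models here; your own additional explanatory variables would be added alongside them.

```stata
* Minimal sketch under the assumptions stated above; not a model answer.
* ssc install outreg2          // run once if outreg2 is not installed
use "PROJECT 2020.dta", clear

* Point (1): natural log of salary (lsalary is a hypothetical name)
generate lsalary = ln(salary)

* Points (2)-(3): factor-variable notation creates the subject dummies.
* ib1. makes category 1 the base; that Economics is coded 1 is an assumption.
regress lsalary ib1.subject
outreg2 using results.doc, replace ctitle(Salary empty)

regress professional ib1.subject, vce(robust)
outreg2 using results.doc, append ctitle(Professional empty)

* Points (7)-(8): full salary model with quadratic age, default standard errors,
* followed by the Breusch-Pagan heteroskedasticity test
regress lsalary ib1.subject tariff c.age##c.age
estat hettest

* If the test rejects homoskedasticity, re-estimate with robust standard errors
regress lsalary ib1.subject tariff c.age##c.age, vce(robust)
outreg2 using results.doc, append ctitle(Salary full)

* Point (4)(ii): joint F-test of the subject dummies
testparm i.subject

* Point (7): predicted log salary against age (the age grid is hypothetical)
margins, at(age=(21(5)61))
marginsplot

* Full professional model (Linear Probability Model), always robust
regress professional ib1.subject tariff c.age##c.age, vce(robust)
outreg2 using results.doc, append ctitle(Professional full)
testparm i.subject

* Predicted probability of professional employment against age
margins, at(age=(21(5)61))
marginsplot
```

Writing age as c.age##c.age, rather than generating a separate squared variable, lets margins recognise that age and age² move together, so the predicted values trace out the quadratic correctly.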


(c) [9 Marks]

Provide theoretical justification of your main findings in this project. This discussion needs to focus on the main topic of the project (i.e. ‘does studying Economics pay off relative to the other subjects?’). You also need to provide up to three academic references as part of your justification. Note that these academic references can be either published papers in academic journals or other academic reports published by academic institutions (such as the Institute for Fiscal Studies). Newspaper articles are not valid academic references. The answer to this part must be contained within one side of a page.


(d) [9 Marks]

Identify the 3 most important problems/limitations in your models/results and explain:

(i) why these are important problems/limitations (in terms of affecting the reliability of your estimated coefficients)

(ii) how each of these problems/limitations could be addressed.

The answer to this part must be contained within one side of a page.


(e) [10 Marks]

Present the main findings of your regression analysis within a single graph. Note that this can be a ‘combined graph’ and it can be produced either in Stata or Excel. A suggestion of what kind of graph to create is provided in the video recording of the Project Discussion Sessions, Session C (available in the Project section on BB). If the graph has been done in Stata, you don’t need to provide your Stata command. Also provide a discussion/summary of the findings presented in this graph and try to avoid using technical language (i.e. econometric terminology that would not make sense to a non-specialist). The answer to this part must be contained within one side of a page.


(f) [10 Marks]

Only for the ‘salary’ regression model, in your full model, add interaction terms between the subject dummy variables and one of the explanatory variables. Estimate this model, present your results and provide a discussion of the additional insights obtained following the model with the interaction terms. Note that for this part, you can just copy/paste the Stata output instead of creating your own table. Within your answer you also need to justify why you have picked this explanatory variable for the interaction terms. The answer to this part must be contained within two sides of a page, i.e. one full page.
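As an illustration of how interaction terms can be specified, here is a minimal sketch that interacts the subject dummies with tariff. The choice of tariff is purely illustrative, since you need to pick and justify your own variable. As before, the name lsalary, the coding of Economics as category 1 and the use of robust standard errors are assumptions, and your other explanatory variables would also need to be included.

```stata
* Minimal sketch; tariff is an illustrative choice of interaction variable.
use "PROJECT 2020.dta", clear
generate lsalary = ln(salary)      // hypothetical variable name

* ## adds the subject dummies, tariff, and every subject-by-tariff interaction.
* ib1. assumes Economics is coded 1 and is used as the base category.
regress lsalary ib1.subject##c.tariff c.age##c.age, vce(robust)

* Joint test of whether the tariff slope differs across subjects
testparm i.subject#c.tariff
```

In this specification, each interaction coefficient measures how the effect of tariff on log salary for that subject differs from the effect for the base subject.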


- END OF QUESTIONS -