Applications of Econometrics Spring 2022
Applications of Econometrics
Assessed Group Project
Spring 2022
The role of unions in labour markets is a long-standing research topic in labour economics. It is often found, for example, that unionised workers are paid higher wages, and some authors argue that unionisation could lead to lower inequality. Unions declined somewhat in importance in the second half of the 20th century, but there is currently a resurgence in unionisation in some countries, arguably as a response to increasing inequality.
In this project we try to estimate several effects of unions for the U.S. using the Survey of Income and Program Participation (SIPP). This is a household panel dataset with detailed information for a sample of U.S. households. It is representative of the U.S. population and has been used in many applied research projects. See the section ’Getting Started’ below on how you can obtain the data and prepare it for analysis. We think it makes sense to limit your sample to prime-age workers (age 25-50) for the entirety of the project.
All parts of the project carry equal weight. Groups have to submit a Word/PDF file that has answers to the questions below, along with a do-file that contains all the commands the group used.
● Both the Word/PDF document and the do-file have to be submitted before the deadline. Projects submitted without a do-file will incur the default penalty of a late submission (5 marks).
● Answers to questions should be limited to 3 pages per question (1-2 pages is probably enough).
● The do-file should be written in such a way that anyone with access to the raw data files can replicate the analysis.
● Stata outputs (tables/figures) have to be included in the document. It is not enough to refer to outputs in the Stata log/do-file.
● Before submission groups have to declare that the project is their own work. There is no separate form to complete; this can be done directly on Learn.
● Make sure that you are aware of the requirements for appropriate citation of references and data sources. Read the guidance on plagiarism in Section 4.4.1 of the Economics Honours Handbook and/or the general University guidance. If you include anything from another source it must be properly acknowledged, whether it’s a figure/table or a text passage or anything else.
● You are welcome to ask questions on piazza or come to helpdesks. We will try to help as much as possible with data preparation and are of course happy to clarify where things are unclear. To be fair to all students, we will typically not answer topical questions.
Time Series Questions
For this part you have to aggregate the SIPP data to a monthly time series (see ’Getting Started’ below).
1. Calculate averages by month and year of unionisation, log real wages and unemployment. Plot the time series for these over time. Make sure you label the axes correctly. Then run a simple regression of log real wages on unionisation rates, and unemployment on unionisation rates. The former could capture whether unionisation rates positively or negatively affect average wages, the latter could capture whether unionisation rates increase/decrease unemployment. Interpret your findings.
2. Is there evidence for seasonality and trends in unionisation rates, log real wages, and unemployment rates? What about serial correlation or heteroskedasticity? Investigate these issues and try to make your results in (1.) robust to those concerns.
3. The regressions we ran in (1.) and (2.) might suffer from omitted variable bias. This could be due to dynamics (e.g. lags) that you haven’t already dealt with in (2.) or to other important factors not captured by our variables. Explain why this might be the case and propose solutions. Run regressions including controls you think are sensible and interpret your results. The variables you include don’t have to be from the SIPP, e.g. you could try to include quarterly log GDP (see footnote 1). Hint: Don’t include too many, as we have a limited number of observations here. I’d say 10 is the absolute max. We’re not looking for a perfect specification, which is almost impossible. We’re looking for two or three specific concerns and how you could deal with them. You don’t have to try to solve all the potential problems.
Panel Questions
For this part you have to use the panel dimension of SIPP. A panel entity in this dataset is a person and time is measured in months.
4. Provide some descriptive statistics for your variables, such as the mean, minimum, and maximum of key variables (unionisation, wages, age, education etc.). Make sure you clearly indicate what you are reporting. This means you should not include the raw variable names in the table. Instead, use a descriptive label like ’hourly wage in $’. Hints: Report shares instead of means for categorical variables (e.g. education). It probably makes sense to report descriptive statistics for the sample you are using in the following questions (Q5-Q7), e.g. workers currently earning a positive wage. It might also make sense to check how representative your sample is of the population. In this question formatting is especially important, so make sure your tables/figures are clearly labelled and self-explanatory.
5. Estimate the union wage premium by pooled OLS. We usually do this by regressing log real wages on an indicator for whether the worker belongs to a union. Include your own choice of control variables. Some suggestions: education, time trends, whether there are children in the household, marital status, age, and race.
6. Estimate the union wage premium using random effects and fixed effects and compare your estimates to the POLS results (include the same controls as far as possible). Make sure your results are robust to heteroskedasticity and serial correlation. Explain why the numbers are different, which estimates we trust most, and discuss your findings (see footnote 2).
7. Now we look at some heterogeneity. Using the fixed effects estimator (and again your set of controls), estimate whether the union wage premium is different for e.g. women and men. You can be creative here. It would be interesting to compare the premium in different industries, for example. Discuss your findings.
1 E.g. available at https://fred.stlouisfed.org/.
2 When discussing your findings, a flavour of theoretical reasoning is a plus, i.e. why do economists think there’s a link between wages and being in a union? See e.g. ’Labor Economics’ by George Borjas, McGraw Hill 2020, or one of the many online resources, e.g. https://economics.mit.edu/files/4689, which is advanced but very good.
Getting Started
In this section we provide basic instructions on how to download the dataset and make it ready for analysis. The extracts below are from data-prep.do, which is available on Learn. We will update this section if many students struggle with something (we might also have overlooked something, of course). You are very welcome to come to the helpdesks or ask on piazza if you have problems.
You can find all the raw datasets at https://www.census.gov/programs-surveys/sipp/data/datasets.html. Since these datasets can be very big, we uploaded files on Learn that exclude some probably unnecessary variables and only include prime-age workers (age 25-50). You can use those and/or download additional files from the U.S. Census Bureau directly. Each file (wave) contains the 12 months of one year, so we observe the same person roughly 12 times per wave. Here we show you how to append the datasets we put on Learn to each other. That gives you 6 years (72 months) of data. You are free to extend the data further back, but be warned that this is not easy because the structure of the survey changed.
/*==============================================================================
PREPARATION OF SIPP DATA, by AofE Teaching Crew
Description: Append waves of SIPP data and select variables.
Download data from Learn or https://www.census.gov/programs-surveys/sipp.html
==============================================================================*/
* Type in path/folder where the dataset is located
global datapath "."
* Open the file containing 2020 wave 1 data (covering January-December 2019)
* We already pre-selected prime-age workers (keep if tage >= 25 & tage <= 50)
* and dropped unnecessary variables
use $datapath/pu2020_prime, clear
* Keep variables (you won’t need all of them, this is just a selection of potentially useful ones)
* NOTE: If you want to add variables from the full dataset just enter them here
keep eplaydif eddelay tjb1_mwkhrs tjb1_msum esex ems erp spanel ssuid erace tage eeduc rmesr ///
    edisabl efree_lunch edaycare tutils tosavval pnum tjb1_occ tjb1_ind ejb1_scrnr eafnow ///
    monthcode tpearn tmwkhrs rwksperm tage rmwkwjb twkhrs1-twkhrs5 *_union *_cntrc
gen refyear = 2019
lab var refyear "Calendar year"
lab var monthcode "Calendar month"
* Note that ’refyear’ and ’monthcode’ are crucial variables for the analysis as
* they capture the time period all of the other variables refer to
*=======================================
* APPEND ADDITIONAL YEARS *
*=======================================
* Append data from 2019 wave 1 (covering January-December 2018)
append using "$datapath/pu2019_prime", keep(eplaydif eddelay tjb1_mwkhrs tjb1_msum esex ems ///
    erp spanel ssuid erace tage eeduc rmesr edisabl efree_lunch edaycare tutils ///
    tosavval pnum tjb1_occ tjb1_ind ejb1_scrnr eafnow monthcode tpearn tmwkhrs rwksperm ///
    tage rmwkwjb twkhrs1-twkhrs5 *_union *_cntrc)
replace refyear = 2018 if missing(refyear)
* Append data from 2018 wave 1 (covering January-December 2017)
append using "$datapath/pu2018_prime", keep(eplaydif eddelay tjb1_mwkhrs tjb1_msum esex ems ///
    erp spanel ssuid erace tage eeduc rmesr edisabl efree_lunch edaycare tutils ///
    tosavval pnum tjb1_occ tjb1_ind ejb1_scrnr eafnow monthcode tpearn tmwkhrs rwksperm ///
    tage rmwkwjb twkhrs1-twkhrs5 *_union *_cntrc)
replace refyear = 2017 if missing(refyear)
* Append data from 2014 wave 4
append using "$datapath/pu2014w4_v13_prime", keep(eplaydif eddelay tjb1_mwkhrs tjb1_msum esex ems ///
    erp spanel ssuid erace tage eeduc rmesr edisabl efree_lunch edaycare tutils ///
    tosavval pnum tjb1_occ tjb1_ind ejb1_scrnr eafnow monthcode tpearn tmwkhrs rwksperm ///
    tage rmwkwjb twkhrs1-twkhrs5 *_union *_cntrc)
replace refyear = 2016 if missing(refyear)
* Append data from 2014 wave 3
append using "$datapath/pu2014w3_v13_prime", keep(eplaydif eddelay tjb1_mwkhrs tjb1_msum esex ems ///
    erp spanel ssuid erace tage eeduc rmesr edisabl efree_lunch edaycare tutils ///
    tosavval pnum tjb1_occ tjb1_ind ejb1_scrnr eafnow monthcode tpearn tmwkhrs rwksperm ///
    tage rmwkwjb twkhrs1-twkhrs5 *_union *_cntrc)
replace refyear = 2015 if missing(refyear)
* Append data from 2014 wave 2
append using "$datapath/pu2014w2_v13_prime", keep(eplaydif eddelay tjb1_mwkhrs tjb1_msum esex ems ///
    erp spanel ssuid erace tage eeduc rmesr edisabl efree_lunch edaycare tutils ///
    tosavval pnum tjb1_occ tjb1_ind ejb1_scrnr eafnow monthcode tpearn tmwkhrs rwksperm ///
    tage rmwkwjb twkhrs1-twkhrs5 *_union *_cntrc)
replace refyear = 2014 if missing(refyear)
tab monthcode refyear // Tabulates number of observations per reference year and month
* Save data
compress // often saves memory by making the dataset smaller (see help compress)
save $datapath/SIPPdata, replace
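The variable list is repeated in every append call above. As a minimal sketch (not part of data-prep.do), the list could be stored once in a local macro and reused. The example runs on made-up toy data held in a temporary file; the names keepvars, older, and extra are illustrative assumptions, not SIPP names.

```stata
* Toy stand-in for an older wave file (values are made up)
clear
input byte monthcode double tpearn byte extra
1 2400 9
2 2450 9
end
tempfile older
save `older'

* Toy stand-in for the most recent wave
clear
input byte monthcode double tpearn
1 2500
2 2600
end
gen refyear = 2019

* One variable list, reused for every wave that is appended
local keepvars monthcode tpearn
append using `older', keep(`keepvars')

* Appended rows have no refyear yet, so fill in their calendar year
replace refyear = 2018 if missing(refyear)
tab monthcode refyear
```

Because the macro is local, this only works if the whole block is run in one go, for example by executing the complete do-file.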
If you want to add additional variables, a helpful command is lookfor. It searches through the variable names and labels for a search term. For example, you could find all variables that have ’children’ in the label by using lookfor children.
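As a small illustration of this workflow, the sketch below uses nlsw88, an example dataset shipped with Stata, instead of SIPP. The local macro name found is an assumption made for the example.

```stata
* Illustration on Stata's shipped example data (not SIPP)
sysuse nlsw88, clear

* lookfor matches the search term against variable names and labels
lookfor union

* lookfor leaves the matching variables in r(varlist); store them
* straight away because later commands overwrite r()
local found `r(varlist)'

describe `found'        // storage type and label of each match
tab union, missing      // inspect the codes, including missing values
```

The same steps apply to the SIPP files once they are loaded, with the search term replaced by whatever you are looking for.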
Once you have something like our “SIPPdata.dta” that contains monthly data for respondents you can start preparing variables. Here we show you how we might get hourly wages, unionisation status, and unemployment status. We also convert nominal to real wages using the CPI (also available on Learn).
* Hourly earnings (wage)
g wage = tpearn / (tmwkhrs*4*rmwkwjb/rwksperm)
* It’s a survey so we get some weird values, e.g. negative wages
* We’ll just ’fix’ those by setting them to zero
replace wage = 0 if wage < 0
* Similarly, some wages will be unrealistically high. Topcode those at the 95th percentile
su wage, d
replace wage = r(p95) if wage > r(p95) & !missing(wage)
* Merge in the CPI to get real wages
merge m:1 refyear monthcode using $datapath/cpi, keep(match) nogen
* Get the real wage (in 2018 prices); you can use ’rwage’ as a measure for real
* wages for all questions in the project
g rwage = wage / cpi
* Create monthly indicator for union status
egen temp = rowmin(ejb*_union)
g union = temp == 1
* Create monthly indicator for unemployment
g unemployed = rmesr == 5 | rmesr == 6 | rmesr == 7
label var wage "Nominal wage in $"
label var rwage "Real wage in 2018$"
label var union "Member of a union this month"
label var unemployed "Unemployed this month"
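The project questions refer to log real wages, while the code above produces rwage in levels, with some wages set to zero. A minimal sketch of one way to take logs follows, using made-up toy values in place of the SIPP data; the variable name lrwage is an assumption.

```stata
* Toy values standing in for rwage (made up for illustration)
clear
input double rwage
0
12.5
30
.
end

* ln() of zero or of a missing value returns missing in Stata;
* the if-condition only makes that explicit
g lrwage = ln(rwage) if rwage > 0 & !missing(rwage)
label var lrwage "Log real wage (2018$)"

* Count the observations that drop out of any log-wage analysis
count if missing(lrwage)
```

In your own do-file only the g and label lines would be needed. Note that taking logs before aggregating yields the average of log wages rather than the log of the average wage.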
To answer the time series questions you will need to aggregate the individual-level survey data and calculate monthly averages. We show you one way to do this here.
* collapse the dataset to averages by month and year
collapse (mean) union rwage unemployed, by(monthcode refyear)
g monthly_date = ym(refyear,monthcode)
tsset monthly_date
format %tm monthly_date
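Once the data are declared as a time series, tsline can plot them with labelled axes. The sketch below uses a simulated toy series rather than the collapsed SIPP data, so the numbers and titles are assumptions for illustration only.

```stata
* Toy monthly series (simulated, not SIPP)
clear
set obs 24
set seed 12345
g monthly_date = ym(2018, 1) + _n - 1
format %tm monthly_date
tsset monthly_date

* Made-up unionisation rate around 12 percent
g union = 0.12 + 0.01*rnormal()

* ytitle() and ttitle() label the vertical and the time axis
tsline union, ytitle("Share of workers in a union") ttitle("Month") ///
    title("Unionisation rate (toy data)")
```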
To work with the panel data we need to create a unique person id that lets Stata know what the panel unit is. You could do this as follows.
egen id = group(ssuid pnum)
g monthly_date = ym(refyear,monthcode)
xtset id monthly_date
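Before running panel estimators it can help to confirm that each person-month appears only once and to look at the participation patterns. The following is a sketch on made-up toy identifiers, not on the SIPP files.

```stata
* Toy household and person identifiers (made up for illustration)
clear
input str4 ssuid int pnum int refyear byte monthcode
"A001" 101 2018 1
"A001" 101 2018 2
"A001" 102 2018 1
"B002" 101 2018 1
"B002" 101 2018 2
end

egen id = group(ssuid pnum)
g monthly_date = ym(refyear, monthcode)
format %tm monthly_date

* isid stops with an error if a person-month appears more than once
isid id monthly_date

xtset id monthly_date
xtdescribe              // participation patterns across months
```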
Finally, adding additional variables or determining what the codes correspond to can be a bit tricky. We show you an example of how to generate a dummy for ’married’ here. First we need to find any variable that has ’married’ in the label:
lookfor married
>               storage   display    value
> variable name   type    format     label      variable label
> ---------------------------------------------------------------------------
> ems             byte    %12.0g                Is ... currently married, ...
tab ems
>      Is ... |
>   currently |
>    married, |
>    widowed, |
>   divorced, |
>  separated, |
>    or never |
>    married? |
> ------------+-----------------------------------
>           1 |
>           2 |
>           3 |
>           4 |
>           5 |
>           6 |
> ------------+-----------------------------------
>       Total |    928,704      100.00
Then we need to find out what ’1’, ’2’ etc. correspond to. You can find this in the ’Metadata’ pdf file that is available on the Census Bureau SIPP homepage. Here’s the entry for ’ems’:
Now we are ready to label the ems values and create a dummy for ’married’.
label define ems 1 "1. Married spouse present" 2 "2. Married spouse absent" 3 "3. Widowed" ///
    4 "4. Divorced" 5 "5. Separated" 6 "6. Never married"
label values ems ems
tab ems
> Is ... currently married, |
>        widowed, divorced, |
>       separated, or never |
>                  married? |
> --------------------------+-----------------------------------
> 1. Married spouse present |
>  2. Married spouse absent |
>                3. Widowed |
>               4. Divorced |
>              5. Separated |
>          6. Never married |
> --------------------------+-----------------------------------
>                     Total |    928,704      100.00
g married = ems == 1 | ems == 2
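The line above codes married as 0 whenever ems is missing, because a missing value is equal to neither 1 nor 2. If you would rather keep such cases missing, one possible variant is sketched below on made-up toy values; the value label name yesno is an assumption.

```stata
* Toy values standing in for ems (made up for illustration)
clear
input byte ems
1
2
4
6
.
end

* Only define the dummy where marital status is observed
g married = (ems == 1 | ems == 2) if !missing(ems)
label define yesno 0 "No" 1 "Yes"
label values married yesno

* Cross-check the dummy against the original codes
tab ems married, missing
```

Whether missing marital status should count as ’not married’ or stay missing is a judgment call that depends on how many such cases your sample contains.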
2022-03-16