ECON6300/7320 Advanced Microeconometrics Tutorial 1: Introduction to R
Learning Objectives
At the end of this tutorial you should be able to
● use R to read, manipulate, and save data,
● use R to compute summary statistics,
● use R to run linear regression,
● use R to conduct basic matrix operations,
● define your own R functions and implement optimization algorithms.
References
● Peng, R.D., 2022. R Programming for Data Science. [P]
● Hansen, E.B. 2022. Econometrics. Princeton University Press. [H]
● Cameron, A.C., and P.K. Trivedi, 2005. Microeconometrics: Methods and Applications. Cambridge University Press. [CT]
Current Population Survey Data
The Current Population Survey (CPS) is a monthly survey of about 57,000 U.S. households conducted by the Bureau of the Census for the Bureau of Labor Statistics. The CPS is the primary source of information on the labor force characteristics of the U.S. population. The survey covers employment, earnings, educational attainment, income, poverty, health insurance coverage, job experience, voting and registration, computer
usage, veteran status, and other variables. Details can be found at https://www.census.gov/programs-surveys/cps.html.
Here we use a subset of the March 2009 CPS survey data in which all individuals were employed full-time. This data set has 50,742 observations and 12 variables (see the attached data description for more detailed information, e.g., variable definitions). We first load the CSV data set into R using read_csv(), and then
1. keep only female observations;
2. rename two variables (earnings, education) as (inc, educ);
3. generate two new variables, log(inc) and age2;
4. drop two variables, female and inc;
5. keep (and order) six variables for regression analysis, (loginc, age, age2, educ, hisp, union); and
6. drop all observations with any missing values in these variables.
Table 1: Summary Statistics

Variable      N     Mean  Std. Dev.    Min  Pctl. 25  Pctl. 75     Max
loginc    21602   10.498      0.626  1.386    10.127    10.859  13.054
age       21602   42.244     11.453     15        33        51      80
educ      21602   14.066      2.604      0        12        16      20
hisp      21602    0.139      0.346      0         0         0       1
union     21602    0.020      0.140      0         0         0       1

Note that %>% is the pipe operator, which conveniently strings together multiple dplyr (part of the tidyverse package) functions into a sequence of operations. Basically, it passes the output of the previous operation directly to the next operation as input, so you no longer need to define a temporary variable at each step to store intermediate results. This is very useful when dealing with complex data, as it saves you from defining a large number of “single-use” variables and at the same time makes your code more readable. See [P] Section 13.11 for more illustrative examples.
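The effect of the pipe is easy to see on a toy example (the vector below is made up for illustration and is not part of the tutorial's data set): nested calls must be read inside-out, while the piped version reads left to right.

```r
library(dplyr)  # provides the %>% pipe (re-exported from magrittr)

x <- c(1, 4, 9)

# Nested calls read inside-out:
round(mean(sqrt(x)), 2)            # returns 2

# The same computation with the pipe reads left to right:
x %>% sqrt() %>% mean() %>% round(2)   # also returns 2
```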
Finally, we use the attach() function to attach our working data to the R search path; that is, R will search this data set when you call a variable in subsequent analysis. Otherwise, you would need to use the $ operator to specify the data set a variable belongs to (e.g., use CPS09mar$age to call age in this example).
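A toy illustration of the difference (df is a made-up data frame, not the CPS data); note that detach() undoes attach(), and with() is a safer one-off alternative:

```r
df <- data.frame(age = c(25, 40, 55))

mean(df$age)         # explicit: $ points R to the data frame
with(df, mean(age))  # one-off: evaluate the expression inside df

attach(df)           # put df on the search path ...
mean(age)            # ... so bare variable names now work
detach(df)           # remove df again to avoid masking surprises
```

All three calls return the same mean, 40.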
library(tidyverse)
rm(list = ls())
setwd(getwd())
CPS09mar = read_csv("cps09mar.csv") %>%
  filter(female == 1) %>%
  rename(inc = earnings, educ = education) %>%
  mutate(loginc = log(inc), age2 = age^2) %>%
  select(-female, -inc) %>%
  select(loginc, age, age2, educ, hisp, union) %>%
  drop_na(loginc, age, age2, educ, hisp, union)
attach(CPS09mar)
We use the sumtable() function in the vtable package to generate some summary statistics.

library(vtable)
sumtable(CPS09mar, vars = c("loginc", "age", "educ", "hisp", "union"))
Linear Regression and OLS Estimator
It is well known (from, e.g., ECON2300/7310) that one can use R's built-in function lm() to carry out linear regression and calculate the OLS estimator. Here we apply it to the following model:

loginc = β0 + β1 age + β2 age2 + β3 educ + β4 hisp + β5 union + u.
ols.lm = lm(loginc ~ age + age2 + educ + hisp + union)
summary(ols.lm)
##
## Call:
## lm(formula = loginc ~ age + age2 + educ + hisp + union)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.3643 -0.2869  0.0028  0.3032  2.7521
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  7.629e+00  4.757e-02 160.373  < 2e-16 ***
## age          5.407e-02  2.116e-03  25.560  < 2e-16 ***
## age2        -5.301e-04  2.449e-05 -21.642  < 2e-16 ***
## educ         1.142e-01  1.435e-03  79.546  < 2e-16 ***
## hisp        -4.850e-02  1.083e-02  -4.476 7.64e-06 ***
## union        2.116e-02  2.577e-02   0.821    0.412
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5312 on 21596 degrees of freedom
## Multiple R-squared:  0.2807, Adjusted R-squared:  0.2805
## F-statistic:  1685 on 5 and 21596 DF,  p-value: < 2.2e-16
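Beyond summary(), a fitted lm object stores components you can extract programmatically. A small sketch with simulated data (the variables x and y below are made up for illustration, not the CPS variables):

```r
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)     # true intercept 1, true slope 2
fit <- lm(y ~ x)

coef(fit)                     # coefficient estimates
sqrt(diag(vcov(fit)))         # standard errors from the covariance matrix
confint(fit, level = 0.95)    # 95% confidence intervals
summary(fit)$r.squared        # R-squared
head(residuals(fit), 3)       # first three residuals
```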
Computing OLS Estimates using Matrix Operations
The OLS estimator has a closed-form formula that is most conveniently expressed in matrix notation. Let yi := loginci, xi := (1, agei, age2i, educi, hispi, unioni)^T, and β := (β0, β1, β2, β3, β4, β5)^T. Then define

Y := (y1, y2, . . . , yn)^T,  X := (x1, x2, . . . , xn)^T,  u := (u1, u2, . . . , un)^T,

and write the model as

Y = Xβ + u.

The OLS estimator can then be expressed as

β̂OLS = (X^T X)^(-1) X^T Y.

We use this result and R's matrix operators to replicate the regression output obtained using lm(). The calculation of standard errors uses formula (4.42) in [H] ((4.22) in [CT]). The calculation involves matrix multiplication %*%, transpose t(), matrix inverse solve(), and extraction of the diagonal elements of a matrix diag(). We use the as.matrix() function to turn the data object into a matrix. Note that most R operations (e.g., addition +, subtraction -, multiplication *, and division /) are element-wise, or "vectorized",
meaning that operations occur in parallel across the elements of R objects.
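A quick sketch of this distinction on a small made-up matrix: * acts element by element, while %*% performs true matrix multiplication.

```r
A <- matrix(c(1, 2, 3, 4), nrow = 2)  # 2x2 matrix, filled column-wise

A * A               # element-wise: squares each entry
A %*% A             # matrix product: rows of A times columns of A
t(A)                # transpose
Ainv <- solve(A)    # matrix inverse (det(A) = -2, so it exists)
diag(A %*% Ainv)    # diagonal of A %*% A^(-1): both entries equal 1
```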
X = as.matrix(cbind(1, CPS09mar[,-1]))
Y = as.matrix(CPS09mar[,1])
b.mat = solve(t(X)%*%X)%*%(t(X)%*%Y)
e = Y - X%*%b.mat
s2 = sum(e^2)/(nrow(X)-ncol(X))
SE.mat = sqrt(diag(s2*solve(t(X)%*%X)))
ols.mat = cbind(b.mat, SE.mat, b.mat/SE.mat, 2*(1-pnorm(abs(b.mat/SE.mat))))
colnames(ols.mat) = c("Estimate", "Std. Error", "t value", "Pr(>|t|)")
ols.mat
##            Estimate   Std. Error     t value     Pr(>|t|)
## 1      7.6290797451 4.757082e-02 160.3730816 0.000000e+00
## age    0.0540733628 2.115578e-03  25.5596203 0.000000e+00
## age2  -0.0005300467 2.449176e-05 -21.6418409 0.000000e+00
## educ   0.1141867419 1.435484e-03  79.5458050 0.000000e+00
## hisp  -0.0484977354 1.083480e-02  -4.4761077 7.601616e-06
## union  0.0211579196 2.577114e-02   0.8209929 4.116503e-01
We can also replicate the R-squared by calculating the ratio of the sample variances of Ŷ and Y.

Y.hat = Y - e
R.squared = var(Y.hat)/var(Y)
as.numeric(R.squared)

## [1] 0.2806581
Computing OLS Estimates via Minimizing Sum of Squared Residuals
As its name (ordinary least squares) suggests, the OLS estimator can also be obtained by minimizing the sum of squared residuals (SSR), i.e.,

β̂OLS = argmin_b Σ_{i=1}^n (yi − xi^T b)^2.

To do this, we first need to define the objective function for the minimization procedure:

Q(b) := Σ_{i=1}^n (yi − xi^T b)^2.
Note that, like almost all objective functions used in statistics/econometrics, Q(b) essentially has two arguments: the unknown parameter (to be estimated) and the observed data. Since the optimization is with respect to b, we treat b as a vector of variables and the data as fixed values; hence, for given data, the value of the objective function Q depends only on b.1
OLS.obj = function(b, data) {
  Y = as.matrix(data[,1]); X = as.matrix(cbind(1, data[,2:ncol(data)]))
  return(sum((Y - X%*%b)^2))
}
We then apply a quasi-Newton algorithm ("BFGS") to search for the minimum of Q(b). To implement the algorithm, we need to choose an initial value. For linear regression, this choice has no effect on the results, as Q(b) is convex and differentiable; in such cases, starting from any initial value, Newton-type algorithms can always find the global minimum. In this example, we simply draw a random vector from the uniform distribution with support (−10, 10).
initial = runif(ncol(CPS09mar), -10, 10)
ols.optim = optim(par = initial, fn = OLS.obj, method = c("BFGS"), data = CPS09mar)
ols.optim
## $par
## [1]  7.6290797450  0.0540733628 -0.0005300467  0.1141867419 -0.0484977354
## [6]  0.0211579196
##
## $value
## [1] 6093.074
##
## $counts
## function gradient
##       77       11
##
## $convergence
## [1] 0
##
## $message
## NULL
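By default, BFGS approximates the gradient of Q(b) numerically. Since the gradient here has the closed form ∇Q(b) = −2 X^T (Y − Xb), it can also be supplied through optim()'s gr argument, which typically reduces the number of function evaluations. A self-contained sketch with simulated data (not the CPS sample; all object names below are made up):

```r
set.seed(42)
n <- 200
X <- cbind(1, rnorm(n))            # design matrix with intercept
Y <- X %*% c(1, 2) + rnorm(n)      # true coefficients (1, 2)

ssr      <- function(b) sum((Y - X %*% b)^2)           # Q(b)
ssr.grad <- function(b) as.vector(-2 * t(X) %*% (Y - X %*% b))  # analytic gradient

fit   <- optim(par = c(0, 0), fn = ssr, gr = ssr.grad, method = "BFGS")
b.ols <- solve(t(X) %*% X) %*% (t(X) %*% Y)  # closed-form benchmark
cbind(fit$par, b.ols)   # the two columns agree to numerical precision
```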
In fact, OLS is perhaps the simplest special case of so-called “M-estimation” (the “M” stands for maximization/minimization). As will become clear over the rest of this semester, all M-estimations proceed with conceptually the same procedure. Of course, objective functions are often not as well behaved as Q(b) here, so the optimization may involve more advanced algorithms. See https://cran.r-project.org/web/views/Optimization.html for a comprehensive review of optimization algorithms available in R.
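As one illustration of the same recipe with a less smooth objective, consider the least absolute deviations (LAD) estimator, which minimizes Σ|yi − xi^T b|. Because this objective is not differentiable everywhere, the sketch below (simulated data, made-up names) uses optim()'s default derivative-free Nelder-Mead method rather than BFGS:

```r
set.seed(7)
n <- 500
x <- rnorm(n)
y <- 1 + 2 * x + rt(n, df = 2)   # heavy-tailed errors favour LAD over OLS

# LAD objective: sum of absolute (not squared) residuals
LAD.obj <- function(b) sum(abs(y - b[1] - b[2] * x))

lad.fit <- optim(par = c(0, 0), fn = LAD.obj)  # default method: Nelder-Mead
lad.fit$par   # should be close to the true values (1, 2)
```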
Finally, we put the OLS estimates obtained using the three methods side by side to verify that they do give the same results.
ols = cbind(ols.lm$coefficients, b.mat, ols.optim$par)
colnames(ols) = c("ols.lm", "ols.matrix", "ols.optim")
ols
##                    ols.lm    ols.matrix     ols.optim
## (Intercept)  7.6290797450  7.6290797451  7.6290797450
## age          0.0540733628  0.0540733628  0.0540733628
## age2        -0.0005300467 -0.0005300467 -0.0005300467
## educ         0.1141867419  0.1141867419  0.1141867419
## hisp        -0.0484977354 -0.0484977354 -0.0484977354
## union        0.0211579196  0.0211579196  0.0211579196
2023-02-28