ECON6300/7320 Advanced Microeconometrics Tutorial 1: Introduction to R
Learning Objectives
At the end of this tutorial you should be able to
● use R to read, manipulate, and save data,
● use R to compute summary statistics,
● use R to run linear regression,
● use R to conduct basic matrix operations,
● define your own R functions and implement optimization algorithms.
References
● Peng, R.D., 2022. R Programming for Data Science. [P]
● Hansen, E.B. 2022. Econometrics. Princeton University Press. [H]
● Cameron, A.C., and P.K. Trivedi, 2005. Microeconometrics: Methods and Applications. Cambridge University Press. [CT]
Current Population Survey Data
The Current Population Survey (CPS) is a monthly survey of about 57,000 U.S. households conducted by the Bureau of the Census for the Bureau of Labor Statistics. The CPS is the primary source of information on the labor force characteristics of the U.S. population. The survey covers employment, earnings, educational attainment, income, poverty, health insurance coverage, job experience, voting and registration, computer
usage, veteran status, and other variables. Details can be found at https://www.census.gov/programs-surveys/cps.html.
Here we use a subset of the March 2009 CPS survey data in which all individuals were employed full-time. This data set has 50,742 observations and 12 variables (see the attached data description for more detailed information, e.g., variable definitions). We first load the CSV data set into R using read_csv(), and then
1. keep only female observations;
2. rename two variables (earnings, education) as (inc, educ);
3. generate two new variables, log(inc) and age2;
4. drop two variables, female and inc;
5. keep (and order) six variables for regression analysis, (loginc, age, age2, educ, hisp, union); and
6. drop all observations with any missing values in these variables.
Table 1: Summary Statistics

Variable      N     Mean  Std. Dev.    Min  Pctl. 25  Pctl. 75     Max
loginc    21602   10.498      0.626  1.386    10.127    10.859  13.054
age       21602   42.244     11.453     15        33        51      80
educ      21602   14.066      2.604      0        12        16      20
hisp      21602    0.139      0.346      0         0         0       1
union     21602    0.020      0.140      0         0         0       1

Note that %>% is the pipe operator, which conveniently strings together multiple dplyr (part of the tidyverse package) functions into a sequence of operations. Basically, it passes the output of the previous operation directly to the next operation as input, so you no longer need to define a temporary variable at each step to store intermediate results. This is very useful when dealing with complex data, as it saves you from defining a large number of “single-use” variables and at the same time makes your code more readable. See [P] Section 13.11 for more illustrative examples.
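The effect of the pipe is easy to see on a toy example (the vector below is made up for illustration and is not part of the tutorial's data set): nested calls must be read inside-out, while the piped version reads left to right.

```r
library(dplyr)  # provides the %>% pipe (re-exported from magrittr)

x <- c(1, 4, 9)

# Nested calls read inside-out:
round(mean(sqrt(x)), 2)            # returns 2

# The same computation with the pipe reads left to right:
x %>% sqrt() %>% mean() %>% round(2)   # also returns 2
```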
Finally, we use the attach() function to attach our working data to the R search path; that is, R will search this data set when you call a variable in subsequent analysis. Otherwise, you would need to use the $ operator to specify the data set a variable belongs to (e.g., use CPS09mar$age to call age in this example).
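A toy illustration of the difference (df is a made-up data frame, not the CPS data); note that detach() undoes attach(), and with() is a safer one-off alternative:

```r
df <- data.frame(age = c(25, 40, 55))

mean(df$age)         # explicit: $ points R to the data frame
with(df, mean(age))  # one-off: evaluate the expression inside df

attach(df)           # put df on the search path ...
mean(age)            # ... so bare variable names now work
detach(df)           # remove df again to avoid masking surprises
```

All three calls return the same mean, 40.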
library(tidyverse)
rm(list = ls())
setwd(getwd())
CPS09mar = read_csv("cps09mar.csv") %>%
  filter(female == 1) %>%
  rename(inc = earnings, educ = education) %>%
  mutate(loginc = log(inc), age2 = age^2) %>%
  select(-female, -inc) %>%
  select(loginc, age, age2, educ, hisp, union) %>%
  drop_na(loginc, age, age2, educ, hisp, union)
attach(CPS09mar)
We use the sumtable() function in the vtable package to generate some summary statistics.

library(vtable)
sumtable(CPS09mar, vars = c("loginc", "age", "educ", "hisp", "union"))
Linear Regression and OLS Estimator
It is well known (from, e.g., ECON2300/7310) that one can use R's built-in function lm() to carry out linear regression and calculate the OLS estimator. Here we apply it to the following model:

loginc = β0 + β1 age + β2 age2 + β3 educ + β4 hisp + β5 union + u.
ols.lm = lm(loginc ~ age + age2 + educ + hisp + union)
summary(ols.lm)
##
## Call:
## lm(formula = loginc ~ age + age2 + educ + hisp + union)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.3643 -0.2869  0.0028  0.3032  2.7521
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  7.629e+00  4.757e-02 160.373  < 2e-16 ***
## age          5.407e-02  2.116e-03  25.560  < 2e-16 ***
## age2        -5.301e-04  2.449e-05 -21.642  < 2e-16 ***
## educ         1.142e-01  1.435e-03  79.546  < 2e-16 ***
## hisp        -4.850e-02  1.083e-02  -4.476 7.64e-06 ***
## union        2.116e-02  2.577e-02   0.821    0.412
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5312 on 21596 degrees of freedom
## Multiple R-squared:  0.2807, Adjusted R-squared:  0.2805
## F-statistic:  1685 on 5 and 21596 DF,  p-value: < 2.2e-16
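Beyond summary(), a fitted lm object stores components you can extract programmatically. A small sketch with simulated data (the variables x and y below are made up for illustration, not the CPS variables):

```r
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)     # true intercept 1, true slope 2
fit <- lm(y ~ x)

coef(fit)                     # coefficient estimates
sqrt(diag(vcov(fit)))         # standard errors from the covariance matrix
confint(fit, level = 0.95)    # 95% confidence intervals
summary(fit)$r.squared        # R-squared
head(residuals(fit), 3)       # first three residuals
```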
Computing OLS Estimates using Matrix Operations
The OLS estimator has a closed-form formula that is most conveniently expressed in matrix notation. Let yi := loginci, xi := (1, agei, age2i, educi, hispi, unioni)^T, and β := (β0, β1, β2, β3, β4, β5)^T. Then define

Y := (y1, y2, . . . , yn)^T,  X := (x1, x2, . . . , xn)^T,  u := (u1, u2, . . . , un)^T,

and write the model as

Y = Xβ + u.

The OLS estimator can then be expressed as

β̂OLS = (X^T X)^(-1) X^T Y.

We use this result and R's matrix operators to replicate the regression output obtained using lm(). The calculation of standard errors uses formula (4.42) in [H] ((4.22) in [CT]). The calculation involves matrix multiplication %*%, transpose t(), matrix inverse solve(), and extraction of the diagonal elements of a matrix diag(). We use the as.matrix() function to turn the data object into a matrix. Note that most R operations (e.g., addition +, subtraction -, multiplication *, and division /) are element-wise, or "vectorized",
meaning that operations occur in parallel across the elements of R objects.
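A quick sketch of this distinction on a small made-up matrix: * acts element by element, while %*% performs true matrix multiplication.

```r
A <- matrix(c(1, 2, 3, 4), nrow = 2)  # 2x2 matrix, filled column-wise

A * A               # element-wise: squares each entry
A %*% A             # matrix product: rows of A times columns of A
t(A)                # transpose
Ainv <- solve(A)    # matrix inverse (det(A) = -2, so it exists)
diag(A %*% Ainv)    # diagonal of A %*% A^(-1): both entries equal 1
```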
X = as.matrix(cbind(1, CPS09mar[,-1]))
Y = as.matrix(CPS09mar[,1])
b.mat = solve(t(X)%*%X)%*%(t(X)%*%Y)
e = Y - X%*%b.mat
s2 = sum(e^2)/(nrow(X)-ncol(X))
SE.mat = sqrt(diag(s2*solve(t(X)%*%X)))
ols.mat = cbind(b.mat, SE.mat, b.mat/SE.mat, 2*(1-pnorm(abs(b.mat/SE.mat))))
colnames(ols.mat) = c("Estimate", "Std. Error", "t value", "Pr(>|t|)")
ols.mat
##            Estimate   Std. Error     t value     Pr(>|t|)
## 1      7.6290797451 4.757082e-02 160.3730816 0.000000e+00
## age    0.0540733628 2.115578e-03  25.5596203 0.000000e+00
## age2  -0.0005300467 2.449176e-05 -21.6418409 0.000000e+00
## educ   0.1141867419 1.435484e-03  79.5458050 0.000000e+00
## hisp  -0.0484977354 1.083480e-02  -4.4761077 7.601616e-06
## union  0.0211579196 2.577114e-02   0.8209929 4.116503e-01
We can also replicate the R-squared by calculating the ratio of the sample variances of Ŷ and Y.

Y.hat = Y - e
R.squared = var(Y.hat)/var(Y)
as.numeric(R.squared)

## [1] 0.2806581
Computing OLS Estimates via Minimizing Sum of Squared Residuals
As its name (ordinary least squares) suggests, the OLS estimator can also be obtained by minimizing the sum of squared residuals (SSR), i.e.,

β̂OLS = argmin_b Σ_{i=1}^n (yi − xi^T b)^2.

To do this, we first need to define the objective function for the minimization procedure:

Q(b) := Σ_{i=1}^n (yi − xi^T b)^2.
Note that, like almost all objective functions used in statistics/econometrics, Q(b) essentially has two arguments: the unknown parameter (to be estimated) and the observed data. Since the optimization is with respect to b, we treat b as a vector of variables and the data as fixed values; hence, for given data, the value of the objective function Q depends only on b.1
OLS.obj = function(b, data) {
  Y = as.matrix(data[,1]); X = as.matrix(cbind(1, data[,2:ncol(data)]))
  return(sum((Y - X%*%b)^2))
}
We then apply a quasi-Newton algorithm ("BFGS") to search for the minimum of Q(b). To implement the algorithm, we need to choose an initial value. For linear regression, this choice has no effect on the results, as Q(b) is convex and differentiable; in such cases, starting from any initial value, Newton-type algorithms can always find the global minimum. In this example, we simply draw a random vector from the uniform distribution with support (−10, 10).
initial = runif(ncol(CPS09mar), -10, 10)
ols.optim = optim(par = initial, fn = OLS.obj, method = c("BFGS"), data = CPS09mar)
ols.optim
## $par
## [1]  7.6290797450  0.0540733628 -0.0005300467  0.1141867419 -0.0484977354
## [6]  0.0211579196
##
## $value
## [1] 6093.074
##
## $counts
## function gradient
##       77       11
##
## $convergence
## [1] 0
##
## $message
## NULL
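By default, BFGS approximates the gradient of Q(b) numerically. Since the gradient here has the closed form ∇Q(b) = −2 X^T (Y − Xb), it can also be supplied through optim()'s gr argument, which typically reduces the number of function evaluations. A self-contained sketch with simulated data (not the CPS sample; all object names below are made up):

```r
set.seed(42)
n <- 200
X <- cbind(1, rnorm(n))            # design matrix with intercept
Y <- X %*% c(1, 2) + rnorm(n)      # true coefficients (1, 2)

ssr      <- function(b) sum((Y - X %*% b)^2)           # Q(b)
ssr.grad <- function(b) as.vector(-2 * t(X) %*% (Y - X %*% b))  # analytic gradient

fit   <- optim(par = c(0, 0), fn = ssr, gr = ssr.grad, method = "BFGS")
b.ols <- solve(t(X) %*% X) %*% (t(X) %*% Y)  # closed-form benchmark
cbind(fit$par, b.ols)   # the two columns agree to numerical precision
```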
In fact, OLS is perhaps the simplest special case of so-called “M-estimation” (the “M” stands for maximization/minimization). As will become clear over the rest of this semester, all M-estimations proceed with conceptually the same procedure. Of course, objective functions are often not as well behaved as Q(b) here, so the optimization may involve more advanced algorithms. See https://cran.r-project.org/web/views/Optimization.html for a comprehensive review of optimization algorithms available in R.
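As one illustration of the same recipe with a less smooth objective, consider the least absolute deviations (LAD) estimator, which minimizes Σ|yi − xi^T b|. Because this objective is not differentiable everywhere, the sketch below (simulated data, made-up names) uses optim()'s default derivative-free Nelder-Mead method rather than BFGS:

```r
set.seed(7)
n <- 500
x <- rnorm(n)
y <- 1 + 2 * x + rt(n, df = 2)   # heavy-tailed errors favour LAD over OLS

# LAD objective: sum of absolute (not squared) residuals
LAD.obj <- function(b) sum(abs(y - b[1] - b[2] * x))

lad.fit <- optim(par = c(0, 0), fn = LAD.obj)  # default method: Nelder-Mead
lad.fit$par   # should be close to the true values (1, 2)
```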
Finally, we put the OLS estimates obtained using the three methods side by side to verify that they do give the same results.
ols = cbind(ols.lm$coefficients, b.mat, ols.optim$par)
colnames(ols) = c("ols.lm", "ols.matrix", "ols.optim")
ols
##                    ols.lm    ols.matrix     ols.optim
## (Intercept)  7.6290797450  7.6290797451  7.6290797450
## age          0.0540733628  0.0540733628  0.0540733628
## age2        -0.0005300467 -0.0005300467 -0.0005300467
## educ         0.1141867419  0.1141867419  0.1141867419
## hisp        -0.0484977354 -0.0484977354 -0.0484977354
## union        0.0211579196  0.0211579196  0.0211579196
2023-02-28