
ECON 178 S122:

Final Project Guidelines

Overview of the data

The data is from the 1991 Survey of Income and Program Participation (SIPP). You are provided with 7933 observations.

The sample contains household data in which the reference person is 25-64 years old, at least one person is employed, and no one is self-employed. The unit of observation is the household reference person.

The data set contains a number of feature variables that you can choose from to predict total wealth. The outcome variable (total wealth) and feature variables are described in the next slide.

Dataframe with the following variables

Variable to predict (outcome variable):

tw: total wealth (in US $).

•  Total wealth equals net financial assets, including Individual Retirement Account (IRA) and 401(k) assets, plus housing equity plus the value of business, property, and motor vehicles.

Variables related to retirement (features):

ira: individual retirement account (IRA)  (in US $).

e401: 1 if eligible for 401(k), 0 otherwise

Financial variables (features):

nifa: non-401k financial assets (in US $).

inc: income (in US $).

Variables related to home ownership (features):

hmort: home mortgage (in US $).

hval: home value (in US $).

hequity: home value minus home mortgage.

Other covariates (features):

educ: education (in years).

male: 1 if male, 0 otherwise.

twoearn: 1 if two earners in the household, 0 otherwise.

nohs, hs, smcol, col: dummies for education: no high school, high school, some college, college.

age: age.

fsize: family size.

marr: 1 if married, 0 otherwise.

What is 401k and IRA?

•   Both 401(k) and IRA are tax-deferred savings options that aim to increase individual saving for retirement

•   The 401(k) plan:

•   a company-sponsored retirement account where employees can contribute

•   employers can match a certain % of an employee’s contribution

•   401(k) plans are offered by employers -- only employees in companies offering such plans can participate

•   The feature variable e401 contains information on the eligibility

•   IRA accounts:

•   Individuals can participate

•   No employer matching

•   The feature variable ira contains the value of the IRA account (in US $)

Your tasks

●    Build a prediction/fitted model to predict total wealth (tw) in US dollars

●    Write up a paper of up to 20 pages (not including the code), in 11-point font with 1.5 line spacing

○ Introduction

■ Briefly state the objectives of the study

○ Statistical analyses

■ Describe how you apply the tools you have learned from this course to perform the prediction task

■ You should try different methods and compare their prediction performance and interpretability

○ Conclusions

■ Summarize what you have discovered from this project

■ (Optional) Discuss caveats to the conclusions drawn from your analyses

●    Bonus points

o We kept 20% of the sample on which we are going to run your proposed model and method. We will rank the students by accuracy of the prediction on that 20% of the sample.

●    The project is due on July 29 (by 5:00pm PST). Please submit your paper and code according to the instructions. Late assignments will NOT be accepted except with my prior consent regarding unusual circumstances permitted by University policies (proper documentation will be needed)

Grading policy

• First, please follow the policy on academic integrity stated in the syllabus:

You are not allowed to work together with others on the final project and the bonus opportunity; you are not allowed to get any help (including but not limited to program code) from others (except the instructor and the TA) on the final project and the bonus opportunity.

•    We will use tools to catch any form of plagiarism and cheating. Penalties for cheating include, among others, a failing grade for the course. In addition, the Council of Deans of Student Affairs will impose a disciplinary penalty.

Every student in ECON178 must read, understand, agree and sign the integrity pledge (https://academicintegrity.ucsd.edu/forms/form-pledge.html) before completing any assignment for ECON178. After you sign the pledge form, a receipt will be emailed to you. Please include this receipt in the submission of your assignment.

• Second, the maximum number of points (without the bonus points) you can get for the project is 40. Your project grade counts for 55% of your course grade. Slide 7 provides a breakdown of the points and how your project is graded.

• Third, there is a maximum of 40 bonus points, awarded on the basis of how good your out-of-sample prediction is. The best prediction receives 40 points. The second best prediction receives fewer than 40 points, and so on. The bonus points you earn count for 5% of your course grade.

• Fourth, the bonus points can only benefit your final grade. We will curve the grades without the bonus points first. For example, if you are in the A bracket, you will stay in the A bracket even if you get zero bonus points. On the other hand, if you are in the A- bracket but you get enough bonus points to move your final grade to the A bracket, then you will get an A in the end.

• Fifth, it is entirely possible to get the maximum points on the project but zero bonus points. After all, luck may be needed to get a high enough accuracy on the out-of-sample prediction. But as explained above, you will never be penalized for not having luck. Having said this, we still expect that harder work is more likely to lead to higher bonus points. So, you should put in your best effort.

Grading

Analysis (50% of total points)

•  0-10 points: Analysis is overly simplistic or inappropriate; little or no justification for choices of analyses is provided

•  10-30 points: Analysis is appropriate; some justification for choices of analyses is provided

•  30-40 points: Analysis is appropriate and informative; detailed justification for choices of analyses is provided

Results (25% of total points)

•  0-10 points: Conclusions are missing, incorrect, or not drawn from analysis; plots or tables are inappropriate

•  10-30 points: Conclusions are sensible and drawn from analysis; plots or tables are appropriate

•  30-40 points: Conclusions are not only drawn from analysis but also insightful; plots or tables are nicely presented and facilitate conveying the information

Code (15% of total points)

•  0-10 points: Code doesn't run, or code output does not match the results described in the paper

•  10-30 points: Code runs and code output mostly matches the results described in the paper

•  30-40 points: Code runs and code output matches the results described in the paper; code is neat and easy to read; no irrelevant code

Paper writing (10% of total points)

•  0-10 points: Writing is poor, illogical, or incoherent

•  10-30 points: Writing is mostly logical and coherent

•  30-40 points: Writing is crystal clear, logical, and coherent

Note: The TA will give a couple of examples in your discussion section of what we mean by “giving justification for choices of analyses”

How to carry out this project?

Data can be found on Canvas

Download the data and save it in your working directory

To load the data into R, use the code:

data_tr <- read.table("data_tr.txt", header = TRUE, sep = "\t", dec = ".")[,-1]

•    Inspecting your data and preliminary analyses

•    Dependent variable (Y): tw: total wealth (in US $)

•    Predictors (X): your choice (but please make sensible choices)

•    Some suggestions: use scatter plots and/or simple linear regressions with OLS to visualize basic relationships between total wealth and various predictors

•    In-depth analyses

• What could be the X variables in your prediction exercise?

• What methods should you use? (OLS, Ridge, Stepwise selections, Lasso)

• How do you select the best prediction/fitted model? (K-fold cross-validation, leave-one-out)
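As a rough illustration of the preliminary-analysis suggestions above, the sketch below draws one scatter plot and fits one simple OLS regression. It uses a small simulated data frame (the name `sim` and its data-generating process are assumptions made only so the example runs on its own); with the real data you would use `data_tr` as loaded above.

```r
# Simulated stand-in for data_tr (hypothetical values, for illustration only)
set.seed(178)
n <- 200
sim <- data.frame(
  inc = runif(n, 10000, 100000),            # hypothetical incomes in US $
  age = sample(25:64, n, replace = TRUE)    # ages in the sample's 25-64 range
)
# Hypothetical relationship between total wealth and the predictors
sim$tw <- 0.8 * sim$inc + 500 * sim$age + rnorm(n, sd = 20000)

# Summary statistics for every column
summary(sim)

# Scatter plot of the outcome against one predictor
plot(sim$inc, sim$tw, xlab = "inc (US $)", ylab = "tw (US $)")

# Simple linear regression of tw on inc, with the fitted line overlaid
fit_inc <- lm(tw ~ inc, data = sim)
abline(fit_inc, col = "red")
coef(fit_inc)
```

Repeating the plot for each candidate predictor is one way to spot relationships that are clearly nonlinear before choosing transformations.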

What could be the X variables in your prediction exercise?

●The plain predictors listed on Slide 3

Watch out for perfect collinearity: you do not want to include predictors that are perfectly collinear.

■ For example, you don’t want to include hmort (home mortgage), hval (home value), and hequity (home value minus home mortgage) all at the same time, because hequity = hval - hmort. One solution is to drop hequity from your models

■ As another example of perfect collinearity, say you include the intercept term (a column of “1”s) and all four dummy variables nohs, hs, smcol, col (no high school, high school, some college, college). Note that nohs + hs + smcol + col = a column of 1s (the intercept). One solution is to drop one of the education dummies from your models
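The effect of perfect collinearity can be seen directly in R. The sketch below simulates home values and mortgages (the numbers are made up for illustration, not taken from the SIPP data) and shows that `lm()` returns `NA` for the redundant coefficient when all three variables are included.

```r
set.seed(1)
n <- 100
hval    <- runif(n, 50000, 300000)     # simulated home values
hmort   <- runif(n, 0, 1) * hval       # simulated mortgages, below home value
hequity <- hval - hmort                # exact linear combination of the two
tw <- 1000 + 1.2 * hequity + rnorm(n, sd = 5000)   # hypothetical outcome

# All three included: the last collinear term cannot be estimated (NA)
fit_all <- lm(tw ~ hmort + hval + hequity)
coef(fit_all)

# Dropping hequity removes the redundancy; every coefficient is estimated
fit_ok <- lm(tw ~ hmort + hval)
coef(fit_ok)

# The design matrix has 4 columns (with the intercept) but only rank 3
X <- model.matrix(~ hmort + hval + hequity)
qr(X)$rank
```

The same check applies to the education dummies: including the intercept together with all four of nohs, hs, smcol, and col produces a rank-deficient design matrix.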

● Transformations of the plain predictors listed on Slide 3: use what you have learned from Topic 6: Flexible Linear Models

○ Polynomial transformation

○ The spline basis representation

○ Transformation using binary indicators

○ Generalized additive models (GAM)

○ Interacting dummy variables with other variables; for example, age x twoearn

● Before transforming the plain predictors, scatter plots may help you to visualize how each predictor is associated with the total wealth. For example, you may see a nonlinear relationship so you might want to consider some type of polynomial transformation or the spline basis representation
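Several of the transformations listed above can be written directly in an `lm()` formula. The following is a minimal sketch on simulated data (the data frame `sim`, the bin boundaries, and the degrees of freedom are illustrative assumptions, not recommendations); GAMs are left out because they need an additional package.

```r
library(splines)  # ships with R; provides bs() for B-spline bases

set.seed(2)
n <- 300
sim <- data.frame(
  age     = sample(25:64, n, replace = TRUE),
  inc     = runif(n, 10000, 100000),
  twoearn = rbinom(n, 1, 0.4)
)
# Hypothetical nonlinear relationship, used only so the example runs
sim$tw <- 50 * (sim$age - 25)^2 + 0.5 * sim$inc +
  300 * sim$age * sim$twoearn + rnorm(n, sd = 10000)

# Polynomial transformation: cubic polynomial in age
fit_poly <- lm(tw ~ poly(age, 3) + inc, data = sim)

# Spline basis representation: cubic B-spline in age with 4 degrees of freedom
fit_spline <- lm(tw ~ bs(age, df = 4) + inc, data = sim)

# Binary indicators: age cut into four bins, entered as dummies
sim$age_bin <- cut(sim$age, breaks = c(24, 34, 44, 54, 64))
fit_bins <- lm(tw ~ age_bin + inc, data = sim)

# Interaction of a dummy with another variable: age x twoearn
fit_int <- lm(tw ~ age * twoearn + inc, data = sim)

# Number of estimated coefficients in each specification
n_coef <- sapply(
  list(poly = fit_poly, spline = fit_spline, bins = fit_bins, interaction = fit_int),
  function(f) length(coef(f))
)
n_coef
```

Each specification is still linear in its coefficients, so it can be passed to OLS, Ridge, or Lasso like any other set of predictors.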

Collection of methods

We have already seen:

•  OLS

•  Ridge regressions

•  Stepwise selection methods

•  Lasso

Note:

1. In the project, you should select different methods from the list above and compare their prediction performance and interpretability

2. For Ridge, stepwise selection, and Lasso, don't forget to use cross-validation

3. In addition to prediction performance, you might want to think about whether the set of predictors used to predict total wealth makes intuitive sense

Compare the prediction performances of different methods (an example)

•    Partition the ENTIRE data into a training set and a test set

•    Say you have applied the Ridge regression procedure and the Lasso procedure

•     For Ridge, you use K-fold CV (Slide 12) to choose the best $\lambda$ (call it $\hat{\lambda}_{\text{Ridge}}$).

•     For Lasso, you also use K-fold CV (Slide 12) to choose the best $\lambda$ (call it $\hat{\lambda}_{\text{Lasso}}$). $\hat{\lambda}_{\text{Ridge}}$ does not necessarily equal $\hat{\lambda}_{\text{Lasso}}$.

•    Which method do you choose? Ridge or Lasso?

•    You use Ridge with $\hat{\lambda}_{\text{Ridge}}$ and Lasso with $\hat{\lambda}_{\text{Lasso}}$, respectively, to predict the outcomes with the predictors in the test set, and compute the test mean squared error $MSE_{te}$ (also called MSPE) for each method

•     If $MSE_{te}^{\text{Ridge}}$ is substantially larger than $MSE_{te}^{\text{Lasso}}$, choose Lasso; if it is substantially smaller, choose Ridge

•     If $MSE_{te}^{\text{Ridge}}$ and $MSE_{te}^{\text{Lasso}}$ are similar, choose the one whose fitted model you find easier to understand (e.g., one with fewer predictors and whose predictors are intuitive)
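The comparison described above might be sketched as follows. This is an illustration on simulated data, assuming the `glmnet` package is installed (it is not part of base R); the 80/20 split, the 10 folds, and the simulated coefficients are arbitrary choices for the example.

```r
library(glmnet)  # assumed to be installed; provides cv.glmnet()

set.seed(178)
n <- 500
p <- 10
x <- matrix(rnorm(n * p), n, p)
colnames(x) <- paste0("x", 1:p)          # hypothetical predictor names
beta <- c(3, -2, 1.5, rep(0, p - 3))     # hypothetical sparse coefficients
y <- drop(x %*% beta) + rnorm(n)

# Partition the entire data into a training set (80%) and a test set (20%)
tr <- sample(n, size = 0.8 * n)
x_tr <- x[tr, ];  y_tr <- y[tr]
x_te <- x[-tr, ]; y_te <- y[-tr]

# K-fold CV on the training set only; alpha = 0 is Ridge, alpha = 1 is Lasso
cv_ridge <- cv.glmnet(x_tr, y_tr, alpha = 0, nfolds = 10)
cv_lasso <- cv.glmnet(x_tr, y_tr, alpha = 1, nfolds = 10)

# The selected penalties; they need not be equal
lambda_ridge <- cv_ridge$lambda.min
lambda_lasso <- cv_lasso$lambda.min

# Predict on the test set at each method's selected lambda
pred_ridge <- predict(cv_ridge, newx = x_te, s = "lambda.min")
pred_lasso <- predict(cv_lasso, newx = x_te, s = "lambda.min")

# Test MSE (MSPE) for each method
mse_te_ridge <- mean((y_te - pred_ridge)^2)
mse_te_lasso <- mean((y_te - pred_lasso)^2)
c(ridge = mse_te_ridge, lasso = mse_te_lasso)
```

The test set is used only once, after both penalties have been chosen, so the two test MSEs are comparable estimates of out-of-sample error.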


K-fold cross validation

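1.       Partition the training data $T$ into $K$ separate sets of equal size: $T = (T_1, T_2, \ldots, T_K)$; e.g., $K = 5$ or $10$

2.       For a given $\lambda$ and each $k = 1, 2, \ldots, K$, estimate the model with all data excluding $T_k$

•    Denote the obtained model by $\hat{f}_{\lambda}^{-k}(\cdot)$

3.       Predict the outcomes for $T_k$ with the model from Step 2 and the input data in $T_k$

•    The predicted outcomes are $\hat{f}_{\lambda}^{-k}(x)$, where $(x, y) \in T_k$

4.       Compute the sample mean squared (prediction) error for $T_k$, known as the CV prediction error:

$\mathrm{CV}_k(\lambda) = |T_k|^{-1} \sum_{(x, y) \in T_k} \left( y - \hat{f}_{\lambda}^{-k}(x) \right)^2$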
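Steps 1 to 4 can be sketched in base R as below, using Ridge with its closed-form solution on simulated data. Averaging the fold errors and picking the $\lambda$ with the smallest average is the usual way to finish the procedure and is an assumption here, as are the grid of $\lambda$ values and the helper name `ridge_fit`. For simplicity the predictors are standardized and the outcome is centered once, up front, rather than within each fold.

```r
set.seed(178)
n <- 200
p <- 5
x <- scale(matrix(rnorm(n * p), n, p))    # simulated, standardized predictors
y <- drop(x %*% c(2, -1, 0, 0, 1)) + rnorm(n)
y <- y - mean(y)                          # centered, so no intercept is fitted

# Hypothetical helper: closed-form Ridge estimate (X'X + lambda I)^{-1} X'y
ridge_fit <- function(x, y, lambda) {
  solve(crossprod(x) + lambda * diag(ncol(x)), crossprod(x, y))
}

K <- 5
fold <- sample(rep(1:K, length.out = n))  # Step 1: random partition into K folds
lambdas <- c(0, 0.1, 1, 10, 100)          # illustrative grid of penalties

cv_err <- sapply(lambdas, function(lambda) {
  fold_err <- sapply(1:K, function(k) {
    held_out <- fold == k
    # Step 2: estimate the model on all data excluding fold k
    b <- ridge_fit(x[!held_out, ], y[!held_out], lambda)
    # Step 3: predict the outcomes in fold k
    pred <- x[held_out, ] %*% b
    # Step 4: mean squared prediction error on fold k
    mean((y[held_out] - pred)^2)
  })
  mean(fold_err)                          # average over the K folds (assumed)
})

best_lambda <- lambdas[which.min(cv_err)]
data.frame(lambda = lambdas, cv_error = cv_err)
```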