
ECON 178 S122:

Final Project Guidelines

Overview of the data

The data is from the 1991 Survey of Income and Program Participation (SIPP). You are provided with 7933 observations.

The sample consists of households whose reference person is aged 25 to 64. At least one person in each household is employed, and no one is self-employed. The unit of observation is the household reference person.

The data set contains a number of feature variables that you can choose to predict total wealth. The outcome variable (total wealth) and feature variables are described in the next slide.

Dataframe with the following variables

Variable to predict (outcome variable):

• tw: total wealth (in US $).

• Total wealth equals net financial assets, including Individual Retirement Account (IRA) and 401(k) assets, plus housing equity plus the value of business, property, and motor vehicles.

Variables related to retirement (features):

• ira: individual retirement account (IRA) (in US $).

• e401: 1 if eligible for 401(k), 0 otherwise

Financial variables (features):

• nifa: non-401k financial assets (in US $).

• inc: income (in US $).

Variables related to home ownership (features):

• hmort: home mortgage (in US $).

• hval: home value (in US $).

• hequity: home value minus home mortgage.

Other covariates (features):

• educ: education (in years).

• male: 1 if male, 0 otherwise.

• twoearn: 1 if two earners in the household, 0 otherwise.

• nohs, hs, smcol, col: dummies for education: no high school, high school, some college, college.

• age: age.

• fsize: family size.

• marr: 1 if married, 0 otherwise.

What is 401k and IRA?

• Both 401(k) and IRA are tax-deferred savings options that aim to increase individual saving for retirement

• The 401(k) plan:

• a company-sponsored retirement account where employees can contribute

• employers can match a certain % of an employee’s contribution

• 401(k) plans are offered by employers -- only employees in companies offering such plans can participate

• The feature variable e401 contains information on the eligibility

• IRA accounts:

• Individuals can participate

• No employer matching

• The feature variable ira contains the IRA account balance (in US $)

Your tasks

● Build a prediction/fitted model to predict total wealth (tw) in US dollars

● Write up a paper of up to 20 pages (not including the code), in 11-point font with 1.5 line spacing

○ Introduction

■ Briefly state the objectives of the study

○ Statistical analyses

■ Describe how you apply the tools you have learned from this course to perform the prediction task

■ You should try different methods and compare their prediction performance and interpretability

○ Conclusions

■ Summarize what you have discovered from this project

■ (Optional) Discuss caveats to the conclusions drawn from your analyses

● Bonus points

○ We held out 20% of the sample, on which we will run your proposed model and method. Students will be ranked by the accuracy of their predictions on that held-out 20%.

● The project is due on July 29 (by 5:00pm PST). Please submit your paper and code according to the instructions. Late assignments will NOT be accepted except with my prior consent for unusual circumstances permitted by University policies (proper documentation will be needed)

Grading policy

• First, please follow the policy on academic integrity stated in the syllabus:

• You are not allowed to work together with others on the final project and the bonus opportunity; you are not allowed to get any help (including but not limited to program code) from others (except the instructor and the TA) on the final project and the bonus opportunity.

• We will use tools to catch any form of plagiarism and cheating. Penalties for cheating include, among others, a failing grade for the course. In addition, the Council of Deans of Student Affairs will impose a disciplinary penalty.

• Every student in ECON178 must read, understand, agree to, and sign the integrity pledge (https://academicintegrity.ucsd.edu/forms/form-pledge.html) before completing any assignment for ECON178. After you sign the pledge form, a receipt will be emailed to you. Please include this receipt in the submission of your assignment.

• Second, the maximum number of points (without the bonus points) you can get for the project is 40. Your project grade counts for 55% of your course grade. Slide 7 provides a breakdown of the points and how your project is graded.

• Third, a maximum of 40 bonus points is awarded on the basis of how good your out-of-sample prediction is. The best prediction receives 40 points. The second-best prediction receives fewer than 40 points, and so on. The bonus points you earn count for 5% of your course grade.

• Fourth, the bonus points can only benefit your final grade. We will curve the grades without the bonus points first. For example, if you are in the A bracket, you will stay in the A bracket even if you get zero bonus points. On the other hand, if you are in the A- bracket but you get enough bonus points to move your final grade to the A bracket, then you will get an A in the end.

• Fifth, it is entirely possible that you get the maximum points on the project but zero bonus points. After all, luck may be needed to get a high enough accuracy on the out-of-sample prediction. But as explained above, you will never be penalized for not having luck. Having said this, we still expect harder work is more likely to lead to higher bonus points. So, you should put in your best effort.

Grading

	0-10 points	10-30 points	30-40 points
Analysis (50% of total points)	analysis is overly simplistic or inappropriate; little or no justification for choices of analyses is provided	analysis is appropriate; some justification for choices of analyses is provided	analysis is appropriate and informative; detailed justification for choices of analyses is provided
Results (25% of total points)	Conclusions are missing, incorrect, or not drawn from analysis; plots or tables are inappropriate	Conclusions are sensible and drawn from analysis; plots or tables are appropriate	Conclusions are not only drawn from analysis but also insightful; plots or tables are nicely presented and facilitate conveying the information
Code (15% of total points)	Code doesn't run, or the code's outputs do not match the results described in the paper	Code runs and the code's outputs mostly match the results described in the paper	Code runs and the code's outputs match the results described in the paper; code is neat and easy to read; no irrelevant code
Paper writing (10% of total points)	Writing is poor, illogical, or incoherent	Writing is mostly logical and coherent	Writing is crystal clear, logical, and coherent

Note: The TA will give a couple of examples in your discussion section of what we mean by “giving justification for choices of analyses”

How to carry out this project?

• Data can be found on Canvas

• Download the data and save it in your working directory

• To load the data into R, use the code:

data_tr <- read.table("data_tr.txt", header = TRUE, sep = "\t", dec = ".")[,-1]

• Inspecting your data and preliminary analyses

• Dependent variable (Y): tw: total wealth (in US $)

• Predictors (X): your choice (but please make sensible choices)

• Some suggestions: use scatter plots and/or simple linear regressions with OLS to visualize basic relationships between total wealth and various predictors

• In-depth analyses

• What could be the X variables in your prediction exercise?

• What methods should you use? (OLS, Ridge, Stepwise selections, Lasso)

• How do you select the best prediction/fitted model? (K-fold cross-validation, leave-one-out)
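As a rough illustration of model selection by K-fold cross-validation, the sketch below compares two OLS specifications. It runs on simulated stand-in data rather than the course data, and `cv_mse()` is a hypothetical helper written for this example, not something provided by the course. The simulated coefficients and units are arbitrary.

```r
# Simulated stand-in for data_tr (an assumption for illustration only;
# replace `sim` with the real data frame in your own analysis)
set.seed(178)
n <- 400
sim <- data.frame(inc = runif(n, 10, 150),
                  age = sample(25:64, n, replace = TRUE))
sim$tw <- 5 + 0.02 * sim$inc^2 + 0.5 * sim$age + rnorm(n, sd = 20)

# cv_mse(): hypothetical helper returning the K-fold CV estimate of the
# mean squared prediction error for a model fitted with lm()
cv_mse <- function(formula, data, K = 5) {
  y     <- data[[all.vars(formula)[1]]]                      # outcome column
  folds <- sample(rep(seq_len(K), length.out = nrow(data)))  # random fold labels
  fold_mse <- sapply(seq_len(K), function(k) {
    fit  <- lm(formula, data = data[folds != k, ])           # fit on the other K-1 folds
    pred <- predict(fit, newdata = data[folds == k, ])       # predict the held-out fold
    mean((y[folds == k] - pred)^2)
  })
  mean(fold_mse)                                             # average over the K folds
}

mse_linear <- cv_mse(tw ~ inc + age, sim)
mse_quad   <- cv_mse(tw ~ poly(inc, 2) + age, sim)
print(c(linear = mse_linear, quadratic = mse_quad))
```

The specification with the smaller CV error would be preferred. Calling the helper with `K = nrow(sim)` puts every observation in its own fold, which corresponds to leave-one-out cross-validation.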

What could be the X variables in your prediction exercise?

● The plain predictors listed on Slide 3

○ Watch out for perfect collinearity: you do not want to include predictors that are perfectly collinear.

■ For example, do not include hmort (home mortgage), hval (home value), and hequity (home value minus home mortgage) all at the same time, because hequity = hval - hmort. One solution: drop hequity from your models.

■ As another example of perfect collinearity, suppose you include the intercept term (a column of 1s) and all four dummy variables nohs, hs, smcol, col (no high school, high school, some college, college). Then nohs + hs + smcol + col equals a column of 1s (the intercept). One solution: drop one of the education dummies from your models.
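The effect of perfect collinearity can be seen in a small sketch. The data below are simulated stand-ins that only mimic the relationship hequity = hval - hmort; the coefficients are arbitrary assumptions. With all three housing variables included, `lm()` cannot estimate a separate coefficient for the redundant one and reports `NA` for it.

```r
# Simulated stand-in for the housing variables (assumption for illustration)
set.seed(178)
n <- 200
sim <- data.frame(hval = runif(n, 50000, 400000))
sim$hmort   <- sim$hval * runif(n, 0, 0.9)
sim$hequity <- sim$hval - sim$hmort            # exact linear combination
sim$tw      <- 10000 + 0.8 * sim$hequity + rnorm(n, sd = 20000)

# All three housing variables together: the last one is not estimable
fit_all <- lm(tw ~ hmort + hval + hequity, data = sim)
print(coef(fit_all))

# The design matrix has 4 columns (including the intercept) but lower rank
X <- model.matrix(~ hmort + hval + hequity, data = sim)
print(c(columns = ncol(X), rank = qr(X)$rank))

# Dropping hequity removes the redundancy
fit_ok <- lm(tw ~ hmort + hval, data = sim)
print(coef(fit_ok))
```

The same rank check applies to the education dummies: a design matrix with an intercept plus nohs, hs, smcol, and col would also have rank one less than its number of columns.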

● Transformations of the plain predictors listed on Slide 3: use what you have learned from Topic 6, Flexible Linear Models

○ Polynomial transformation

○ The spline basis representation

○ Transformation using binary indicators

○ Generalized additive models (GAM)

○ Interacting dummy variables with other variables; for example, age x twoearn

● Before transforming the plain predictors, use scatter plots to visualize how each predictor is associated with total wealth. For example, if you see a nonlinear relationship, you might consider a polynomial transformation or the spline basis representation
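The transformations listed above can be sketched in R as follows. This is a minimal example on simulated stand-in data with arbitrary units, not the course data. The degrees, `df` values, and age cut points are illustrative assumptions rather than recommendations. The additive fit uses least squares with spline bases as a simple stand-in for a GAM.

```r
library(splines)  # provides bs(); the splines package ships with R

# Simulated stand-in for data_tr (assumption for illustration only)
set.seed(178)
n <- 500
sim <- data.frame(inc     = runif(n, 10, 150),
                  age     = sample(25:64, n, replace = TRUE),
                  twoearn = rbinom(n, 1, 0.4))
sim$tw <- 0.02 * sim$inc^2 + 0.5 * sim$age + 10 * sim$twoearn + rnorm(n, sd = 20)

# Scatter plot first: visible curvature suggests a nonlinear term in inc
plot(sim$inc, sim$tw, xlab = "inc", ylab = "tw")

fit_poly   <- lm(tw ~ poly(inc, 3) + age, data = sim)              # polynomial in inc
fit_spline <- lm(tw ~ bs(inc, df = 5) + age, data = sim)           # cubic B-spline basis
fit_bins   <- lm(tw ~ cut(age, breaks = c(24, 34, 44, 54, 64)) + inc,
                 data = sim)                                       # binary indicators for age bins
fit_add    <- lm(tw ~ bs(inc, df = 5) + bs(age, df = 5), data = sim)  # additive spline fit
fit_inter  <- lm(tw ~ inc + age * twoearn, data = sim)             # age x twoearn interaction

# Number of coefficients and in-sample R^2 for each specification
fits <- list(poly = fit_poly, spline = fit_spline, bins = fit_bins,
             additive = fit_add, interaction = fit_inter)
print(sapply(fits, function(f) c(n_coef = length(coef(f)),
                                 r2 = summary(f)$r.squared)))
```

In-sample R^2 always favors the more flexible fit, so the comparison that matters for the project is out-of-sample error, for example from cross-validation.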

Collection of methods

We have already seen:

• OLS

• Ridge regressions

• Stepwise selection methods

• Lasso

Note:

1. In the project, you should select different methods from the list above and compare their prediction performance and interpretability

2. For Ridge, Stepwise selection, and Lasso, don’t forget the use of Cross-Validation

3. In addition to prediction performance, you might want to think about whether the set of predictors used to predict total wealth makes intuitive sense
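As one possible starting point for stepwise selection, the sketch below runs forward selection with base R's `step()` on simulated stand-in data. Note the swap: `step()` chooses variables by AIC, not by cross-validation, so it only illustrates the mechanics; for the project, the candidate models along the path would still need to be compared by cross-validation. In the simulation, fsize and male are pure noise by construction.

```r
# Simulated stand-in for data_tr (assumption for illustration only)
set.seed(178)
n <- 500
sim <- data.frame(inc   = runif(n, 10, 150),
                  age   = sample(25:64, n, replace = TRUE),
                  fsize = sample(1:6, n, replace = TRUE),
                  male  = rbinom(n, 1, 0.5))
sim$tw <- 10 + 1.2 * sim$inc + 0.8 * sim$age + rnorm(n, sd = 15)

# Forward selection: start from the intercept-only model and add terms
# from the scope one at a time while AIC keeps improving
null_fit <- lm(tw ~ 1, data = sim)
fwd <- step(null_fit, scope = ~ inc + age + fsize + male,
            direction = "forward", trace = 0)

selected <- attr(terms(fwd), "term.labels")
print(selected)
```

Inspecting `selected` is also a quick way to check whether the chosen predictors make intuitive sense.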

Compare the prediction performances of different methods (an example)

• Partition the ENTIRE data into a training set and test set

• Say, you have applied the Ridge regression procedure and the Lasso procedure

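• For Ridge, you use K-fold CV (Slide 12) to choose the best $\lambda$ (call it $\hat{\lambda}_{\text{Ridge}}$).

• For Lasso, you also use K-fold CV (Slide 12) to choose the best $\lambda$ (call it $\hat{\lambda}_{\text{Lasso}}$).

• $\hat{\lambda}_{\text{Ridge}}$ does not necessarily equal $\hat{\lambda}_{\text{Lasso}}$.

• Which method do you choose? Ridge or Lasso?

• You use Ridge with $\hat{\lambda}_{\text{Ridge}}$ and Lasso with $\hat{\lambda}_{\text{Lasso}}$, respectively, to predict the outcomes with the predictors in the test set, and compute the test MSE (also called MSPE)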
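The comparison described above might look like the following sketch. It assumes the glmnet package is installed (the slides do not name a package) and uses simulated stand-in data with arbitrary units. The 80/20 split and 10 folds are illustrative choices.

```r
library(glmnet)  # assumption: glmnet is installed; it is not part of base R

# Simulated stand-in for data_tr (assumption for illustration only)
set.seed(178)
n <- 600
sim <- data.frame(inc   = runif(n, 10, 150),
                  age   = sample(25:64, n, replace = TRUE),
                  fsize = sample(1:6, n, replace = TRUE),
                  educ  = sample(8:18, n, replace = TRUE))
sim$tw <- 10 + 1.2 * sim$inc + 0.8 * sim$age + rnorm(n, sd = 15)

# 1. Partition the entire data into a training set and a test set
train <- sample(seq_len(n), size = round(0.8 * n))
x <- model.matrix(tw ~ ., data = sim)[, -1]  # predictor matrix, intercept column dropped
y <- sim$tw

# 2. Choose lambda by K-fold CV on the training set only
cv_ridge <- cv.glmnet(x[train, ], y[train], alpha = 0, nfolds = 10)  # alpha = 0: Ridge
cv_lasso <- cv.glmnet(x[train, ], y[train], alpha = 1, nfolds = 10)  # alpha = 1: Lasso

# 3. Predict test-set outcomes, each method at its own CV-chosen lambda
pred_ridge <- predict(cv_ridge, newx = x[-train, ], s = "lambda.min")
pred_lasso <- predict(cv_lasso, newx = x[-train, ], s = "lambda.min")

# 4. Test MSE (MSPE) for each method
mspe <- c(ridge = mean((y[-train] - pred_ridge)^2),
          lasso = mean((y[-train] - pred_lasso)^2))
print(c(lambda_ridge = cv_ridge$lambda.min, lambda_lasso = cv_lasso$lambda.min))
print(mspe)
```

The two chosen values of lambda generally differ, and the test set plays no role in choosing either of them; it is used only for the final comparison.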