STAT 101 A Summer A 2022
Predicting Car Prices
Problem Statement
A Chinese automobile company, Geely Auto, aspires to enter the US market by setting up a manufacturing unit there and producing cars locally to compete with its US and European counterparts. They have contracted an automobile consulting company to understand the factors on which the pricing of cars depends. Specifically, they want to understand the factors affecting the pricing of cars in the American market, since those may be very different from the Chinese market. The company wants to know:
Which variables are significant in predicting the price of a car.
How well those variables describe the price of a car.
Based on various market surveys, the consulting firm has gathered a large data set of different types of cars across the American market.
Business Goal
We are required to model the price of cars with the available “independent” variables. Management will use the model to understand exactly how prices vary with the independent variables. They can then adjust the design of the cars, the business strategy, etc., to meet certain price levels. Further, the model will be a good way for management to understand the pricing dynamics of a new market.
Data Description:
The data at hand is divided into two data sets: Training and Testing. The training data set contains 1500 observations and has 21 predictors and the response variable PriceNew.
The testing data set contains 938 observations and has 21 predictors and the response variable PriceNew.
I have already taken care of all missing values in the data sets.
The variables are described as follows:
"Type"
"MPG.highway"
"AirBags"
"DriveTrain"
"Cylinders"
"EngineSize"
"Horsepower"
"RPM"
"Rev.per.mile"
"Man.trans.avail"
"Fuel.tank.capacity"
"Passengers"
"Length"
"Wheelbase"
"Width"
"Turn.circle"
"Rear.seat.room"
"Luggage.room"
"Weight"
"Origin"
"Make"
"PriceNew"
Project’s Main Goals
Use the training data to build a valid MLR.
Check diagnostics.
Compete to make your MLR model the “best” it can be (create new variables out of existing ones, apply transformations, check leverage points and outliers, etc.).
Your task is to predict the prices of the cars in the testing data, create a solution file, and submit it on Kaggle to check the accuracy of your predictions.
The competition ranks students’ submissions by their testing R². The best models are accurate, valid, and simple.
The submission file must be in CSV format only, with two columns and 938 rows: the first column named “Ob” and the second named “PriceNew”.
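As a rough illustration of the scoring and submission steps above, the sketch below computes R² and builds a file in the required two-column layout. Python is used here only for illustration; the project itself is done in R. Several things are assumptions and not part of the assignment text: that the ranking uses the usual definition R² = 1 - SSE/SST, that “Ob” is the row number 1 to 938 of each testing observation, and the function names r_squared and build_submission_csv, which are hypothetical.

```python
import csv
import io

N_TEST_ROWS = 938  # number of testing observations stated in the assignment


def r_squared(actual, predicted):
    """Return 1 - SSE/SST for paired observed and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    mean_actual = sum(actual) / len(actual)
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    sst = sum((y - mean_actual) ** 2 for y in actual)
    if sst == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")
    return 1 - sse / sst


def build_submission_csv(predicted_prices):
    """Return CSV text with the header Ob,PriceNew and one row per prediction."""
    if len(predicted_prices) != N_TEST_ROWS:
        raise ValueError(
            f"expected {N_TEST_ROWS} predictions, got {len(predicted_prices)}"
        )
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Ob", "PriceNew"])
    # Assumption: Ob is the 1-based row number of the testing observation.
    for ob, price in enumerate(predicted_prices, start=1):
        writer.writerow([ob, price])
    return buffer.getvalue()


# Placeholder predictions (a hypothetical constant); replace with model output.
placeholder_predictions = [20000.0] * N_TEST_ROWS
submission_text = build_submission_csv(placeholder_predictions)
```

Checking the row count before writing catches the most common formatting mistake (a file with fewer or more than 938 rows) before anything is uploaded.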
Key assumptions of Multiple Regression:
To perform multiple linear regression, the following assumptions must be met:
--- Before model construction: ---
Linear relationship: The dependent variable Y (i.e., Price) has a linear relationship with the independent variables X. To verify this, check that the scatter plot of Y against each X shows a roughly linear pattern.
No multicollinearity: Multiple regression assumes that the independent variables X are not strongly correlated with each other. This assumption is checked using the variance inflation factor (VIF) or a correlation matrix.
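The correlation-matrix check can be sketched as follows. For reference, the VIF of predictor j is 1/(1 - R_j²), where R_j² comes from regressing that predictor on all the others; the sketch covers only the simpler pairwise screen. Python is used only for illustration, the 0.8 cutoff is an assumed rule of thumb and not a course requirement, the numbers are toy values and not project data, and the function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Sample Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def highly_correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for predictor pairs with |r| >= threshold."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            flagged.append((a, b, r))
    return flagged


# Toy values for illustration only; they are NOT taken from the project data.
toy_columns = {
    "EngineSize": [1.8, 2.2, 3.0, 3.8, 4.6],
    "Horsepower": [110, 130, 170, 200, 250],
    "RPM": [6000, 5200, 5600, 4800, 5400],
}
flagged_pairs = highly_correlated_pairs(toy_columns)
```

A flagged pair is a prompt to look closer (for example, by computing VIFs for the fitted model), not an automatic reason to drop a variable.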
--- After: Residual analysis of the model ---
Normality of Error Distribution
Independence of errors
Homoscedasticity (constant variance of the errors)
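Taken together, the three residual assumptions are usually summarized in the model statement below. The notation is an assumption for illustration and does not come from the assignment: Y_i is PriceNew for observation i, x_{i1}, ..., x_{ip} are the predictor columns in the fitted model (after any dummy coding or transformations), and n is the number of training observations.

```latex
Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + e_i,
\qquad e_i \overset{\text{iid}}{\sim} N(0, \sigma^2),
\qquad i = 1, \dots, n
```

Here "iid" covers independence of errors, the normal distribution covers normality, and the single variance σ² shared by every observation is the homoscedasticity condition.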
Grading Scheme:
1. The First One-Third of the project’s grade is based on the Kaggle Rankings.
2. The Second One-Third of the project’s grade is based on the Validity and the Simplicity of the final MLR model. You are not allowed to use Machine Learning functions and tools (like Random Forests, Ridge Regression, etc.) to create your predictive model. The lm, glm, step and regsubsets functions are allowed.
3. The Last One-Third of the project’s grade is based on the final paper write-up. (No R code in your paper unless it is in the appendix.)
4. Make sure you use your full name on your Kaggle account. If you fail to do so, you may lose up to 10% of your final project grade. A sample name: “First Name Last Name”
2022-07-22