Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

STAT 101 A Summer A 2022

Predicting Car Prices

Problem Statement

A Chinese automobile company Geely Auto aspires to enter the US market by        setting up their manufacturing unit there and producing cars locally to give             competition to their US and European counterparts. They have contracted an          automobile consulting company to understand the factors on which the pricing of   cars depends. Specifically, they want to understand the factors affecting the pricing of cars in the American market, since those may be very different from the Chinese market. The company wants to know:

   Which variables are significant in predicting the price of a car.    How well those variables describe the price of a car

   Based on various market surveys, the consultingfirm has gathered a large

data set of different types of cars across the America market.

Business Goal

We are required to model the price of cars with the available independent         variables. It will be used by the management to understand how exactly the prices  vary with the independent variables. They can accordingly manipulate the design   of the cars, the business strategy etc. to meet certain price levels. Further, the         model will be a good way for management to understand the pricing dynamics of a new market.

Data Description:

The data at hand is divided into two data sets: Training and Testing.                The training data set contains 1500 observations and has 21 predictors and the response variable PriceNew.

The testing data set contains 938 observations and has 21 predictors and the response variable PriceNew.

I have already taken care of all missing values in the data sets.

The variables are described as follows:

   "Type"

   "MPG.highway"

   "AirBags"

   "DriveTrain"

   "Cylinders"

   "EngineSize"

   "Horsepower"

   "RPM"

   "Rev.per.mile"

   "Man.trans.avail"

   "Fuel.tank.capacity"

   "Passengers"

   "Length"

   "Wheelbase"

   "Width"

   "Turn.circle"

   "Rear.seat.room"

   "Luggage.room"

   "Weight"

   "Origin"

   "Make"

   "PriceNew"

Projects Main Goals

   Use the training data to build a valid MLR.

   Check diagnostics

   Compete to make your MLR model the best” it can be. (create new

variables out of existing ones, transformations, checking leverages and outliers, …etc.)

   Your Task is to predict the prices of the cars in the testing data and create a

solution file and submit it on kaggle to check your predictions accuracies.

   The Competition ranks students’ submissions based on their testing R2 .    Accurate, Valid and Simple are the best models.

   The submission file must have two columns with 938 rows: The first column

named Ob and the second named PriceNew” in a csv format only.

Key assumptions of Multiple Regression:

To perform multiple linear regression, the following assumptions must be met: --- Before model construction: ---

    Linear relationship: The dependent variable Y (i.e Price) has a linear relationship

with the independent variables X, and to verify this, one must ensure that the XY dispersion graph is linear.

    No multi-collinearity: Multiple regression assumes that independent variables X

are not strongly correlated with each other. This assumption is tested   using Variance Inflation Factor (VIF) or using Correlation Matrix.

--- After: Residual analysis of the model ---

    Normality of Error Distribution

    Independence of errors

    Homo-scedasticity

Grading Scheme:

1.  The First One-Third of the project’s grade is based on the Kaggle Rankings.

2.  The Second One-Third of the project’s grade is based on the Validity and the Simplicity of the final MLR model. (You are not allowed to use Machine      Learning functions and tools (Like Random Forests, Ridge Regression, ...

etc) to create your predictive model. lm, glm, step and regsubsets functions are allowed

3.  The Last One-Third of the project’s grade is based on the final paper writ- up. (No R-codes in your paper-unless it is in the appendix).

4.  Make sure you use your full name on your Kaggle account. Failing to do so, you may lose up to 10% of your final project grade.  A sample Name: “First Name Last Name”