Provide your answers to each question, including relevant figures (SAS Viya outputs), in a Word or PDF document.

Task 1: Utilizing the ‘HOUSEPRICE’ dataset, address the following questions:

Q1. Conduct a simple exploratory data analysis (EDA) to gain insights into the data's characteristics. Throughout this process, identify the attributes that you would select to predict the 'SalePrice' and assess the presence of multicollinearity.
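As one possible way to screen for multicollinearity outside SAS Viya, the sketch below computes pairwise Pearson correlations between candidate predictors and flags pairs above a cutoff. The column names, the toy values, and the 0.8 threshold are all assumptions made for illustration; none of them come from the HOUSEPRICE dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def collinear_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for predictor pairs with |r| >= threshold."""
    names = list(columns)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                flagged.append((a, b, round(r, 3)))
    return flagged

# Hypothetical toy values; the column names are placeholders,
# not attributes taken from HOUSEPRICE.
predictors = {
    "LivingArea": [1200, 1500, 1700, 2100, 2500],
    "Rooms": [5, 6, 7, 8, 10],
    "YearBuilt": [1990, 1975, 2005, 1960, 2010],
}
flagged = collinear_pairs(predictors)
for a, b, r in flagged:
    print(f"{a} and {b} are strongly correlated (r = {r})")
```

A flagged pair suggests keeping only one of the two attributes, or confirming the finding with variance inflation factors in the modelling tool.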

Q2. Examine the data for potential issues such as duplicates and formatting inconsistencies in the selected attributes. Perform data preprocessing if necessary. If data preprocessing is not required, provide explanations supported by relevant validations.
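The checks described above can be sketched in a few lines, assuming the data has been exported as a list of records with string values. The field names and sample rows are hypothetical and only show what a duplicate and a formatting inconsistency might look like.

```python
from collections import Counter

def find_duplicates(rows):
    """Return each fully identical row that occurs more than once, with its count."""
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}

def non_numeric(rows, field):
    """Return the values of `field` that cannot be parsed as a number."""
    bad = []
    for row in rows:
        try:
            float(row[field])
        except (TypeError, ValueError):
            bad.append(row[field])
    return bad

# Hypothetical records; "Id" and the sample values are illustrative only.
rows = [
    {"Id": "1", "SalePrice": "208500"},
    {"Id": "2", "SalePrice": "181,500"},   # thousands separator: formatting issue
    {"Id": "1", "SalePrice": "208500"},    # exact duplicate of the first row
]
duplicates = find_duplicates(rows)
bad_prices = non_numeric(rows, "SalePrice")
print(len(duplicates), "duplicated row(s);", "unparseable prices:", bad_prices)
```

If both results come back empty for the selected attributes, that output is itself the validation that no preprocessing is needed.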

Q3. Develop a machine learning model to predict the 'SalePrice' and evaluate its performance. Describe the specific type of model you selected and justify your choice. Additionally, comment on the accuracy of the model.

Q4. Determine whether the developed model is statistically significant. Explain your answer.

Task 2: Utilizing the ‘FLCRASH’ dataset, address the following questions:

Q1. Create a new custom category variable based on the ‘Total Crash Injuries’ variable. This new custom category variable should contain only two categories: one for crashes with zero injuries and one for crashes with one or more injuries. Visualize the frequency of the two new categories on a bar chart. How many crashes report zero injuries?
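The two-category rule can be written as a simple mapping, sketched below. The category labels and the sample injury counts are assumptions chosen for illustration; the real counts come from the FLCRASH data.

```python
from collections import Counter

def injury_category(total_injuries):
    """Map a crash's total injury count to one of two categories."""
    return "No injuries" if total_injuries == 0 else "One or more injuries"

# Hypothetical injury counts for six crashes, not FLCRASH values.
sample = [0, 2, 0, 1, 0, 5]
counts = Counter(injury_category(n) for n in sample)

# Text stand-in for the bar chart: one '#' per crash in each category.
for label, n in counts.items():
    print(f"{label:<22} {'#' * n} ({n})")
```

The count for the zero-injury category answers the final part of the question directly.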

Q2. In Q1, you created a new categorical variable with only two values (binary). Your task now is to develop two models that can predict the value this target variable takes, given other explanatory variables. In other words, you attempt to predict if a crash is going to result in injuries (or not) given other important variables.

a. What are the two models (or techniques) you can use to predict this target variable?

b. Create one model to predict the target variable you created in Q1. Assess this model’s accuracy. What are the most important variables in predicting this target variable?

c. Create the second model to predict the target variable. Assess this model’s accuracy. What are the most important variables identified by the model to predict the target variable?

d. Compare the performance of the two models. Report and discuss the results of your comparison. Which model is the champion?
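One way to frame the comparison is to derive the same metrics from each model's confusion matrix and pick the champion on a stated criterion. The sketch below assumes accuracy as that criterion; the model names and all counts are made up for illustration and are not results from FLCRASH.

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # share of injury crashes detected
        "specificity": tn / (tn + fp),   # share of no-injury crashes detected
    }

# Hypothetical confusion-matrix counts for two placeholder models.
models = {
    "model_1": metrics(tp=40, fp=10, tn=45, fn=5),
    "model_2": metrics(tp=30, fp=5, tn=50, fn=15),
}

# Champion chosen by accuracy here; another criterion (for example
# misclassification rate on a validation partition) may be more appropriate.
champion = max(models, key=lambda name: models[name]["accuracy"])
for name, m in models.items():
    print(name, {k: round(v, 3) for k, v in m.items()})
print("Champion by accuracy:", champion)
```

Reporting sensitivity and specificity alongside accuracy matters when the two categories are imbalanced, since accuracy alone can favour a model that rarely predicts the smaller class.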

Task 3: Utilizing the ‘INSURANCE’ dataset, address the following questions:

Q1. Define new custom category variables for Age, BMI, and Charges based on the following criteria:

a. For the Age variable:

i. 0-17 years: Categorized as "Children."

ii. 18-24 years: Categorized as "Young adults."

iii. 25-44 years: Categorized as "Adults."

iv. 45-64 years: Categorized as "Middle-aged adults."

v. 65 years and above: Categorized as "Seniors" or "Older adults."

b. For the BMI variable:

i. BMI less than 18.5: Categorized as "Underweight."

ii. BMI between 18.5 and 24.9: Categorized as "Normal weight."

iii. BMI between 25 and 29.9: Categorized as "Overweight."

iv. BMI equal to or greater than 30: Categorized as "Obese."

c. For the Charges variable:

i. Charges less than or equal to 12000: Categorized as "Low."

ii. Charges greater than 12000 and less than or equal to 40000: Categorized as "Moderate."

iii. Charges greater than 40000: Categorized as "High."
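The three sets of criteria above can be expressed as threshold functions, sketched below. One assumption is labeled here: the upper bounds 24.9 and 29.9 are read as "below 25" and "below 30", and the age ranges as half-open intervals, so that fractional values such as a BMI of 24.95 do not fall between categories. The function names are illustrative.

```python
def age_category(age):
    """Age bands from criterion (a); upper bounds treated as exclusive."""
    if age < 18:
        return "Children"
    if age < 25:
        return "Young adults"
    if age < 45:
        return "Adults"
    if age < 65:
        return "Middle-aged adults"
    return "Seniors"

def bmi_category(bmi):
    """BMI bands from criterion (b); 24.9 and 29.9 read as < 25 and < 30 (assumption)."""
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25:
        return "Normal weight"
    if bmi < 30:
        return "Overweight"
    return "Obese"

def charges_category(charges):
    """Charges bands from criterion (c); both upper bounds are inclusive."""
    if charges <= 12000:
        return "Low"
    if charges <= 40000:
        return "Moderate"
    return "High"

# Hypothetical individual, not a record from INSURANCE.
print(age_category(52), "|", bmi_category(31.4), "|", charges_category(43500.0))
```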

Q2. Develop a machine learning model capable of identifying specific groups of individuals based on available attributes such as sex, smoking status, etc., as well as custom categories created in Q1. For example, consider a potential group consisting of middle-aged adults with higher charges associated with smoking and higher BMI values. Justify the choice of the machine learning model for this task and provide an analysis of its accuracy and applicability.

Q3. Why is it not possible to forecast charges in the provided dataset? If that issue were not a limitation, which type of model would you employ for forecasting? What would be the most suitable time frequency for aggregating insurance charges: daily, weekly, monthly, or annually? What factors would you consider essential for predicting insurance charges?