AD699: Data Mining for Business Analytics

Individual Assignment #2

Spring 2023

You will submit two files via Blackboard:

(1) Your write-up. This should be a PDF that includes your written answers to any questions that ask for written answers, along with the other things asked for in the prompt.

(2) Your R Script. This is the script that you will use to write your assignment. If you use Markdown, you’ll submit an .RMD rather than a .R file.

As always, remember to take advantage of your available resources. For this assignment in particular, the video library can be quite helpful. As the course slogan says, “Get After It!”

For each step, your write-up should clearly display your code and your results. For any step in the prompt that includes a question, the question should be answered in written sentences.

This model will be used to predict the average reading scores of students in California.

Main Topics: Simple Linear Regression & Multiple Linear Regression

Tasks:

● Simple Linear Regression:

For this assignment, we will use the dataset Caschool, which comes from the Ecdat package. After you have installed Ecdat, and used the library() function to bring this package into your environment, you can bring this dataset into your environment in the following way:

> data(Caschool)

A dataset description can be found by using the help function in R: ?Caschool. Variables may be referred to here in the prompt slightly differently from how they appear in the dataset.

1. Bring this dataset into your R environment.

2. Use either the str() function or the glimpse() function from dplyr to learn more about this dataset. After taking a look at the dataset description and seeing the results here, which variables in the dataset are numeric, and which are categorical?

3. Filter the dataset so that only rows from the 16 most common counties remain (this will leave you with just the counties that have 10 or more school districts).
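One possible way to keep only the most common groups is sketched below in base R. It uses a small made-up data frame (`toy`, with hypothetical columns `county` and `score`) rather than the assignment's dataset, and keeps the top 2 groups instead of 16. Ties in the counts would need separate handling.

```r
# Toy data, invented for illustration only.
toy <- data.frame(
  county = factor(c("A", "A", "A", "B", "B", "C", "D", "D", "D", "D")),
  score  = 1:10
)

# Count rows per group and sort from most to least common.
counts <- sort(table(toy$county), decreasing = TRUE)

# Names of the k most common groups (k = 2 here; the prompt asks for 16).
k <- 2
top <- names(counts)[seq_len(k)]

# Keep only rows from those groups and drop the now-unused factor levels.
toy_top <- droplevels(toy[toy$county %in% top, ])
table(toy_top$county)
```

Dropping unused levels matters later: a factor that still carries empty levels can produce confusing output in tables and models.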

4. Using your assigned seed value, create a data partition. Assign approximately 60% of the records to your training set, and the other 40% to your validation set.

a. Why is it important to partition the data before doing any sort of in-depth analysis of the variables?

Keep in mind that a seed value has no relationship to the data itself -- it’s just an arbitrary number. You can use any method that results in 60% of rows going to training, and 40% to validation, with no overlapping rows, no rows thrown away, and random selection.
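As a minimal sketch of one partitioning method that meets these conditions, the following base R code splits the built-in `mtcars` data (a stand-in, not the assignment's dataset). The seed value 699 is a placeholder, not anyone's assigned seed.

```r
# Placeholder seed; substitute your own assigned value.
set.seed(699)

df <- mtcars                      # stand-in dataset
n  <- nrow(df)

# Randomly choose about 60% of the row indices for training.
train_idx <- sample(n, size = round(0.6 * n))

train <- df[train_idx, ]
valid <- df[-train_idx, ]         # all remaining rows go to validation

c(train = nrow(train), valid = nrow(valid))
```

Because the validation set is defined as "everything not in training", no row is shared between the two sets and none is discarded.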

5. Let’s explore the relationship between readscr (average reading score in a district) and mealpct (the percentage of students in the district who qualify for free and reduced price lunches, based on low family incomes). Using ggplot, create a scatterplot that depicts readscr on the y-axis and mealpct on the x-axis. Add a best-fit line to this scatterplot. Use only your training set data to build this plot.

What does this plot suggest about the relationship between these variables? Does this make intuitive sense to you? Why or why not?

6. Now, again using training set data only, find the correlation between readscr and mealpct. Then, use cor.test() to see whether this correlation is significant. What is this correlation? Is it a strong one? Is the correlation significant?

7. Using your training set, create a simple linear regression model, with readscr as your outcome variable and mealpct as your input variable. Use the summary() function to display the results of your model.

8. What are the minimum and maximum residual values in this model?

a. Find the observation whose reading score generated the highest residual value in your model. What was the district’s actual average reading score? What did the model predict that it would be? How is the residual calculated from the two numbers that you just found?

b. Find the observation whose reading score generated the lowest residual value. What was the district’s actual average reading score? What did the model predict that it would be? How is the residual calculated from the two numbers that you just found?

c. It looks like there are some cases where this model is quite a bit “off the mark.” Write a few sentences with your thoughts about why mealpct may not perfectly predict reading scores.
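The mechanics of locating an extreme residual can be sketched as follows. The example fits a model to the built-in `mtcars` data as a stand-in (the variables `mpg` and `wt` are not part of this assignment) and pulls out the actual and fitted values for the row with the largest residual.

```r
# Stand-in model on built-in data, for illustration only.
fit <- lm(mpg ~ wt, data = mtcars)
res <- residuals(fit)

# Position of the largest (most positive) residual.
i_max <- which.max(res)

actual    <- mtcars$mpg[i_max]     # observed outcome for that row
predicted <- fitted(fit)[i_max]    # model's fitted value for that row

# The same residual, recomputed from the two numbers above.
c(actual = actual, predicted = unname(predicted),
  recomputed = unname(actual - predicted), stored = unname(res[i_max]))
```

Using `which.min(res)` in place of `which.max(res)` gives the row with the lowest residual.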

9. What is the regression equation generated by your model? Make up a hypothetical input value and explain what it would predict as an outcome. To show the predicted outcome value, you can either use a function in R, or just explain what the predicted outcome would be, based on the regression equation and some simple math.

10. Using the accuracy() function from the forecast package, assess the accuracy of your model against both the training set and the validation set. What is the purpose of making this comparison? Focus on RMSE and MAE here in particular.

11. How does your model’s RMSE compare to the standard deviation of reading scores in the training set? What can such a comparison teach us about the model?
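For readers who want to see what RMSE and MAE measure, here is a hedged base R sketch that computes them by hand on a stand-in dataset (`mtcars`), next to the standard deviation of the training outcome. The helper names `rmse` and `mae` are defined here for illustration; they are not functions from the forecast package, and the seed and split are arbitrary.

```r
# Arbitrary seed and split on stand-in data.
set.seed(1)
idx   <- sample(nrow(mtcars), 19)
train <- mtcars[idx, ]
valid <- mtcars[-idx, ]

fit <- lm(mpg ~ wt, data = train)

# Hand-written error metrics (illustrative helpers).
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))

train_rmse <- rmse(train$mpg, fitted(fit))
valid_rmse <- rmse(valid$mpg, predict(fit, newdata = valid))
train_mae  <- mae(train$mpg, fitted(fit))
valid_mae  <- mae(valid$mpg, predict(fit, newdata = valid))
sd_train   <- sd(train$mpg)

round(c(train_rmse = train_rmse, valid_rmse = valid_rmse,
        train_mae = train_mae, valid_mae = valid_mae,
        sd_train = sd_train), 3)
```

Printing the training-set standard deviation alongside the error metrics puts all of the quantities on the same scale (the units of the outcome variable), which makes them directly comparable.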

● Multiple Linear Regression: (with one extra SLR model, too)

For this part of the assignment, use the same training set and the same validation set that you used in Part I.

1. Before we go any further, let’s clean things up a bit here by getting rid of some variables. For anything you remove, take it out of both your training set and your validation set.

a. The outcome variable that we used in the first part of this assignment is one of three total test score variables in this dataset. We’ll re-use the reading score again here, but get rid of the other two test score variables now -- that will save us from possible problems later on.

b. Next -- if there are any categorical variables that have as many, or nearly as many, unique values as there are records in the dataset, get rid of them, too. Be careful here. To know whether a variable is categorical or numeric, you will sometimes have to read the dataset description. Lazily just using the str() results without thinking about variables’ meanings could cause problems here.

2. Build a correlation table in R that depicts the correlations among all of the numerical variables that you might use as predictors (use your training set to build this). Are there any variable relationships (.80 or higher) that suggest that multicollinearity could be an issue here? If so, for any strongly correlated variable pair, remove any variables that should be taken out of the model. If you removed any, how did you decide which ones to remove? If not, why did you keep the ones that you have left?
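A correlation table and a scan for strongly correlated pairs might look like the sketch below. It uses four numeric columns from the built-in `mtcars` data as stand-ins; the 0.80 cutoff comes from the prompt, and the absolute value is taken so that strong negative correlations are flagged as well (an assumption about how the threshold should be read).

```r
# Stand-in numeric predictors from built-in data.
num <- mtcars[, c("disp", "hp", "wt", "qsec")]

# Pairwise correlation table.
cm <- cor(num)
round(cm, 2)

# Flag pairs at or above the cutoff, looking only at the upper triangle
# so each pair is reported once and the diagonal is skipped.
high <- which(abs(cm) >= 0.80 & upper.tri(cm), arr.ind = TRUE)

pairs <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r    = cm[high]
)
pairs
```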

3. What are dummy variables? In a couple of sentences, describe what they are and explain their purpose. (Question #3 is a general question, and is not directly related to this particular dataset or problem).

4. Let’s try building a model with just a categorical input. (Note: Question 4 is unrelated to any other step -- you can think of it as being like its own “island”).

a. Pick any county in California -- it doesn’t matter which one. You can pick one that you’ve visited, lived in, heard of...or perhaps one whose name you happen to like. Which one did you pick?

b. Find the average (mean) reading score for your chosen county. (Note: There are many ways you could find this in R. You could use a group_by() / summarize() sequence, or any other approach).

c. Now, build a simple linear regression model, with readscr as the outcome, and county as the input.

d. What does this model predict as the test score for your county?

e. What is the relationship between your answer to 4b and your answer to 4d? In a sentence or two, why does this make sense? (Note: No detailed statistical knowledge or outside sources are required for answering this).
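The two quantities in this step can be computed side by side as sketched here. The example uses the built-in `iris` data as a stand-in, with `Species` playing the role of the categorical input and `Sepal.Length` the outcome; the group "setosa" is an arbitrary choice.

```r
# Arbitrary group from stand-in data.
grp <- "setosa"

# Plain group mean of the outcome.
grp_mean <- mean(iris$Sepal.Length[iris$Species == grp])

# Regression with a single categorical input.
fit_cat <- lm(Sepal.Length ~ Species, data = iris)

# Model prediction for a new observation from the chosen group.
pred <- predict(fit_cat, newdata = data.frame(Species = grp))

c(group_mean = grp_mean, model_prediction = unname(pred))
```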

5. Using backward elimination, build a multiple regression model with the data in your training set, with the goal of predicting the readscr variable. Start with all of the potential predictors that you have left (if you eliminated any in Step 1 or Step 2, don’t bring them back...they’re gone!)

a. Show a summary of your resulting multiple linear regression model.

6. Model metrics

a. What is the total sum of squares (SST) for your model? This can be found by summing all of the squared differences from the mean for your outcome variable.

b. What is the sum of squares due to regression (SSR) for your model? This can be found by summing all the squared differences between the fitted values and the mean for your outcome variable. Do not use any other SSR definition besides the one given in the previous sentence.

c. What is your SSR / SST? Where can you also see this value in the summary of your regression model?
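The definitions in 6a and 6b translate directly into code. The sketch below applies them to a stand-in model on the built-in `mtcars` data (the predictors `wt` and `hp` are illustrative only).

```r
# Stand-in multiple regression on built-in data.
fit <- lm(mpg ~ wt + hp, data = mtcars)
y   <- mtcars$mpg

# SST: squared differences between each outcome value and the outcome mean.
sst <- sum((y - mean(y))^2)

# SSR: squared differences between each fitted value and the outcome mean.
ssr <- sum((fitted(fit) - mean(y))^2)

c(SST = sst, SSR = ssr, ratio = ssr / sst)
```

The ratio can then be compared against the figures printed by `summary(fit)`.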

7. Getting from a t-value to a p-value. Choose one of the predictors from your model (it could be a numeric input variable or a single level from a categorical input). What is the t-value for that predictor? Using the visualize.t() function from the visualize package, create a plot of the t-distribution that shows the distribution for that t-value and the number of degrees of freedom in your model. What percent of the curve is shaded? How does this relate to the p-value for that predictor?
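As a numeric companion to the plot, the tail area under a t-distribution can be computed with base R's `pt()`. This sketch uses a stand-in model on `mtcars` and the predictor `wt`, both chosen only for illustration, and assumes the two-sided test that `summary()` reports for `lm` coefficients.

```r
# Stand-in model on built-in data.
fit   <- lm(mpg ~ wt + hp, data = mtcars)
coefs <- summary(fit)$coefficients

t_val    <- coefs["wt", "t value"]   # t-statistic for one predictor
df_resid <- df.residual(fit)         # residual degrees of freedom

# Two-sided tail area: probability beyond |t| in both tails combined.
p_val <- 2 * pt(-abs(t_val), df = df_resid)

c(t = t_val, df = df_resid, p_from_pt = p_val,
  p_from_summary = coefs["wt", "Pr(>|t|)"])
```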

8. What is your model’s F-statistic? What does the F-Statistic measure? Using R, demonstrate where the F-Statistic comes from (you can use the formula/process shown in the class slides with the Sacramento example).
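One standard way to build the F-statistic by hand is sketched below on a stand-in model (`mtcars`, predictors `wt` and `hp`). It uses the textbook form, mean square due to regression divided by mean square error; the class slides may present the same calculation with different notation.

```r
# Stand-in model on built-in data.
fit <- lm(mpg ~ wt + hp, data = mtcars)
y   <- mtcars$mpg

n <- length(y)                 # number of observations
k <- length(coef(fit)) - 1     # number of predictors (excluding intercept)

ssr <- sum((fitted(fit) - mean(y))^2)   # sum of squares due to regression
sse <- sum(residuals(fit)^2)            # sum of squared errors

# F = (SSR / k) / (SSE / (n - k - 1))
f_stat <- (ssr / k) / (sse / (n - k - 1))

c(by_hand = f_stat, from_summary = unname(summary(fit)$fstatistic["value"]))
```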

9. Make up a fictional school district, and assign attributes to it for each of the predictors in your model. What does your model predict that this district’s average test scores will be? To answer this, you can use a function in R or just explain it using the equation and some simple math.

10. Using the accuracy() function from the forecast package, assess the accuracy of your model against both the training set and the validation set. What do you notice about these results? Describe your findings in a couple of sentences. In this section, you should talk about the overfitting risk and also about the way your MLR model differed from your SLR model in terms of accuracy.

2023-03-04
