关键词 > ETX2250/ETF5922

ETX2250/ETF5922: Data Visualization and Analytics

发布时间：2023-10-18

Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

Coding questions

For the coding part of questions, you are required to write the necessary R code and input the R code into the answer box for each question (similar to Assignment 1). Code should be written in "tidyverse" style -- with the functions learned from the lectures and pipes to join commands. Assume the following libraries have been loaded into R for you when typing your R code in the answer boxes:

library(tidyverse)

library(ggplot2)

library(here)

Data background: For questions 1 - 4, we will use the data containing some information for the states of the United States in 1977. In particular, we have the murder rates, percentage of highschool graduates, and percentage of frost. Recall that we have seen this data set in the tutorials, and this is a modified version of the data set. Assume the data is already read into R and saved as `USState` with 50 rows and 6 variables. Each row represents a different state.

The first column is the state, the second is the region the state belongs to and the remaining column measures the characteristics of the state. The first ten rows of `USState` look like this:

1) Suppose we want our data frame to be in a ‘long’ format sowe can plot them easily.

Please write the necessary R code to change the data set `USState` into a long format and store the resulting data object as `USStateLong` (Hint: you could use murder:frost to select columns from `murder` to `frost`). Input your R code into the provided answer box. The resulting data frame should look like the one below.

2) Using the data stored in `USStateLong`, write R code to make a boxplot for the

variables region on the x-axis, and the values of the different variables on they-axis., and facet by variable (Hint: use scales=”free_y” in the facet_wrap function to replicate the plot). The resulting plot should look like the one below:

3) Suppose now we want to focus on the southern states only. Write R code to keep the observations for the region equal to South and save the new data frame in a data object called `USStateSouth`. The resulting data frame should look like the one below:

4) Using the data stored in `USStateSouth`, write R code to make a text plot for the variables frost on the x-axis, and the murder on they-axis and using state as the text labels. The resulting plot should look like the one below:

Concept questions

1. Compare and contrast regression and classification.

2. What are the limitations of hierarchical clustering?

3. What is the difference between single linkage and complete linkage?

4. Describe an ROC (receiver operating characteristic curve). What does it plot and what is it used for?

Computation questions

1. After the first stage of the kmeans clustering algorithm with two variables (x andy), we have three centroids represented by (we have named them for convenience):

X1	X2	cluster name
32.1	23.3	A
45.3	57.3	B
1.3	25.4	C

We have an observation (x1=12, x2=40). Using Euclidean distance and Manhattan distance, calculate which cluster the point belongs to and report this distance. Round to two decimal places.

2. Calculate the root mean square error for the predicted and true values observed in the table below. Round to two decimal places.

Actual y	Predicted y
321	123
241	212
12	13
342	432
12	12

3. For the following table of predicted default vs. whether the individual actually defaulted (positive), what are themisclassification errors for both prediction methods?

Predictd1	Predicted2	Actual
Default	No Default	Default
Default	No default	No default
Default	No default	Default
No default	Default	Default
Default	Default	No default

4. An online provider of statistics courses is interested in assessing alternative sequencing and combinations of courses and therefore wishes to conduct association analysis on its data for past students. In the table, each row represents an individual student and each column represents a statistics course that they offer as identified by the column headings.

ID	Intro	Expt design	StatWrite	Survey	DataMining	Cat Data	Regression	Forecast
1	1	0	0	0	1	0	0	0
2	0	1	0	1	0	1	0	0
3	0	1	1	1	1	1	1	0
4	1	0	0	0	0	0	0	0
5	1	0	0	0	1	0	0	0
6	0	0	0	0	1	0	1	1
7	1	0	0	0	0	0	0	0
8	0	1	1	0	0	1	0	1
9	1	0	0	0	0	0	0	0
10	0	0	0	0	0	1	1	0
11	1	0	0	0	0	0	0	0
12	0	0	0	0	1	0	0	0
13	0	0	0	0	1	0	0	0
14	0	0	0	1	1	0	0	1
15	0	0	0	1	1	0	1	1
16	1	1	1	1	0	1	0	1
17	1	0	0	0	0	0	1	0