STATS 769
1. [10 marks] This question relates to a file called TMSTrafficQuarterHour.csv that is in a directory called /course/Traffic/. The first few lines of the file are shown below.
The complete file has millions of lines.
class,siteRef,startDatetime,endDatetime,count
H,01600024,01-JUL-2020 07:15,01-JUL-2020 07:30,34.5
L,01600024,01-JUL-2020 12:45,01-JUL-2020 13:00,467
(a) Write a single shell command that uses the head program to print just the
first line of the file TMSTrafficQuarterHour.csv, pipes the result to the awk program with an awk expression that reads the input using a comma as the field separator and prints out the first, second, third, and fifth fields separated by commas, and redirects the output to a file called traffic-daily.csv. In other words, your code should extract just the first line from the file TMSTrafficQuarterHour.csv and save it in the file traffic-daily.csv. The file traffic-daily.csv that your code should produce is shown below.
class,siteRef,startDatetime,count
[3 marks]
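A pipeline of the required shape might look like the following sketch (illustrative only; the first step creates a tiny stand-in for the real file so the sketch runs anywhere — on the real data you would instead work in /course/Traffic/):

```shell
# Create a two-line stand-in for TMSTrafficQuarterHour.csv (skip on the real data):
printf '%s\n' 'class,siteRef,startDatetime,endDatetime,count' \
              'H,01600024,01-JUL-2020 07:15,01-JUL-2020 07:30,34.5' \
              > TMSTrafficQuarterHour.csv

# Part (a): first line only, keeping fields 1, 2, 3 and 5, comma-separated.
head -1 TMSTrafficQuarterHour.csv |
    awk -F, '{ print $1 "," $2 "," $3 "," $5 }' > traffic-daily.csv

cat traffic-daily.csv
# prints: class,siteRef,startDatetime,count
```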
(b) Write a single shell command that: uses the cat program to print the
file TMSTrafficQuarterHour.csv; pipes the result to the awk program with an awk expression that reads the input using a comma as the field separator, prints out the first field, then a comma, then the second field, then a comma, then the first 11 characters from the third field, then a space, then the fifth field; pipes that result to another awk program with an awk expression that reads the input using a space as the field separator, sums up the values in the second field for every unique value in the first field, and prints out all of the unique values and the respective sums; redirects that result and appends it to the file traffic-daily.csv.
In other words, your code should calculate the daily traffic counts for each site and class and add these to the file traffic-daily.csv.
The first few lines from the file traffic-daily.csv that your code should produce are shown below.
class,siteRef,startDatetime,count
L,01S00157,18-JUL-2020,3304
H,03500327,16-JUL-2020,824
HINT: the awk function substr() is similar to the R function of the same name: the first argument is a character string and the second argument says which character to start from. Note, however, that in awk the third argument is the number of characters to take (a length), whereas in R it is the position of the last character; when starting from character 1, the two coincide.
HINT: the awk expression below from Lab 3 might be useful.
{ a[$1] += $2 } END { for (i in a) print i, a[i] }
[7 marks]
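Putting the pieces together, one possible shape for the pipeline is sketched below (again with a small stand-in file so the sketch is self-contained; on the real data you would skip that step). The `NR > 1` guard is an addition beyond the question's wording: it skips the header row so the header is not summed as a spurious zero-count group.

```shell
# Stand-in data so the sketch runs anywhere (skip on the real file):
printf '%s\n' 'class,siteRef,startDatetime,endDatetime,count' \
              'H,01600024,01-JUL-2020 07:15,01-JUL-2020 07:30,34.5' \
              'H,01600024,01-JUL-2020 13:30,01-JUL-2020 13:45,27' \
              'L,01600024,01-JUL-2020 12:45,01-JUL-2020 13:00,467' \
              > TMSTrafficQuarterHour.csv

# substr($3, 1, 11) keeps the date part ("01-JUL-2020") of the start time,
# so the second awk sums the counts per class, site, and day.
cat TMSTrafficQuarterHour.csv |
    awk -F, 'NR > 1 { print $1 "," $2 "," substr($3, 1, 11) " " $5 }' |
    awk '{ a[$1] += $2 } END { for (i in a) print i "," a[i] }' >> traffic-daily.csv
```

The hint's expression prints its two fields space-separated (`print i, a[i]`); the `print i "," a[i]` variant above matches the comma-separated output shown in the question.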
2. [10 marks]
This question relates to a set of 31 CSV files that are similar to the files that we used for Lab 5. The following output shows the first few lines of one of the files.
class,siteRef,startDatetime,endDatetime,count
H,01600024,01-JUL-2020 07:15,01-JUL-2020 07:30,34.5
L,01600024,01-JUL-2020 12:45,01-JUL-2020 13:00,467
H,01600024,01-JUL-2020 13:30,01-JUL-2020 13:45,27
H,01600024,01-JUL-2020 16:15,01-JUL-2020 16:30,24.5
L,01600024,01-JUL-2020 21:15,01-JUL-2020 21:30,151.5
The following shell command and output will also be useful.
$ wc -l /course/Traffic/2*.csv
   6823237 /course/Traffic/20130101_20130331_TMSTrafficQuarterHour.csv
   6802374 /course/Traffic/20130401_20130630_TMSTrafficQuarterHour.csv
   7082306 /course/Traffic/20130701_20130930_TMSTrafficQuarterHour.csv
   6832212 /course/Traffic/20131001_20131231_TMSTrafficQuarterHour.csv
   7086147 /course/Traffic/20140101_20140331_TMSTrafficQuarterHour.csv
   7189628 /course/Traffic/20140401_20140630_TMSTrafficQuarterHour.csv
   8054401 /course/Traffic/20140701_20140930_TMSTrafficQuarterHour.csv
   7360774 /course/Traffic/20141001_20141231_TMSTrafficQuarterHour.csv
   7233782 /course/Traffic/20150101_20150331_TMSTrafficQuarterHour.csv
   7504376 /course/Traffic/20150401_20150630_TMSTrafficQuarterHour.csv
   8070468 /course/Traffic/20150701_20150930_TMSTrafficQuarterHour.csv
   7685947 /course/Traffic/20151001_20151231_TMSTrafficQuarterHour.csv
   7490609 /course/Traffic/20160101_20160331_TMSTrafficQuarterHour.csv
   7581537 /course/Traffic/20160401_20160630_TMSTrafficQuarterHour.csv
   8067886 /course/Traffic/20160701_20160930_TMSTrafficQuarterHour.csv
   7904881 /course/Traffic/20161001_20161231_TMSTrafficQuarterHour.csv
   7936031 /course/Traffic/20170101_20170331_TMSTrafficQuarterHour.csv
   8667814 /course/Traffic/20170401_20170630_TMSTrafficQuarterHour.csv
   9252599 /course/Traffic/20170701_20170930_TMSTrafficQuarterHour.csv
   8826874 /course/Traffic/20171001_20171231_TMSTrafficQuarterHour.csv
   8883259 /course/Traffic/20180101_20180331_TMSTrafficQuarterHour.csv
   9303862 /course/Traffic/20180401_20180630_TMSTrafficQuarterHour.csv
   9984134 /course/Traffic/20180701_20180930_TMSTrafficQuarterHour.csv
   9854018 /course/Traffic/20181001_20181231_TMSTrafficQuarterHour.csv
   9304726 /course/Traffic/20190101_20190331_TMSTrafficQuarterHour.csv
   9424922 /course/Traffic/20190401_20190630_TMSTrafficQuarterHour.csv
  10032254 /course/Traffic/20190701_20190930_TMSTrafficQuarterHour.csv
   9954827 /course/Traffic/20191001_20191231_TMSTrafficQuarterHour.csv
   9316767 /course/Traffic/20200101_20200331_TMSTrafficQuarterHour.csv
   9318499 /course/Traffic/20200401_20200630_TMSTrafficQuarterHour.csv
   9078839 /course/Traffic/20200701_20200930_TMSTrafficQuarterHour.csv
 257909990 total
In all of the 31 CSV files, each row contains a count of vehicles for a 15-minute interval at a specific site and for a specific class of vehicle.
The file traffic.csv contains the same information in a single file except that each row contains the daily count of traffic (at each site and for each class). The first few lines of the file traffic.csv are shown below. We will assume that the file traffic.csv is located in the current directory.
day,siteRef,class,count
21-FEB-2013,01S00376,H,1926
23-FEB-2013,01S00376,H,901.5
24-FEB-2013,01S00376,L,12360.5
25-FEB-2013,01S00376,L,9858.5
25-FEB-2013,01S00376,H,1729.5
The following code reads the file traffic.csv into an R data table and determines the class of each column of the data table.
> library(data.table)
> traffic <- fread("traffic.csv")
> lapply(traffic, class)
$day
[1] "character"
$siteRef
[1] "character"
$class
[1] "character"
$count
[1] "numeric"
(a) Calculate an estimate of the size of the data table traffic and explain
your calculation.
You can assume that there are about 1000 unique siteRef values and about 1000 unique day values.
HINT: there are four 15-minute intervals per hour and 24 hours in a day. [5 marks]
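As a sanity check on the hint, the arithmetic can be sketched in the shell. The 32-bytes-per-row figure is an assumption about R's storage, not something stated in the question: one 8-byte double for count plus three 8-byte pointers into R's cache of character strings (the strings themselves are shared across rows).

```shell
# 257,909,990 quarter-hour rows collapse to roughly 1/96 as many daily rows
# (4 intervals per hour x 24 hours = 96 intervals per day).
# At an assumed ~32 bytes per row:
echo | awk '{ printf "%.0f MB\n", 257909990 / 96 * 32 / 2^20 }'
# prints: 82 MB
```

An estimate of this order is consistent with the gc() output shown in part (b).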
(b) The following code uses gc() to monitor the memory usage as we read the file
traffic.csv into R in two different ways: once with fread() and once with read.csv().
> gc(reset=TRUE)
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   356118  19.1     666937  35.7   356118  19.1
Vcells  1777290  13.6   10926652  83.4  1777290  13.6
> traffic <- fread("traffic.csv")
> gc()
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   360673  19.3     666937  35.7   362195  19.4
Vcells 13688536 104.5   22287514 170.1 13689907 104.5
> trafficDF <- read.csv("traffic.csv")
> gc()
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   363380  19.5     977634  52.3   492730  26.4
Vcells 24538655 187.3   43128128 329.1 43099745 328.9
Explain the used and max used values (in Mb) from the three gc() function calls. [5 marks]
3. [10 marks]
This question relates to the following code, which defines several functions, including the fitLM() function, and then profiles a call to the fitLM() function.
This question also relates to the output on the next page, which shows the file profile.html viewed in a web browser.
> library(MatrixModels)
> index <- sample(rep(1:10, length.out=nrow(traffic)))
> sites <- levels(factor(traffic$siteRef))
> RMSE <- function(o, p) {
sqrt(mean((o - p)^2))
}
> fitLM <- function(i) {
      train <- traffic[index != i, ]
      train[, siteRef := factor(siteRef, levels=sites)]
      test <- traffic[index == i, ]
      test[, siteRef := factor(siteRef, levels=sites)]
      fit <- glm4(count ~ class + siteRef, data=train, sparse=TRUE)
      coef <- coef(fit)
      pred <- model.Matrix(count ~ class + siteRef, data=test, sparse=TRUE) %*% coef
      obs <- test$count
      RMSE(obs, pred[,1])
  }
> library(profvis)
> p <- profvis(fitLM(1))
> htmlwidgets::saveWidget(p, "profile.html")
(a) Identify the function in which the most time was spent during the call to
fitLM(). We will refer to this function as “bigfun.”
Identify four other functions in which the most time was spent during the call to “bigfun.” [5 marks]
(b) The following code calls the fitLM() function ten times and collects the ten
results.
> rmse <- NULL
> for (i in 1:10) {
rmse <- c(rmse, fitLM(i))
}
Identify one problem with the code above that makes it inefficient.
Write TWO different versions of the R code that would be more efficient. [5 marks]
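For context, the inefficiency is that c() re-allocates and copies the whole rmse vector on every iteration. Two common remedies (sketches only, reusing the fitLM() defined above) are to pre-allocate the result vector or to use an apply-style function:

```r
> rmse <- numeric(10)            # pre-allocate the result vector...
> for (i in 1:10) {
      rmse[i] <- fitLM(i)        # ...and assign into it by index
  }

> rmse <- sapply(1:10, fitLM)    # or let sapply() collect the results
```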
4. [10 marks]
This question relates to the following code, which defines several functions, including the fitLM() function, and then calls the fitLM() function ten times, in parallel.
> index <- sample(rep(1:10, length.out=nrow(traffic)))
> sites <- levels(factor(traffic$siteRef))
> RMSE <- function(o, p) {
sqrt(mean((o - p)^2))
}
> fitLM <- function(i) {
      train <- traffic[index != i, ]
      train[, siteRef := factor(siteRef, levels=sites)]
      test <- traffic[index == i, ]
      test[, siteRef := factor(siteRef, levels=sites)]
      fit <- glm4(count ~ class + siteRef, data=train, sparse=TRUE)
      coef <- coef(fit)
      pred <- model.Matrix(count ~ class + siteRef, data=test, sparse=TRUE) %*% coef
      obs <- test$count
      RMSE(obs, pred[,1])
  }
> library(parallel)
> cl <- makeCluster(10)
> clusterExport(cl, c("traffic", "index", "sites", "RMSE"))
> clusterEvalQ(cl, { library(data.table); library(MatrixModels) })
> rmse <- parLapply(cl, 1:10, fitLM)
> stopCluster(cl)
(a) Write down what the R object rmse would look like.
Explain what the clusterExport() function call is doing. Why is this function call necessary?
Explain what the clusterEvalQ() function call is doing. Why is this function call necessary? [5 marks]
The following code and output measure the time taken for each line of the code that calls the fitLM() function ten times, in parallel.
> library(parallel)
> system.time({
cl <- makeCluster(10)
})
user system elapsed
0.012 0.064 0.426
> system.time({
      clusterExport(cl, c("traffic", "index", "sites", "RMSE"))
  })
user system elapsed
9.944 1.036 16.685
> system.time({
      clusterEvalQ(cl, { library(data.table); library(MatrixModels) })
  })
   user  system elapsed
  0.000   0.000   1.873
> system.time({
rmse <- parLapply(cl, 1:10, fitLM)
})
user system elapsed
0.008 0.000 15.812
> system.time({
stopCluster(cl)
})
user system elapsed
0.000 0.004 0.002
(b) Identify which lines of the code took the longest to run.
Explain why those lines of code took so long to run.
Explain why the user time is zero for some lines of code, even when the elapsed time is large.
HINT: Remember your estimate of the size of the traffic data table. [5 marks]
5. [12 marks] Consider the topics for linear regression, subset selection and regularisation methods.
(a) Explain why one should not directly choose the OLS estimator as the final
model for linear regression. [4 marks]
(b) Let r be the output of function regsubsets() for a thorough backward selection for some data set. How many predictors are to be removed from the full model according to the criterion used here? Explain your answer.
> summary(r)$bic
 [1] -385.05 -496.26 -549.48 -561.99 -585.68 -592.27 -592.04
 [8] -595.32 -598.86 -603.99 -608.04 -601.92 -595.70
[4 marks]
(c) The following plot shows a coefficient profile of the Lasso fit to a data set. If the tuning parameter λ = 1, how many predictors will remain in the Lasso model? Explain your answer.
[Coefficient profile plot: x-axis "Log Lambda", ranging from −5 to 2; the top axis gives the number of nonzero coefficients at each tick: 12, 11, 11, 12, 8, 5, 3, 0.]
[4 marks]
6. [12 marks] Consider the topics for nonlinearity and resampling methods.
(a) A cubic regression spline fit to a data set is shown below. To obtain a better
predictive model, how should we modify the regression spline model?
[Plot of the cubic regression spline fit; x-axis "x", ranging from 6 to 14.]
[4 marks]
(b) A generalised additive model is fitted to a regression data set, with fitted
additive terms shown in the plot below. To obtain a better predictive model, how would you like to modify this model?
[Four panels showing the fitted additive terms for x1, x2, x3, and x4.]
[4 marks]
(c) An analyst wants to compare the performance of different regression methods with the aid of 10-fold cross-validation (for just 1 repetition). Describe how this should be implemented so as to make sure that the “using the same subsamples” technique is used. [4 marks]
7. [12 marks] Consider the topics for basic classification methods.
(a) The naiveBayes() function is applied to a data set. Based on the output
below, find the decision boundary.
> naiveBayes(class ~ x, data=data)
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
abnormal normal
0.6 0.4
Conditional probabilities:
x
Y          [,1] [,2]
  abnormal 4.96 1.45
  normal   1.74 1.15
[4 marks]
(b) The glm() function is applied to the same data set used in Part (a). Based
on the output below, what is the predicted class label for a new observation z = 2?
> glm(class ~ x, family="binomial", data=data)
Call: glm(formula = class ~ x, family = "binomial", data = data)
Coefficients:
(Intercept) x
       5.58        -1.84
Degrees of Freedom: 99 Total (i.e. Null);  98 Residual
Null Deviance:      135
Residual Deviance:  52.1    AIC: 56.1
[4 marks]
(c) Explain what the following R code does.
> xm = as.matrix(data$x)
> unlist(mclapply(1:10, function(k)
      sum(data$class != knn(train=xm, test=xm, cl=data$class, k=k)),
      mc.cores=10))
 [1]  0  5 11  9 13 12 12 12 12 12
[4 marks]
8. [12 marks] Consider the topics for tree-based models and ensemble methods. The following data set with 768 observations is used in all three parts of this question.
> head(data)
  x1  x2 x3 x4  x5   x6    x7 x8 class
1  6 148 72 35   0 33.6 0.627 50     B
2  1  85 66 29   0 26.6 0.351 31     A
3  8 183 64  0   0 23.3 0.672 32     B
4  1  89 66 23  94 28.1 0.167 21     A
5  0 137 40 35 168 43.1 2.288 33     B
6  5 116 74  0   0 25.6 0.201 30     A
(a) For a data set saved in variable data, a pruned tree is produced as follows.
What is the resubstitution misclassification rate for the pruned tree?
> r = tree(class ~ ., data=data)
> prune.tree(r, best=3)
node), split, n, deviance, yval, (yprob)
* denotes terminal node
1) root 768 990 A ( 0.651 0.349 )
  2) x2 < 127.5 485 480 A ( 0.806 0.194 )
    4) x8 < 28.5 271 160 A ( 0.915 0.085 ) *
    5) x8 > 28.5 214 270 A ( 0.668 0.332 ) *
  3) x2 > 127.5 283 380 B ( 0.385 0.615 ) *
[4 marks]
(b) Cross-validation is adopted as follows to help prune the tree. Based on the results, how many nodes (not just the leaf nodes) should the pruned tree have?
> cv.tree(r, method="misclass")
$size
[1] 11 8 7 3 2 1
$dev
[1] 191 191 191 184 210 270
$k
[1] -Inf 0 1 4 28 65
$method
[1] "misclass"
attr(,"class")
[1] "prune" "tree .sequence" [4 marks]
(c) A Random Forest model has also been built, with some of the results given below. Can this model be improved if more trees are constructed? Explain your answer.
> (r = randomForest(class ~ . , data=data, mtry=3))
Call:
randomForest(formula = class ~ ., data = data, mtry = 3)
               Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of error rate: 23.7%
Confusion matrix:
    A   B class.error
A 425  75     0.15000
B 107 161     0.39925
> plot(r, lwd=2)
> legend("topright", leg=colnames(r$err .rate), lwd=2, lty=1, col=1:3)
[Plot of r: error rate versus number of trees (x-axis "trees", 0 to 500), with a legend for the OOB error rate and the two class error rates.]
[4 marks]
9. [12 marks] Consider the topics for density estimation and clustering.
(a) A density estimate has been produced by the following R code. Describe the
model that is used here.
> r = densityMclust(data, G=6, modelNames="EEE")
> plot(r, data, what="density", col="blue", points.col="grey")