STATS 769
1. [10 marks] This question relates to a file called TMSTrafficQuarterHour.csv that is in a directory called /course/Traffic/. The first few lines of the file are shown below.
The complete file has millions of lines.
class,siteRef,startDatetime,endDatetime,count
H,01600024,01-JUL-2020 07:15,01-JUL-2020 07:30,34.5
L,01600024,01-JUL-2020 12:45,01-JUL-2020 13:00,467
(a) Write a single shell command that uses the head program to print just the
first line of the file TMSTrafficQuarterHour.csv, pipes the result to the awk program with an awk expression that reads the input using a comma as the field separator and prints out the first, second, third, and fifth fields separated by commas, and redirects the output to a file called traffic-daily.csv. In other words, your code should extract just the first line from the file TMSTrafficQuarterHour.csv and save it in the file traffic-daily.csv. The file traffic-daily.csv that your code should produce is shown below.
class,siteRef,startDatetime,count
[3 marks]
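A pipeline of the required shape might look like the following sketch (illustrative only; the first step creates a tiny stand-in for the real file so the sketch runs anywhere — on the real data you would instead work in /course/Traffic/):

```shell
# Create a two-line stand-in for TMSTrafficQuarterHour.csv (skip on the real data):
printf '%s\n' 'class,siteRef,startDatetime,endDatetime,count' \
              'H,01600024,01-JUL-2020 07:15,01-JUL-2020 07:30,34.5' \
              > TMSTrafficQuarterHour.csv

# Part (a): first line only, keeping fields 1, 2, 3 and 5, comma-separated.
head -1 TMSTrafficQuarterHour.csv |
    awk -F, '{ print $1 "," $2 "," $3 "," $5 }' > traffic-daily.csv

cat traffic-daily.csv
# prints: class,siteRef,startDatetime,count
```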
(b) Write a single shell command that: uses the cat program to print the
file TMSTrafficQuarterHour.csv; pipes the result to the awk program with an awk expression that reads the input using a comma as the field separator, prints out the first field, then a comma, then the second field, then a comma, then the first 11 characters from the third field, then a space, then the fifth field; pipes that result to another awk program with an awk expression that reads the input using a space as the field separator, sums up the values in the second field for every unique value in the first field, and prints out all of the unique values and the respective sums; redirects that result and appends it to the file traffic-daily.csv.
In other words, your code should calculate the daily traffic counts for each site and class and add these to the file traffic-daily.csv.
The first few lines from the file traffic-daily.csv that your code should produce are shown below.
class,siteRef,startDatetime,count
L,01S00157,18-JUL-2020,3304
H,03500327,16-JUL-2020,824
HINT: the awk function substr() is similar to the R function of the same name: the first argument is a character string and the second argument says which character to start from. Note, however, that in awk the third argument is the number of characters to take (a length), whereas in R it is the position of the last character; when starting from character 1, the two coincide.
HINT: the awk expression below from Lab 3 might be useful.
{ a[$1] += $2 } END { for (i in a) print i, a[i] }
[7 marks]
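Putting the pieces together, one possible shape for the pipeline is sketched below (again with a small stand-in file so the sketch is self-contained; on the real data you would skip that step). The `NR > 1` guard is an addition beyond the question's wording: it skips the header row so the header is not summed as a spurious zero-count group.

```shell
# Stand-in data so the sketch runs anywhere (skip on the real file):
printf '%s\n' 'class,siteRef,startDatetime,endDatetime,count' \
              'H,01600024,01-JUL-2020 07:15,01-JUL-2020 07:30,34.5' \
              'H,01600024,01-JUL-2020 13:30,01-JUL-2020 13:45,27' \
              'L,01600024,01-JUL-2020 12:45,01-JUL-2020 13:00,467' \
              > TMSTrafficQuarterHour.csv

# substr($3, 1, 11) keeps the date part ("01-JUL-2020") of the start time,
# so the second awk sums the counts per class, site, and day.
cat TMSTrafficQuarterHour.csv |
    awk -F, 'NR > 1 { print $1 "," $2 "," substr($3, 1, 11) " " $5 }' |
    awk '{ a[$1] += $2 } END { for (i in a) print i "," a[i] }' >> traffic-daily.csv
```

The hint's expression prints its two fields space-separated (`print i, a[i]`); the `print i "," a[i]` variant above matches the comma-separated output shown in the question.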
2. [10 marks]
This question relates to a set of 31 CSV files that are similar to the files that we used for Lab 5. The following output shows the first few lines of one of the files.
class,siteRef,startDatetime,endDatetime,count
H,01600024,01-JUL-2020 07:15,01-JUL-2020 07:30,34.5
L,01600024,01-JUL-2020 12:45,01-JUL-2020 13:00,467
H,01600024,01-JUL-2020 13:30,01-JUL-2020 13:45,27
H,01600024,01-JUL-2020 16:15,01-JUL-2020 16:30,24.5
L,01600024,01-JUL-2020 21:15,01-JUL-2020 21:30,151.5
The following shell command and output will also be useful.
$ wc -l /course/Traffic/2*.csv
   6823237 /course/Traffic/20130101_20130331_TMSTrafficQuarterHour.csv
   6802374 /course/Traffic/20130401_20130630_TMSTrafficQuarterHour.csv
   7082306 /course/Traffic/20130701_20130930_TMSTrafficQuarterHour.csv
   6832212 /course/Traffic/20131001_20131231_TMSTrafficQuarterHour.csv
   7086147 /course/Traffic/20140101_20140331_TMSTrafficQuarterHour.csv
   7189628 /course/Traffic/20140401_20140630_TMSTrafficQuarterHour.csv
   8054401 /course/Traffic/20140701_20140930_TMSTrafficQuarterHour.csv
   7360774 /course/Traffic/20141001_20141231_TMSTrafficQuarterHour.csv
   7233782 /course/Traffic/20150101_20150331_TMSTrafficQuarterHour.csv
   7504376 /course/Traffic/20150401_20150630_TMSTrafficQuarterHour.csv
   8070468 /course/Traffic/20150701_20150930_TMSTrafficQuarterHour.csv
   7685947 /course/Traffic/20151001_20151231_TMSTrafficQuarterHour.csv
   7490609 /course/Traffic/20160101_20160331_TMSTrafficQuarterHour.csv
   7581537 /course/Traffic/20160401_20160630_TMSTrafficQuarterHour.csv
   8067886 /course/Traffic/20160701_20160930_TMSTrafficQuarterHour.csv
   7904881 /course/Traffic/20161001_20161231_TMSTrafficQuarterHour.csv
   7936031 /course/Traffic/20170101_20170331_TMSTrafficQuarterHour.csv
   8667814 /course/Traffic/20170401_20170630_TMSTrafficQuarterHour.csv
   9252599 /course/Traffic/20170701_20170930_TMSTrafficQuarterHour.csv
   8826874 /course/Traffic/20171001_20171231_TMSTrafficQuarterHour.csv
   8883259 /course/Traffic/20180101_20180331_TMSTrafficQuarterHour.csv
   9303862 /course/Traffic/20180401_20180630_TMSTrafficQuarterHour.csv
   9984134 /course/Traffic/20180701_20180930_TMSTrafficQuarterHour.csv
   9854018 /course/Traffic/20181001_20181231_TMSTrafficQuarterHour.csv
   9304726 /course/Traffic/20190101_20190331_TMSTrafficQuarterHour.csv
   9424922 /course/Traffic/20190401_20190630_TMSTrafficQuarterHour.csv
  10032254 /course/Traffic/20190701_20190930_TMSTrafficQuarterHour.csv
   9954827 /course/Traffic/20191001_20191231_TMSTrafficQuarterHour.csv
   9316767 /course/Traffic/20200101_20200331_TMSTrafficQuarterHour.csv
   9318499 /course/Traffic/20200401_20200630_TMSTrafficQuarterHour.csv
   9078839 /course/Traffic/20200701_20200930_TMSTrafficQuarterHour.csv
 257909990 total
In all of the 31 CSV files, each row contains a count of vehicles for a 15-minute interval at a specific site and for a specific class of vehicle.
The file traffic.csv contains the same information in a single file except that each row contains the daily count of traffic (at each site and for each class). The first few lines of the file traffic.csv are shown below. We will assume that the file traffic.csv is located in the current directory.
day,siteRef,class,count
21-FEB-2013,01S00376,H,1926
23-FEB-2013,01S00376,H,901.5
24-FEB-2013,01S00376,L,12360.5
25-FEB-2013,01S00376,L,9858.5
25-FEB-2013,01S00376,H,1729.5
The following code reads the file traffic.csv into an R data table and determines the class of each column of the data table.
> library(data.table)
> traffic <- fread("traffic.csv")
> lapply(traffic, class)
$day
[1] "character"
$siteRef
[1] "character"
$class
[1] "character"
$count
[1] "numeric"
(a) Calculate an estimate of the size of the data table traffic and explain
your calculation.
You can assume that there are about 1000 unique siteRef values and about 1000 unique day values.
HINT: there are four 15-minute intervals per hour and 24 hours in a day. [5 marks]
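As a sanity check on the hint, the arithmetic can be sketched in the shell. The 32-bytes-per-row figure is an assumption about R's storage, not something stated in the question: one 8-byte double for count plus three 8-byte pointers into R's cache of character strings (the strings themselves are shared across rows).

```shell
# 257,909,990 quarter-hour rows collapse to roughly 1/96 as many daily rows
# (4 intervals per hour x 24 hours = 96 intervals per day).
# At an assumed ~32 bytes per row:
echo | awk '{ printf "%.0f MB\n", 257909990 / 96 * 32 / 2^20 }'
# prints: 82 MB
```

An estimate of this order is consistent with the gc() output shown in part (b).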
(b) The following code uses gc() to monitor the memory usage as we read the file
traffic.csv into R in two different ways: once with fread() and once with read.csv().
> gc(reset=TRUE)
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   356118  19.1     666937  35.7   356118  19.1
Vcells  1777290  13.6   10926652  83.4  1777290  13.6
> traffic <- fread("traffic.csv")
> gc()
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   360673  19.3     666937  35.7   362195  19.4
Vcells 13688536 104.5   22287514 170.1 13689907 104.5
> trafficDF <- read.csv("traffic.csv")
> gc()
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   363380  19.5     977634  52.3   492730  26.4
Vcells 24538655 187.3   43128128 329.1 43099745 328.9
Explain the used and max used values (in Mb) from the three gc() function calls. [5 marks]
3. [10 marks]
This question relates to the following code, which defines several functions, including the fitLM() function, and then profiles a call to the fitLM() function.
This question also relates to the output on the next page, which shows the file profile.html viewed in a web browser.
> library(MatrixModels)
> index <- sample(rep(1:10, length.out=nrow(traffic)))
> sites <- levels(factor(traffic$siteRef))
> RMSE <- function(o, p) {
sqrt(mean((o - p)^2))
}
> fitLM <- function(i) {
      train <- traffic[index != i, ]
      train[, siteRef := factor(siteRef, levels=sites)]
      test <- traffic[index == i, ]
      test[, siteRef := factor(siteRef, levels=sites)]
      fit <- glm4(count ~ class + siteRef, data=train, sparse=TRUE)
      coef <- coef(fit)
      pred <- model.Matrix(count ~ class + siteRef, data=test, sparse=TRUE) %*% coef
      obs <- test$count
      RMSE(obs, pred[,1])
  }
> library(profvis)
> p <- profvis(fitLM(1))
> htmlwidgets::saveWidget(p, "profile.html")
(a) Identify the function in which the most time was spent during the call to
fitLM(). We will refer to this function as “bigfun.”
Identify four other functions in which the most time was spent during the call to “bigfun.” [5 marks]
(b) The following code calls the fitLM() function ten times and collects the ten
results.
> rmse <- NULL
> for (i in 1:10) {
rmse <- c(rmse, fitLM(i))
}
Identify one problem with the code above that makes it inefficient.
Write TWO different versions of the R code that would be more efficient. [5 marks]
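For context, the inefficiency is that c() re-allocates and copies the whole rmse vector on every iteration. Two common remedies (sketches only, reusing the fitLM() defined above) are to pre-allocate the result vector or to use an apply-style function:

```r
> rmse <- numeric(10)            # pre-allocate the result vector...
> for (i in 1:10) {
      rmse[i] <- fitLM(i)        # ...and assign into it by index
  }

> rmse <- sapply(1:10, fitLM)    # or let sapply() collect the results
```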
4. [10 marks]
This question relates to the following code, which defines several functions, including the fitLM() function, and then calls the fitLM() function ten times, in parallel.
> index <- sample(rep(1:10, length.out=nrow(traffic)))
> sites <- levels(factor(traffic$siteRef))
> RMSE <- function(o, p) {
sqrt(mean((o - p)^2))
}
> fitLM <- function(i) {
      train <- traffic[index != i, ]
      train[, siteRef := factor(siteRef, levels=sites)]
      test <- traffic[index == i, ]
      test[, siteRef := factor(siteRef, levels=sites)]
      fit <- glm4(count ~ class + siteRef, data=train, sparse=TRUE)
      coef <- coef(fit)
      pred <- model.Matrix(count ~ class + siteRef, data=test, sparse=TRUE) %*% coef
      obs <- test$count
      RMSE(obs, pred[,1])
  }
> library(parallel)
> cl <- makeCluster(10)
> clusterExport(cl, c("traffic", "index", "sites", "RMSE"))
> clusterEvalQ(cl, { library(data.table); library(MatrixModels) })
> rmse <- parLapply(cl, 1:10, fitLM)
> stopCluster(cl)
(a) Write down what the R object rmse would look like.
Explain what the clusterExport() function call is doing. Why is this function call necessary?
Explain what the clusterEvalQ() function call is doing. Why is this function call necessary? [5 marks]
The following code and output measure the time taken for each line of the code that calls the fitLM() function ten times, in parallel.
> library(parallel)
> system.time({
cl <- makeCluster(10)
})
user system elapsed
0.012 0.064 0.426
> system.time({
      clusterExport(cl, c("traffic", "index", "sites", "RMSE"))
  })
user system elapsed
9.944 1.036 16.685
> system.time({
      clusterEvalQ(cl, { library(data.table); library(MatrixModels) })
  })
   user  system elapsed
  0.000   0.000   1.873
> system.time({
rmse <- parLapply(cl, 1:10, fitLM)
})
user system elapsed
0.008 0.000 15.812
> system.time({
stopCluster(cl)
})
user system elapsed
0.000 0.004 0.002
(b) Identify which lines of the code took the longest to run.
Explain why those lines of code took so long to run.
Explain why the user time is zero for some lines of code, even when the elapsed time is large.
HINT: Remember your estimate of the size of the traffic data table. [5 marks]
5. [12 marks] Consider the topics for linear regression, subset selection and regularisation methods.
(a) Explain why one should not directly choose the OLS estimator as the final
model for linear regression. [4 marks]
(b) Let r be the output of function regsubsets() for a thorough backward selection for some data set. How many predictors are to be removed from the full model according to the criterion used here? Explain your answer.
> summary(r)$bic
 [1] -385.05 -496.26 -549.48 -561.99 -585.68 -592.27 -592.04
 [8] -595.32 -598.86 -603.99 -608.04 -601.92 -595.70
[4 marks]
(c) The following plot shows a coefficient profile of the Lasso fit to a data set. If the tuning parameter λ = 1, how many predictors will remain in the Lasso model? Explain your answer.
[Coefficient profile plot: x-axis "Log Lambda", ranging from −5 to 2; the top axis gives the number of nonzero coefficients at each tick: 12, 11, 11, 12, 8, 5, 3, 0.]
[4 marks]
6. [12 marks] Consider the topics for nonlinearity and resampling methods.
(a) A cubic regression spline fit to a data set is shown below. To obtain a better
predictive model, how should we modify the regression spline model?
[Plot of the cubic regression spline fit; x-axis "x", ranging from 6 to 14.]
[4 marks]
(b) A generalised additive model is fitted to a regression data set, with fitted
additive terms shown in the plot below. To obtain a better predictive model, how would you like to modify this model?
[Four panels showing the fitted additive terms for x1, x2, x3, and x4.]
[4 marks]
(c) An analyst wants to compare the performance of different regression methods with the aid of 10-fold cross-validation (for just 1 repetition). Describe how this should be implemented so as to make sure that the “using the same subsamples” technique is used. [4 marks]
7. [12 marks] Consider the topics for basic classification methods.
(a) The naiveBayes() function is applied to a data set. Based on the output
below, find the decision boundary.
> naiveBayes(class ~ x, data=data)
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
abnormal normal
0.6 0.4
Conditional probabilities:
x
Y          [,1] [,2]
  abnormal 4.96 1.45
  normal   1.74 1.15
[4 marks]
(b) The glm() function is applied to the same data set used in Part (a). Based
on the output below, what is the predicted class label for a new observation z = 2?
> glm(class ~ x, family="binomial", data=data)
Call: glm(formula = class ~ x, family = "binomial", data = data)
Coefficients:
(Intercept) x
       5.58        -1.84
Degrees of Freedom: 99 Total (i.e. Null);  98 Residual
Null Deviance:      135
Residual Deviance:  52.1    AIC: 56.1
[4 marks]
(c) Explain what the following R code does.
> xm = as.matrix(data$x)
> unlist(mclapply(1:10, function(k)
      sum(data$class != knn(train=xm, test=xm, cl=data$class, k=k)),
      mc.cores=10))
 [1]  0  5 11  9 13 12 12 12 12 12
[4 marks]
8. [12 marks] Consider the topics for tree-based models and ensemble methods. The following data set with 768 observations is used in all three parts of this question.
> head(data)
  x1  x2 x3 x4  x5   x6    x7 x8 class
1  6 148 72 35   0 33.6 0.627 50     B
2  1  85 66 29   0 26.6 0.351 31     A
3  8 183 64  0   0 23.3 0.672 32     B
4  1  89 66 23  94 28.1 0.167 21     A
5  0 137 40 35 168 43.1 2.288 33     B
6  5 116 74  0   0 25.6 0.201 30     A
(a) For a data set saved in variable data, a pruned tree is produced as follows.
What is the resubstitution misclassification rate for the pruned tree?
> r = tree(class ~ ., data=data)
> prune.tree(r, best=3)
node), split, n, deviance, yval, (yprob)
* denotes terminal node
1) root 768 990 A ( 0.651 0.349 )
  2) x2 < 127.5 485 480 A ( 0.806 0.194 )
    4) x8 < 28.5 271 160 A ( 0.915 0.085 ) *
    5) x8 > 28.5 214 270 A ( 0.668 0.332 ) *
  3) x2 > 127.5 283 380 B ( 0.385 0.615 ) *
[4 marks]
(b) Cross-validation is adopted as follows to help prune the tree. Based on the results, how many nodes (not just the leaf nodes) should the pruned tree have?
> cv.tree(r, method="misclass")
$size
[1] 11 8 7 3 2 1
$dev
[1] 191 191 191 184 210 270
$k
[1] -Inf 0 1 4 28 65
$method
[1] "misclass"
attr(,"class")
[1] "prune" "tree .sequence" [4 marks]
(c) A Random Forest model has also been built, with some of the results given below. Can this model be improved if more trees are constructed? Explain your answer.
> (r = randomForest(class ~ . , data=data, mtry=3))
Call:
randomForest(formula = class ~ ., data = data, mtry = 3)
               Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of error rate: 23.7%
Confusion matrix:
    A   B class.error
A 425  75     0.15000
B 107 161     0.39925
> plot(r, lwd=2)
> legend("topright", leg=colnames(r$err .rate), lwd=2, lty=1, col=1:3)
[Plot of r: error rate versus number of trees (x-axis "trees", 0 to 500), with a legend for the OOB error rate and the two class error rates.]
[4 marks]
9. [12 marks] Consider the topics for density estimation and clustering.
(a) A density estimate has been produced by the following R code. Describe the
model that is used here.
> r = densityMclust(data, G=6, modelNames="EEE")
> plot(r, data, what="density", col="blue", points.col="grey")