STATS 769 Advanced Data Science Practice (Exam) SEMESTER TWO 2022
Campus: City, Offshore Online
STATISTICS
Advanced Data Science Practice
Time Allowed: 2 hours, plus 30 minutes additional time
NOTES
• Functions are often used in questions without specifying the packages they belong to. This simply means that they are the same ones as used in lectures and labs.
• Attempt all questions in Part A and Part B.
• The total marks for this examination are 100 marks.
By submitting this assessment, I agree to the following declaration:
As a member of the University’s student body, I will complete this assessment with academic
integrity and in a fair, honest, responsible, and trustworthy manner. This means that:
I will not seek out any unauthorised help in completing this assessment. Unauthorised help
includes, but is not limited to, asking another person, friend, family member, third party,
tutorial, search function or answer service, whether in person or online.
I will not discuss or share the content of the assessment with anyone else in any form during
the assessment period, including but not limited to, using a messaging service,
communication channel or discussion forum, Canvas, Piazza, Chegg, third-party website,
Facebook, Twitter, Discord, social media or any other channel within the assessment period.
I will not reproduce and/or share the content of this assessment in any domain or in any form
where it may be accessed by a third party.
I will not share my answers or thoughts regarding this assessment in any domain or in any
form within the assessment period.
I am aware the University of Auckland may use Turnitin or any other plagiarism-detection
methods to check my content.
I declare that this assessment is my own work, except where acknowledged appropriately
(e.g., use of referencing).
I declare that this work has not been submitted for academic credit in this or another
University of Auckland course, or elsewhere.
I understand the University expects all students to complete coursework with integrity and
honesty. I promise to complete all online assessments with the same academic integrity
standards and values.
Any identified form of poor academic practice or academic misconduct will be followed up and may
result in disciplinary action.
I confirm that by completing this exam I agree to the above statements in full.
Questions 1-4 are for Part A of the course.
Questions 5-9 are for Part B of the course (data mining).
For Part B, functions are often used in questions without specifying the packages they belong to.
This simply means that they are the same ones as used in lectures and labs.
1 You are in a Unix shell showing the following output:
$ ls -l
total 0
drwx------ 2 usr01 datasci 142 Sep 23 11:27 data
drwxrwxr-x 2 usr01 datasci 34 Sep 23 11:27 scripts
$ ls -l data
total 1088620
-rw------- 1 usr01 datasci 501039472 Sep 23 11:25 accounting-2021-12.csv
-rw------- 1 usr01 datasci 127162942 Sep 23 11:24 accounting-2022-01.csv
-rw------- 1 usr01 datasci 486518821 Sep 23 11:25 accounting-2022-02.csv
-rw------- 1 usr01 datasci 90 Sep 23 11:25 accounting-notes.txt
-rw-rw---- 1 usr01 datasci 9738 Sep 23 11:27 venues.csv
$ ls -l scripts
total 8
-rw-rw-rw- 1 usr01 datasci 240 Sep 23 11:26 code.R
-rwxrwxrwx 1 usr01 datasci 25 Sep 23 11:26 run.sh
Your Unix username is usr01 and you are a member of the datasci group. Write one or more
shell commands that ensure all of the following for your colleagues, who are also members of the
datasci group:
• they are able to read the accounting-YYYY-MM.csv datasets, where YYYY-MM is a year
and month, but not any other files;
• they cannot write to any of the above files or directories.
In addition, any other users outside of the datasci group should not be able to read, write or
access any of the above files or directories.
Maximum marks: 8
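A sketch of one possible set of commands for this question (assuming GNU chmod; the mkdir/touch lines merely recreate the layout from the listing so the sketch can run standalone):

```shell
# Setup only: recreate the directory layout shown in the listing above.
mkdir -p data scripts
touch data/accounting-2021-12.csv data/accounting-2022-01.csv \
      data/accounting-2022-02.csv data/accounting-notes.txt data/venues.csv
touch scripts/code.R scripts/run.sh

# Directories: owner full access, group may enter and list, others nothing.
chmod 750 data scripts
# The datasets: group may read; only the owner may write.
chmod 640 data/accounting-????-??.csv
# Every other file: owner-only access (no group read or write).
chmod 600 data/accounting-notes.txt data/venues.csv scripts/code.R
chmod 700 scripts/run.sh    # keep run.sh executable, but for the owner only
```

An equivalent symbolic form (e.g. `chmod g+r,g-w,o-rwx ...`) would also work; the key point is that only the accounting-YYYY-MM.csv files gain group read access, nothing remains group- or world-writable, and "other" users lose all permissions.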
2 Let us assume you have a directory with many CSV files in the form YYYY.csv where YYYY is a year:
$ ls -l | head -n5
total 11747328
-rw-r--r-- 1 surb939 surb939 127162942 Jul 19 13:33 1987.csv
-rw-r--r-- 1 surb939 surb939 501039472 Jul 19 13:33 1988.csv
-rw-r--r-- 1 surb939 surb939 486518821 Jul 19 13:33 1989.csv
-rw-r--r-- 1 surb939 surb939 509194687 Jul 19 13:33 1990.csv
Here are the first few lines of such a CSV file (they make up the Flights table from the Airline
data used in the lecture):
$ head 1987.csv | head -n5
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,
ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,
CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
1987,10,14,3,741,730,912,849,PS,1451,,91,79,,23,11,SAN,SFO,447,,,0,,0,,,,,
1987,10,15,4,729,730,903,849,PS,1451,,94,79,,14,-1,SAN,SFO,447,,,0,,0,,,,,
1987,10,17,6,741,730,918,849,PS,1451,,97,79,,29,11,SAN,SFO,447,,,0,,0,,,,,
1987,10,18,7,729,730,847,849,PS,1451,,78,79,,-2,-1,SAN,SFO,447,,,0,,0,,,,,
Write a shell script which uses GNU parallel with 6 parallel jobs to extract all rows from the CSV
files where the Origin or Dest column has the value SFO into a file all-SFO.csv. Note that
the resulting file is expected to be a valid CSV file (including exactly one header line). You can
assume that all the CSV files have the same column structure as shown above, but you cannot
assume that the string "SFO" will occur only in the desired columns.
(If you wish, you can use multiple shell scripts. If you do, separate them using comments showing
the file names ending with .sh.)
Maximum marks: 12
3 Let us consider a subset of the NYC Taxi dataset used in the labs and lecture which just contains
the pick-up locations of taxi trips, consisting of two columns: longitude and latitude:
$ hadoop fs -cat /data/nyctaxi/pickup/2010/2010-01.txt | head -n5
-73.992818,40.753273
-74.015895,40.711385
-73.960341,40.779052
-73.97448,40.793495
-73.980407,40.761308
The goal is to use Hadoop Map/Reduce to count how many trips start in each ZIP code (i.e.,
postal code) using R. You can assume that there is a function latlon2zip(lat, lon) which
takes the geographical coordinates of one or more locations and returns a character vector of the
same length as the number of locations with the resulting ZIP code:
> d = read.table(
    pipe("hadoop fs -cat /data/nyctaxi/pickup/2010/2010-01.txt | head"), FALSE, ",")
> latlon2zip(d[[2]], d[[1]])
[1] "10018" "10280" "10028" "10025" "10019" "10128" "10065" "10036" "10019"
Define the variables map and reduce in R such that the following code works and computes the
total number of points in each ZIP code efficiently. Explain each step of your code in comments
(including what are keys and what are values).
library(hmapred)
library(iotools)
r <- hmr(hinput("/data/nyctaxi/pickup/2010",
formatter=function(r) dstrsplit(r, list(lon=1, lat=1), sep=",")),
aux=list(latlon2zip=latlon2zip), map=map, reduce=reduce, reducers=6)
Maximum marks: 10
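While the answer itself must define map and reduce in R for hmr, the key/value pattern being asked for (the map emits per-chunk counts keyed by ZIP code; the reduce sums the partial counts per key) can be illustrated in Python. The helper latlon2zip_stub below is purely hypothetical, standing in for the latlon2zip() function of the question:

```python
from collections import Counter

def latlon2zip_stub(lat, lon):
    # Purely hypothetical stand-in for latlon2zip(): maps each coordinate
    # pair to a fake 5-digit ZIP code string.
    return [f"{int(abs(a) * 100) % 100000:05d}" for a in lat]

def map_chunk(points):
    # Map step: one chunk of (lon, lat) rows comes in; emit key/value pairs
    # where the key is the ZIP code and the value is a partial count.
    lons = [p[0] for p in points]
    lats = [p[1] for p in points]
    return Counter(latlon2zip_stub(lats, lons))

def reduce_counts(partials):
    # Reduce step: all partial counts for the same key reach the same
    # reducer; summing them gives the total number of points per ZIP code.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)

chunks = [[(-73.99, 40.75), (-74.0, 40.5)],   # chunk 1
          [(-73.99, 40.75)]]                  # chunk 2
result = reduce_counts(map_chunk(c) for c in chunks)
```

Counting map-side (rather than emitting one pair per point) is what makes the job efficient: the reducers only see one partial count per ZIP code per chunk.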
4 Can functions passed to mclapply() use any packages and variables present in the workspace
before mclapply was called? (Only if mc.cores=1; Only if the variables are exported; No;
Yes, always)
You have a large CSV file consisting of just three numeric columns (real numbers). Let's say N is
the number of lines in the file and S is the number of bytes in the file. What is the best estimate of
the memory necessary to store the content of the CSV file as a data frame in R?
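The estimate rests on one fact: R stores each numeric value as an 8-byte double, so a data frame with three numeric columns needs about 3 × 8 × N bytes (plus a small constant overhead), independent of the file size S. A quick sanity check of that arithmetic:

```python
# Each numeric value in an R data frame is an 8-byte double, so a data
# frame with 3 numeric columns and N rows needs roughly 3 * 8 * N bytes,
# regardless of how many bytes S the CSV text occupies on disk.
def dataframe_bytes(n_rows, n_numeric_cols=3, bytes_per_double=8):
    return n_numeric_cols * bytes_per_double * n_rows

approx = dataframe_bytes(1_000_000)   # about 24 MB for a million rows
```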
Which of the following values is reported by system.time() and is the closest to the time the
user has to wait for the code to finish? (user, system, elapsed, wall, cpu)
Given an XML file which starts like this:
<?xml version="1.0" encoding="UTF-8"?>
<data>
<country code="NZ" name="New Zealand">
<value variable="population" year="1972">2903900</value>
which of the following XPath queries will select the year attribute of the value node when applied
to this document? (country/value/@year, data/country/value/@year, //value@year, //country/value[year])
Maximum marks: 10
5 Part (a) [4 marks]
The naiveBayes function is applied to a data set. Based on the output below, find the predicted
class label for observation x = 4.7. (You may want to use the dnorm function in your
calculation.)
> naiveBayes(class ~ x, data=data)
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
A B
0.4 0.6
Conditional probabilities:
x
Y [,1] [,2]
A 4.2812 0.48021
B 5.0500 0.55390
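The calculation being asked for can be sketched numerically (treating the two columns of the conditional-probability table as the class-conditional mean and standard deviation, which is what naiveBayes reports for a numeric predictor; dnorm is re-implemented here as a Python analogue of R's function):

```python
from math import exp, pi, sqrt

def dnorm(x, mean, sd):
    # Normal density, analogous to R's dnorm(x, mean, sd).
    return exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * sqrt(2 * pi))

x = 4.7
prior = {"A": 0.4, "B": 0.6}                       # A-priori probabilities
params = {"A": (4.2812, 0.48021),                  # (mean, sd) per class,
          "B": (5.0500, 0.55390)}                  # from the output above

# Unnormalised posterior: prior times class-conditional density.
score = {k: prior[k] * dnorm(x, *params[k]) for k in prior}
predicted = max(score, key=score.get)
```

Normalising the two scores to sum to one would give the posterior probabilities, but the predicted label only depends on which score is larger.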
Part (b) [4 marks]
The lda function is applied to the same data set used in Part (a). Based on the output below, what
is the predicted class label for a new observation x = 6? Explain your answer.
> predict(lda(class ~ x, data=data), data.frame(x = 5:4))$posterior
A B
1 0.20833 0.79167
2 0.80923 0.19077
Part (c) [4 marks]
By making use of the information provided as follows
> xm = as.matrix(data$x)
> table(data$class, knn(train=xm, test=xm, cl=data$class, k=10))
A B
A 19 13
B 9 39
write down the output of the following R code:
> table(data$class, knn(train=xm, test=xm, cl=data$class, k=80))
Maximum marks: 12
6 Part (a) [4 marks]
Find the four most important splits in the following pruned tree.
> (r = prune.tree(tree(class ~ ., data=data, mindev=0.001), method="misclass", best=6))
node), split, n, deviance, yval, (yprob)
* denotes terminal node
1) root 238 450.0 healthy ( 0.576 0.286 0.139 )
  2) x1: high,middle 108 230.0 mild ( 0.278 0.472 0.250 )
    4) x8 < 0.5 50 97.0 healthy ( 0.500 0.380 0.120 )
      8) x7 < 2.55 45 75.0 healthy ( 0.556 0.400 0.044 )
        16) x5 < 23.5 9 15.0 mild ( 0.111 0.667 0.222 ) *
        17) x5 > 23.5 36 46.0 healthy ( 0.667 0.333 0.000 )
          34) x1: high 31 41.0 healthy ( 0.613 0.387 0.000 )
            68) x5 < 30.5 9 11.0 mild ( 0.333 0.667 0.000 ) *
            69) x5 > 30.5 22 26.0 healthy ( 0.727 0.273 0.000 ) *
          35) x1: middle 5 0.0 healthy ( 1.000 0.000 0.000 ) *
      9) x7 > 2.55 5 5.0 severe ( 0.000 0.200 0.800 ) *
    5) x8 > 0.5 58 110.0 mild ( 0.086 0.552 0.362 )
      10) x7 < 0.3 10 22.0 healthy ( 0.400 0.300 0.300 )
        20) x5 < 33 5 6.7 healthy ( 0.600 0.000 0.400 ) *
        21) x5 > 33 5 9.5 mild ( 0.200 0.600 0.200 ) *
      11) x7 > 0.3 48 72.0 mild ( 0.021 0.604 0.375 )
        22) x5 < 26.5 26 30.0 mild ( 0.000 0.731 0.269 ) *
        23) x5 > 26.5 22 37.0 severe ( 0.045 0.455 0.500 )
          46) x6: level 15 20.0 severe ( 0.000 0.400 0.600 ) *
          47) x6: norm 7 13.0 mild ( 0.143 0.571 0.286 ) *
  3) x1: low 130 150.0 healthy ( 0.823 0.131 0.046 ) *
> plot(r)
> text(r, pretty=0)
Part (b) [4 marks]
Using the pruned tree shown in Part (a), find the predicted class label for the following observation:
x1 x2 x3 x4 x5 x6 x7 x8
high NA md NA NA norm 2.4 0.4
Part (c) [4 marks]
Based on the following cross-validation study, how many leaf nodes does the best pruned tree
have? Explain your answer.
> r = tree(class ~ ., data=data, mindev=0.001)
> prune.tree(r, method="misclass")
$size
[1] 29 21 17 11 5 3 2 1
$dev
[1] 48 48 50 56 65 74 80 101
$k
[1] -Inf 0.0 0.5 1.0 1.5 4.5 6.0 21.0
$method
[1] "misclass"
attr(,"class")
[1] "prune" "tree.sequence"
> rowMeans(simplify2array(mclapply(1:20, function(...) cv.tree(r, method="misclass")$dev, mc.cores=20)))
[1] 92.35 92.40 91.65 88.65 85.55 85.80 88.40 105.60
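The comparison this part calls for can be checked mechanically: pair each candidate subtree size with its averaged cross-validated deviance and select the size with the smallest value. The two vectors below are copied from the output above:

```python
# Subtree sizes from prune.tree()$size and the averaged cross-validated
# deviances (misclassification counts) from the rowMeans(...) line above.
sizes = [29, 21, 17, 11, 5, 3, 2, 1]
cv_dev = [92.35, 92.40, 91.65, 88.65, 85.55, 85.80, 88.40, 105.60]

# The best pruned tree is the one whose averaged CV deviance is smallest.
best_size = sizes[cv_dev.index(min(cv_dev))]
```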
Part (d) [4 marks]
A Random Forest model has also been built for this data set, with some of the results given below.
Can this model be improved if more trees are constructed? Explain your answer.
> (r = randomForest(class ~ ., data=data, ntree=500, mtry=3))
Call:
randomForest(formula = class ~ ., data = data, ntree = 500, mtry = 3)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 3
OOB estimate of error rate: 38.24%
Confusion matrix:
healthy mild severe class.error
healthy 122 11 4 0.10949
mild 28 23 17 0.66176
severe 9 22 2 0.93939
> plot(r, lwd=3, ylim=c(0,1), col=1:4, lty=1)
> legend("topright", leg=colnames(r$err.rate), lwd=3, col=1:4, bg=NA)
Maximum marks: 16
7 Part (a) [4 marks]
The following graph shows a clustering result produced by a linkage method, with the two clusters
indicated using different colors. Out of the complete, single, average and centroid linkage
methods, which one was most likely used? Explain your answer.
Part (b) [4 marks]
Given a matrix x storing the points as its rows, write down the code based on the hclust function
that can be used to reproduce the clustering result and the graph in Part (a).
Maximum marks: 8
8 Part (a) [4 marks]
The classification plot of a support vector machine fitted to a data set is shown below. Is this an
over-fitted or under-fitted model? Explain your answer.
> dim(data)
[1] 100 3
> r = svm(class ~ ., data=data, scale=FALSE, kernel="radial", cost=1, gamma=1)
> plot(r, data)
Part (b) [4 marks]
To obtain a more appropriately fitted support vector machine, how would you like to adjust the
values of the parameters cost and gamma? Explain your answer.
Part (c) [4 marks]
If the argument kernel="radial" in Part (a) is replaced with kernel="linear", give a
rough estimate of the training misclassification rate, based on the points in the plot shown in Part
(a). Explain your answer. (There are no duplicated values in the predictor variables.)
Maximum marks: 12
9 Part (a) [4 marks]
The neural network shown below is built and trained, with the neuralnet function, from a data
set in data frame data, with a response variable class and all remaining variables as
predictors. Write some R code to reproduce the neural network and the plot.
Part (b) [4 marks]
In your own words, explain the meaning of the argument batch_size in the following function call
for deep learning, and find out how many times the gradient of the network is updated in
each epoch.
> dim(x)
[1] 5000 32 32 3
> history <- model %>% fit(x, y, epochs=30, batch_size=128, validation_split=0.2)
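The arithmetic behind this part can be sketched as follows (assuming, as keras documents for fit(), that validation_split=0.2 holds out 20% of the 5000 samples before batching, and that a final, smaller batch still counts as one gradient update):

```python
import math

n_samples = 5000
validation_split = 0.2
batch_size = 128

# Samples actually used for gradient updates; the held-out fraction is
# only used to compute the validation loss.
n_train = int(n_samples * (1 - validation_split))

# One gradient update per batch; a final partial batch still updates once.
updates_per_epoch = math.ceil(n_train / batch_size)
```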
Part (c) [4 marks]
If the value of epochs in Part (b) is increased from 30 to 1000, in what way would you expect the
two loss curves in the following plot to change? Explain your answer.
> plot(history)