
STATS 769 Advanced Data Science Practice (Exam)

SEMESTER TWO 2022

Campus: City, Offshore Online

STATISTICS

Advanced Data Science Practice

Time Allowed: 2 hours, plus 30 minutes additional time

NOTES

• Functions are often used in questions without specifying the packages they belong to. This simply means that they are the same ones as used in lectures and labs.

• Attempt all questions in Part A and Part B.

• The total marks for this examination are 100.

By submitting this assessment, I agree to the following declaration:

As a member of the University’s student body, I will complete this assessment with academic

integrity and in a fair, honest, responsible, and trustworthy manner. This means that:

• I will not seek out any unauthorised help in completing this assessment. Unauthorised help includes, but is not limited to, asking another person, friend, family member, third party, tutorial, search function or answer service, whether in person or online.

• I will not discuss or share the content of the assessment with anyone else in any form during the assessment period, including but not limited to, using a messaging service, communication channel or discussion forum, Canvas, Piazza, Chegg, third party website, Facebook, Twitter, Discord, social media or any other channel within the assessment period.

• I will not reproduce and/or share the content of this assessment in any domain or in any form where it may be accessed by a third party.

• I will not share my answers or thoughts regarding this assessment in any domain or in any form within the assessment period.

• I am aware the University of Auckland may use Turnitin or any other plagiarism detecting methods to check my content.

• I declare that this assessment is my own work, except where acknowledged appropriately (e.g., use of referencing).

• I declare that this work has not been submitted for academic credit in this or another University of Auckland course, or elsewhere.

I understand the University expects all students to complete coursework with integrity and honesty. I promise to complete all online assessments with the same academic integrity standards and values.

Any identified form of poor academic practice or academic misconduct will be followed up and may

result in disciplinary action.

I confirm that by completing this exam I agree to the above statements in full.

Questions 1-4 are for Part A of the course.

Questions 5-9 are for Part B of the course -- data mining.

For Part B, functions are often used in questions without specifying the packages they belong to. This simply means that they are the same ones as used in lectures and labs.

1   You are in a Unix shell showing the following output:

$ ls -l
total 0
drwx------ 2 usr01 datasci 142 Sep 23 11:27 data
drwxrwxr-x 2 usr01 datasci  34 Sep 23 11:27 scripts
$ ls -l data
total 1088620
-rw------- 1 usr01 datasci 501039472 Sep 23 11:25 accounting-2021-12.csv
-rw------- 1 usr01 datasci 127162942 Sep 23 11:24 accounting-2022-01.csv
-rw------- 1 usr01 datasci 486518821 Sep 23 11:25 accounting-2022-02.csv
-rw------- 1 usr01 datasci        90 Sep 23 11:25 accounting-notes.txt
-rw-rw---- 1 usr01 datasci      9738 Sep 23 11:27 venues.csv
$ ls -l scripts
total 8
-rw-rw-rw- 1 usr01 datasci 240 Sep 23 11:26 code.R
-rwxrwxrwx 1 usr01 datasci  25 Sep 23 11:26 run.sh

Your Unix username is usr01 and you are a member of the datasci group. Write one or more shell commands that ensure all of the following for your colleagues, who are also members of the datasci group:

• they are able to read the accounting-YYYY-MM.csv datasets, where YYYY-MM is a year and month, but not any other files

• they cannot write to any of the above files or directories

In addition, any other users outside of the datasci group should not be able to read, write or access any of the above files or directories.

Maximum marks: 8
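For orientation, here is a sketch of one possible set of commands that would satisfy these requirements (other combinations of modes are equally valid):

# data/: group members may enter and list the directory, but not modify it
chmod 750 data
# monthly datasets: group-readable, writable only by the owner
chmod 640 data/accounting-????-??.csv
# all other files under data/: owner-only access
chmod 600 data/accounting-notes.txt data/venues.csv
# scripts/: remove the group's write access and all other users' access
chmod 700 scripts
chmod 600 scripts/code.R
chmod 700 scripts/run.sh    # keep the owner's execute bit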

2    Let us assume you have a directory with many CSV files in the form YYYY.csv where YYYY is a year:

$ ls -l | head -n5
total 11747328
-rw-r--r-- 1 surb939 surb939 127162942 Jul 19 13:33 1987.csv
-rw-r--r-- 1 surb939 surb939 501039472 Jul 19 13:33 1988.csv
-rw-r--r-- 1 surb939 surb939 486518821 Jul 19 13:33 1989.csv
-rw-r--r-- 1 surb939 surb939 509194687 Jul 19 13:33 1990.csv

Here are the first few lines of such a CSV file (they make up the Flights table from the Airline data used in the lecture):

$ head 1987.csv | head -n5
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
1987,10,14,3,741,730,912,849,PS,1451,,91,79,,23,11,SAN,SFO,447,,,0,,0,,,,,
1987,10,15,4,729,730,903,849,PS,1451,,94,79,,14,-1,SAN,SFO,447,,,0,,0,,,,,
1987,10,17,6,741,730,918,849,PS,1451,,97,79,,29,11,SAN,SFO,447,,,0,,0,,,,,
1987,10,18,7,729,730,847,849,PS,1451,,78,79,,-2,-1,SAN,SFO,447,,,0,,0,,,,,

Write a shell script which uses GNU parallel with 6 parallel jobs to extract all rows from the CSV files where the Origin or Dest column has the value SFO into a file all-SFO.csv. Note that the resulting file is expected to be a valid CSV file (including exactly one header line). You can assume that all the CSV files have the same column structure as shown above, but you cannot assume that the string "SFO" will occur only in the desired columns.

(If you wish, you can use multiple shell scripts. If you do, separate them using comments showing the file names ending with .sh.)

Maximum marks: 12
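For orientation, a sketch of one possible answer (it assumes simple CSV with no embedded commas; counting the header above, Origin and Dest are columns 17 and 18):

# filter-one.sh
#!/bin/sh
# Print the rows of one CSV file (skipping its header line) where the
# Origin (column 17) or Dest (column 18) field equals SFO.
tail -n +2 "$1" | awk -F, '$17 == "SFO" || $18 == "SFO"'

# main.sh
#!/bin/sh
# Exactly one header line, taken from any one of the files...
head -n 1 1987.csv > all-SFO.csv
# ...then filter all YYYY.csv files, 6 jobs at a time, appending the matches.
ls [0-9][0-9][0-9][0-9].csv | parallel -j6 sh filter-one.sh {} >> all-SFO.csv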

3    Let us consider a subset of the NYC Taxi dataset used in the labs and lecture which just contains the pick-up locations of taxi trips, consisting of two columns: longitude and latitude:

$ hadoop fs -cat /data/nyctaxi/pickup/2010/2010-01.txt | head -n5
-73.992818,40.753273
-74.015895,40.711385
-73.960341,40.779052
-73.97448,40.793495
-73.980407,40.761308

The goal is to use Hadoop Map/Reduce to count how many trips start in each ZIP code (i.e., postal code) using R. You can assume that there is a function latlon2zip(lat, lon) which takes the geographical coordinates of one or more locations and returns a character vector of the same length as the number of locations with the resulting ZIP codes:

> d = read.table(
    pipe("hadoop fs -cat /data/nyctaxi/pickup/2010/2010-01.txt | head"), FALSE, ",")
> latlon2zip(d[[2]], d[[1]])
[1] "10018" "10280" "10028" "10025" "10019" "10128" "10065" "10036" "10019"

Define the variables map and reduce in R such that the following code works and computes the total number of points in each ZIP code efficiently. Explain each step of your code in comments (including what the keys and values are).

library(hmapred)
library(iotools)
r <- hmr(hinput("/data/nyctaxi/pickup/2010",
                formatter=function(r) dstrsplit(r, list(lon=1, lat=1), sep=",")),
         aux=list(latlon2zip=latlon2zip), map=map, reduce=reduce, reducers=6)

Maximum marks: 10
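A sketch of the kind of answer expected. It assumes, as in the labs, that the mapper receives the formatter's data frame and that the first column of the returned data frame is treated as the key, the remaining column as the value:

# map: runs on each chunk of input.
# keys = ZIP codes (from latlon2zip); values = partial counts per chunk.
# Pre-aggregating with table() inside the mapper keeps the amount of data
# shuffled to the 6 reducers small, which is what makes this efficient.
map <- function(x) {
  t <- table(latlon2zip(x$lat, x$lon))
  data.frame(zip = names(t), count = as.integer(t))
}

# reduce: all partial counts for a given key (ZIP code) arrive at the
# same reducer; summing them gives the total number of trips per ZIP.
reduce <- function(x) {
  aggregate(count ~ zip, data = x, FUN = sum)
}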

4    Can functions passed to mclapply() use any packages and variables present in the workspace before mclapply() was called? (Only if mc.cores=1; Only if the variables are exported; No; Yes, always)

You have a large CSV file consisting of just three numeric columns (real numbers). Let's say N is the number of lines in the file and S is the number of bytes in the file. What is the best estimate of the memory necessary to store the content of the CSV file as a data frame in R?

Which of the following values is reported by system.time() and is the closest to the time the user has to wait for the code to finish? (user, system, elapsed, wall, cpu)

Given an XML file which starts like this:

<?xml version="1.0" encoding="UTF-8"?>
<data>
  <country code="NZ" name="New Zealand">
    <value variable="population" year="1972">2903900</value>

which of the following XPath queries will select the year attribute of the value node when applied to this document? (country/value/@year, data/country/value/@year, //value@year, //country/value[year])

Maximum marks: 10
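As a point of reference for the memory question: R stores a numeric (double) value in 8 bytes, so three numeric columns of N rows need roughly 3 x 8 x N = 24N bytes as a data frame, essentially independent of the file size S. A quick sanity check:

# about 24 bytes per row for three double columns (plus small overheads)
d <- data.frame(a = runif(1e6), b = runif(1e6), c = runif(1e6))
object.size(d)    # roughly 24 MB for N = 1e6 rows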

5    Part (a)  [4 marks]

The naiveBayes function is applied to a data set. Based on the output below, find the predicted class label for observation x = 4.7. (You may want to use the dnorm function in your calculation.)

> naiveBayes(class ~ x, data=data)

Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
  A   B
0.4 0.6

Conditional probabilities:
   x
Y     [,1]    [,2]
  A 4.2812 0.48021
  B 5.0500 0.55390
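The hint in part (a) can be applied directly: score each class by its a-priori probability times the class-conditional normal density at x = 4.7, where the [,1] column above is the class mean and [,2] the standard deviation; the class with the larger score is the prediction:

0.4 * dnorm(4.7, mean = 4.2812, sd = 0.48021)   # unnormalised score for A
0.6 * dnorm(4.7, mean = 5.0500, sd = 0.55390)   # unnormalised score for B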

Part (b)  [4 marks]

The lda function is applied to the same data set used in Part (a). Based on the output below, what is the predicted class label for a new observation x = 6? Explain your answer.

> predict(lda(class ~ x, data=data), data.frame(x = 5:4))$posterior
        A       B
1 0.20833 0.79167
2 0.80923 0.19077

Part (c)  [4 marks]

By making use of the information provided below,

> xm = as.matrix(data$x)
> table(data$class, knn(train=xm, test=xm, cl=data$class, k=10))

     A  B
  A 19 13
  B  9 39

write down the output of the following R code:

> table(data$class, knn(train=xm, test=xm, cl=data$class, k=80))

Maximum marks: 12
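For part (c) it may help to notice that the confusion table implies 32 + 48 = 80 training observations in total. A small synthetic illustration (not the exam's data) of how knn() behaves when k equals the size of the training set: every point then has the same neighbourhood, namely the whole data set, so the majority class is predicted throughout.

library(class)
set.seed(1)
xm <- matrix(rnorm(80), ncol = 1)
cl <- factor(rep(c("A", "B"), c(32, 48)))
# k = 80 = number of training points: all points share one neighbourhood
table(cl, knn(train = xm, test = xm, cl = cl, k = 80))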

6    Part (a)  [4 marks]

Find the four most important splits in the following pruned tree.

> (r = prune.tree(tree(class ~ ., data=data, mindev=0.001), method="misclass", best=6))
node), split, n, deviance, yval, (yprob)
      * denotes terminal node

 1) root 238 450.0 healthy ( 0.576 0.286 0.139 )
   2) x1: high,middle 108 230.0 mild ( 0.278 0.472 0.250 )
     4) x8 < 0.5 50 97.0 healthy ( 0.500 0.380 0.120 )
       8) x7 < 2.55 45 75.0 healthy ( 0.556 0.400 0.044 )
        16) x5 < 23.5 9 15.0 mild ( 0.111 0.667 0.222 ) *
        17) x5 > 23.5 36 46.0 healthy ( 0.667 0.333 0.000 )
          34) x1: high 31 41.0 healthy ( 0.613 0.387 0.000 )
            68) x5 < 30.5 9 11.0 mild ( 0.333 0.667 0.000 ) *
            69) x5 > 30.5 22 26.0 healthy ( 0.727 0.273 0.000 ) *
          35) x1: middle 5 0.0 healthy ( 1.000 0.000 0.000 ) *
       9) x7 > 2.55 5 5.0 severe ( 0.000 0.200 0.800 ) *
     5) x8 > 0.5 58 110.0 mild ( 0.086 0.552 0.362 )
      10) x7 < 0.3 10 22.0 healthy ( 0.400 0.300 0.300 )
        20) x5 < 33 5 6.7 healthy ( 0.600 0.000 0.400 ) *
        21) x5 > 33 5 9.5 mild ( 0.200 0.600 0.200 ) *
      11) x7 > 0.3 48 72.0 mild ( 0.021 0.604 0.375 )
        22) x5 < 26.5 26 30.0 mild ( 0.000 0.731 0.269 ) *
        23) x5 > 26.5 22 37.0 severe ( 0.045 0.455 0.500 )
          46) x6: level 15 20.0 severe ( 0.000 0.400 0.600 ) *
          47) x6: norm 7 13.0 mild ( 0.143 0.571 0.286 ) *
   3) x1: low 130 150.0 healthy ( 0.823 0.131 0.046 ) *
> plot(r)
> text(r, pretty=0)

Part (b)  [4 marks]

Using the pruned tree shown in Part (a), find the predicted class label for the following observation:

  x1 x2 x3 x4 x5   x6  x7  x8
high NA md NA NA norm 2.4 0.4

Part (c)  [4 marks]

Based on the following cross-validation study, how many leaf nodes does the best pruned tree have? Explain your answer.

> r = tree(class ~ ., data=data, mindev=0.001)
> prune.tree(r, method="misclass")
$size
[1] 29 21 17 11  5  3  2  1

$dev
[1]  48  48  50  56  65  74  80 101

$k
[1] -Inf  0.0  0.5  1.0  1.5  4.5  6.0 21.0

$method
[1] "misclass"

attr(,"class")
[1] "prune"         "tree.sequence"

> rowMeans(simplify2array(mclapply(1:20, function(...) cv.tree(r, method="misclass")$dev, mc.cores=20)))
[1]  92.35  92.40  91.65  88.65  85.55  85.80  88.40 105.60

Part (d)  [4 marks]

A Random Forest model has also been built for this data set, with some of the results given below. Can this model be improved if more trees are constructed? Explain your answer.

> (r = randomForest(class ~ ., data=data, ntree=500, mtry=3))

Call:
 randomForest(formula = class ~ ., data = data, ntree = 500, mtry = 3)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of  error rate: 38.24%
Confusion matrix:
        healthy mild severe class.error
healthy     122   11      4     0.10949
mild         28   23     17     0.66176
severe        9   22      2     0.93939
> plot(r, lwd=3, ylim=c(0,1), col=1:4, lty=1)
> legend("topright", leg=colnames(r$err.rate), lwd=3, col=1:4, bg=NA)

Maximum marks: 16
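Part (d) is answered by reading the error-rate traces in the plot; the same information can also be inspected directly from the standard randomForest components, e.g.:

# r$err.rate is a 500 x 4 matrix: the OOB error rate plus one column per
# class, recomputed after each additional tree.  If these curves have
# flattened out well before 500 trees, adding more trees will not help.
tail(r$err.rate)
matplot(r$err.rate, type = "l", lwd = 3, ylim = c(0, 1), col = 1:4, lty = 1)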

7    Part (a)  [4 marks]

The following graph shows a clustering result produced by a linkage method, with the two clusters indicated using two different colors. Out of the complete, single, average and centroid linkage methods, which one was most likely used? Explain your answer.

Part (b)  [4 marks]

Given a matrix x storing the points as its rows, write down the code, based on the hclust function, that can be used to reproduce the clustering result and the graph in Part (a).

Maximum marks: 8
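Since the graph itself is not reproduced here, the linkage method below is only a placeholder; an answer to Part (b) would take roughly this shape:

h <- hclust(dist(x), method = "single")   # replace "single" with the
                                          # method identified in Part (a)
cl <- cutree(h, k = 2)                    # cut the dendrogram into two clusters
plot(x, col = cl)                         # points coloured by cluster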

8    Part (a)  [4 marks]

The classification plot of a support vector machine fitted to a data set is shown below. Is this an over-fitted or under-fitted model? Explain your answer.

> dim(data)
[1] 100   3
> r = svm(class ~ ., data=data, scale=FALSE, kernel="radial", cost=1, gamma=1)
> plot(r, data)

Part (b)  [4 marks]

To obtain a more appropriately fitted support vector machine, how would you adjust the values of the parameters cost and gamma? Explain your answer.

Part (c)  [4 marks]

If the argument kernel="radial" in Part (a) is replaced with kernel="linear", give a rough estimate of the training misclassification rate, based on the points in the plot shown in Part (a). Explain your answer. (There are no duplicated values in the predictor variables.)

Maximum marks: 12
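For part (b), one common way to search for better values of cost and gamma is a cross-validated grid search with e1071's tune(); the grid below is only illustrative:

library(e1071)
# cross-validated grid search over cost and gamma
tuned <- tune(svm, class ~ ., data = data, kernel = "radial",
              ranges = list(cost = 10^(-1:2), gamma = 10^(-3:0)))
tuned$best.parameters    # cost/gamma combination with the lowest CV error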

9    Part (a)  [4 marks]

The neural network shown below is built and trained, with the neuralnet function, from a data set in data frame data, with a response variable class and all remaining variables as predictors. Write some R code to reproduce the neural network and the plot.
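Since the network plot is not reproduced here, the hidden-layer specification below is only a placeholder; an answer would take roughly this shape:

library(neuralnet)
# hidden= must match the architecture shown in the plot;
# c(4, 2) (two hidden layers of 4 and 2 units) is a placeholder only.
r <- neuralnet(class ~ ., data = data, hidden = c(4, 2), linear.output = FALSE)
plot(r)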

Part (b)  [4 marks]

In your own words, explain the meaning of the argument batch_size in the following function call for deep learning, and find out how many times the gradient of the network is updated in each epoch.

> dim(x)
[1] 5000   32   32    3
> history <- model %>% fit(x, y, epochs=30, batch_size=128, validation_split=0.2)
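For reference on the arithmetic: validation_split=0.2 holds out 20% of the 5000 samples, so 4000 samples are used for training; at 128 samples per gradient step that gives ceiling(4000/128) = 32 gradient updates per epoch (the last batch being a partial one).

ceiling(0.8 * 5000 / 128)   # 32 updates per epoch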

Part (c)  [4 marks]

If the value of epochs in Part (b) is increased from 30 to 1000, how would you expect the two loss curves in the following plot to change? Explain your answer.

> plot(history)