
STATS 769 Advanced Data Science Practice (Exam)

SEMESTER TWO 2022

Campus: City, Offshore Online

STATISTICS

Advanced Data Science Practice

Time Allowed: 2 hours, plus 30 minutes additional time

NOTES

• Functions are often used in questions without specifying the packages they belong to. This simply means that they are the same ones as used in lectures and labs.

• Attempt all questions in Part A and Part B.

• The total marks for this examination are 100.

By submitting this assessment, I agree to the following declaration:

As a member of the University’s student body, I will complete this assessment with academic

integrity and in a fair, honest, responsible, and trustworthy manner. This means that:

• I will not seek out any unauthorised help in completing this assessment. Unauthorised help includes, but is not limited to, asking another person, friend, family member, third party, tutorial, search function or answer service, whether in person or online.

• I will not discuss or share the content of the assessment with anyone else in any form during the assessment period, including but not limited to, using a messaging service, communication channel or discussion forum, Canvas, Piazza, Chegg, third party website, Facebook, Twitter, Discord, social media or any other channel within the assessment period.

• I will not reproduce and/or share the content of this assessment in any domain or in any form where it may be accessed by a third party.

• I will not share my answers or thoughts regarding this assessment in any domain or in any form within the assessment period.

• I am aware the University of Auckland may use Turnitin or any other plagiarism detecting methods to check my content.

• I declare that this assessment is my own work, except where acknowledged appropriately (e.g., use of referencing).

• I declare that this work has not been submitted for academic credit in this or another University of Auckland course, or elsewhere.

I understand the University expects all students to complete coursework with integrity and honesty. I promise to complete all online assessments with the same academic integrity standards and values.

Any identified form of poor academic practice or academic misconduct will be followed up and may

result in disciplinary action.

I confirm that by completing this exam I agree to the above statements in full.

Questions 1-4 are for Part A of the course.

Questions 5-9 are for Part B of the course -- data mining.

For Part B, functions are often used in questions without specifying the packages they belong to. This simply means that they are the same ones as used in lectures and labs.

1   You are in a Unix shell showing the following output:

$ ls -l
total 0
drwx------ 2 usr01 datasci 142 Sep 23 11:27 data
drwxrwxr-x 2 usr01 datasci  34 Sep 23 11:27 scripts
$ ls -l data
total 1088620
-rw------- 1 usr01 datasci 501039472 Sep 23 11:25 accounting-2021-12.csv
-rw------- 1 usr01 datasci 127162942 Sep 23 11:24 accounting-2022-01.csv
-rw------- 1 usr01 datasci 486518821 Sep 23 11:25 accounting-2022-02.csv
-rw------- 1 usr01 datasci        90 Sep 23 11:25 accounting-notes.txt
-rw-rw---- 1 usr01 datasci      9738 Sep 23 11:27 venues.csv
$ ls -l scripts
total 8
-rw-rw-rw- 1 usr01 datasci 240 Sep 23 11:26 code.R
-rwxrwxrwx 1 usr01 datasci  25 Sep 23 11:26 run.sh

Your Unix username is usr01 and you are a member of the datasci group. Write one or more shell commands that ensure all of the following for your colleagues, who are also members of the datasci group:

• they are able to read the accounting-YYYY-MM.csv datasets, where YYYY-MM is a year and month, but not any other files

• they cannot write to any of the above files or directories

In addition, any other users outside of the datasci group should not be able to read, write or access any of the above files or directories.

Maximum marks: 8
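For orientation, here is a sketch of one possible set of commands that would satisfy these requirements (other combinations of modes are equally valid):

# data/: group members may enter and list the directory, but not modify it
chmod 750 data
# monthly datasets: group-readable, writable only by the owner
chmod 640 data/accounting-????-??.csv
# all other files under data/: owner-only access
chmod 600 data/accounting-notes.txt data/venues.csv
# scripts/: remove the group's write access and all other users' access
chmod 700 scripts
chmod 600 scripts/code.R
chmod 700 scripts/run.sh    # keep the owner's execute bit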

2    Let us assume you have a directory with many CSV files in the form YYYY.csv where YYYY is a year:

$ ls -l | head -n5
total 11747328
-rw-r--r-- 1 surb939 surb939 127162942 Jul 19 13:33 1987.csv
-rw-r--r-- 1 surb939 surb939 501039472 Jul 19 13:33 1988.csv
-rw-r--r-- 1 surb939 surb939 486518821 Jul 19 13:33 1989.csv
-rw-r--r-- 1 surb939 surb939 509194687 Jul 19 13:33 1990.csv

Here are the first few lines of such a CSV file (they make up the Flights table from the Airline data used in the lecture):

$ head 1987.csv | head -n5
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
1987,10,14,3,741,730,912,849,PS,1451,,91,79,,23,11,SAN,SFO,447,,,0,,0,,,,,
1987,10,15,4,729,730,903,849,PS,1451,,94,79,,14,-1,SAN,SFO,447,,,0,,0,,,,,
1987,10,17,6,741,730,918,849,PS,1451,,97,79,,29,11,SAN,SFO,447,,,0,,0,,,,,
1987,10,18,7,729,730,847,849,PS,1451,,78,79,,-2,-1,SAN,SFO,447,,,0,,0,,,,,

Write a shell script which uses GNU parallel with 6 parallel jobs to extract all rows from the CSV files where the Origin or Dest column has the value SFO into a file all-SFO.csv. Note that the resulting file is expected to be a valid CSV file (including exactly one header line). You can assume that all the CSV files have the same column structure as shown above, but you cannot assume that the string "SFO" will occur only in the desired columns.

(If you wish, you can use multiple shell scripts. If you do, separate them using comments showing the file names ending with .sh.)

Maximum marks: 12
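For orientation, a sketch of one possible answer (it assumes simple CSV with no embedded commas; counting the header above, Origin and Dest are columns 17 and 18):

# filter-one.sh
#!/bin/sh
# Print the rows of one CSV file (skipping its header line) where the
# Origin (column 17) or Dest (column 18) field equals SFO.
tail -n +2 "$1" | awk -F, '$17 == "SFO" || $18 == "SFO"'

# main.sh
#!/bin/sh
# Exactly one header line, taken from any one of the files...
head -n 1 1987.csv > all-SFO.csv
# ...then filter all YYYY.csv files, 6 jobs at a time, appending the matches.
ls [0-9][0-9][0-9][0-9].csv | parallel -j6 sh filter-one.sh {} >> all-SFO.csv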

3    Let us consider a subset of the NYC Taxi dataset used in the labs and lecture which just contains the pick-up locations of taxi trips, consisting of two columns: longitude and latitude:

$ hadoop fs -cat /data/nyctaxi/pickup/2010/2010-01.txt | head -n5
-73.992818,40.753273
-74.015895,40.711385
-73.960341,40.779052
-73.97448,40.793495
-73.980407,40.761308

The goal is to use Hadoop Map/Reduce to count how many trips start in each ZIP code (i.e., postal code) using R. You can assume that there is a function latlon2zip(lat, lon) which takes the geographical coordinates of one or more locations and returns a character vector of the same length as the number of locations with the resulting ZIP codes:

> d = read.table(
    pipe("hadoop fs -cat /data/nyctaxi/pickup/2010/2010-01.txt | head"), FALSE, ",")
> latlon2zip(d[[2]], d[[1]])
[1] "10018" "10280" "10028" "10025" "10019" "10128" "10065" "10036" "10019"

Define the variables map and reduce in R such that the following code works and computes the total number of points in each ZIP code efficiently. Explain each step of your code in comments (including what the keys and values are).

library(hmapred)
library(iotools)
r <- hmr(hinput("/data/nyctaxi/pickup/2010",
                formatter=function(r) dstrsplit(r, list(lon=1, lat=1), sep=",")),
         aux=list(latlon2zip=latlon2zip), map=map, reduce=reduce, reducers=6)

Maximum marks: 10
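A sketch of the kind of answer expected. It assumes, as in the labs, that the mapper receives the formatter's data frame and that the first column of the returned data frame is treated as the key, the remaining column as the value:

# map: runs on each chunk of input.
# keys = ZIP codes (from latlon2zip); values = partial counts per chunk.
# Pre-aggregating with table() inside the mapper keeps the amount of data
# shuffled to the 6 reducers small, which is what makes this efficient.
map <- function(x) {
  t <- table(latlon2zip(x$lat, x$lon))
  data.frame(zip = names(t), count = as.integer(t))
}

# reduce: all partial counts for a given key (ZIP code) arrive at the
# same reducer; summing them gives the total number of trips per ZIP.
reduce <- function(x) {
  aggregate(count ~ zip, data = x, FUN = sum)
}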

4    Can functions passed to mclapply() use any packages and variables present in the workspace before mclapply() was called? (Only if mc.cores=1; Only if the variables are exported; No; Yes, always)

You have a large CSV file consisting of just three numeric columns (real numbers). Let's say N is the number of lines in the file and S is the number of bytes in the file. What is the best estimate of the memory necessary to store the content of the CSV file as a data frame in R?

Which of the following values is reported by system.time() and is the closest to the time the user has to wait for the code to finish? (user, system, elapsed, wall, cpu)

Given an XML file which starts like this:

<?xml version="1.0" encoding="UTF-8"?>
<data>
  <country code="NZ" name="New Zealand">
    <value variable="population" year="1972">2903900</value>

which of the following XPath queries will select the year attribute of the value node when applied to this document? (country/value/@year, data/country/value/@year, //value@year, //country/value[year])

Maximum marks: 10
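As a point of reference for the memory question: R stores a numeric (double) value in 8 bytes, so three numeric columns of N rows need roughly 3 x 8 x N = 24N bytes as a data frame, essentially independent of the file size S. A quick sanity check:

# about 24 bytes per row for three double columns (plus small overheads)
d <- data.frame(a = runif(1e6), b = runif(1e6), c = runif(1e6))
object.size(d)    # roughly 24 MB for N = 1e6 rows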

5    Part (a)  [4 marks]

The naiveBayes function is applied to a data set. Based on the output below, find the predicted class label for observation x = 4.7. (You may want to use the dnorm function in your calculation.)

> naiveBayes(class ~ x, data=data)

Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
Y
  A   B
0.4 0.6

Conditional probabilities:
   x
Y     [,1]    [,2]
  A 4.2812 0.48021
  B 5.0500 0.55390
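The hint in part (a) can be applied directly: score each class by its a-priori probability times the class-conditional normal density at x = 4.7, where the [,1] column above is the class mean and [,2] the standard deviation; the class with the larger score is the prediction:

0.4 * dnorm(4.7, mean = 4.2812, sd = 0.48021)   # unnormalised score for A
0.6 * dnorm(4.7, mean = 5.0500, sd = 0.55390)   # unnormalised score for B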

Part (b)  [4 marks]

The lda function is applied to the same data set used in Part (a). Based on the output below, what is the predicted class label for a new observation x = 6? Explain your answer.

> predict(lda(class ~ x, data=data), data.frame(x = 5:4))$posterior
        A       B
1 0.20833 0.79167
2 0.80923 0.19077

Part (c)  [4 marks]

By making use of the information provided below,

> xm = as.matrix(data$x)
> table(data$class, knn(train=xm, test=xm, cl=data$class, k=10))

     A  B
  A 19 13
  B  9 39

write down the output of the following R code:

> table(data$class, knn(train=xm, test=xm, cl=data$class, k=80))

Maximum marks: 12
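For part (c) it may help to notice that the confusion table implies 32 + 48 = 80 training observations in total. A small synthetic illustration (not the exam's data) of how knn() behaves when k equals the size of the training set: every point then has the same neighbourhood, namely the whole data set, so the majority class is predicted throughout.

library(class)
set.seed(1)
xm <- matrix(rnorm(80), ncol = 1)
cl <- factor(rep(c("A", "B"), c(32, 48)))
# k = 80 = number of training points: all points share one neighbourhood
table(cl, knn(train = xm, test = xm, cl = cl, k = 80))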

6    Part (a)  [4 marks]

Find the four most important splits in the following pruned tree.

> (r = prune.tree(tree(class ~ ., data=data, mindev=0.001), method="misclass", best=6))
node), split, n, deviance, yval, (yprob)
      * denotes terminal node

 1) root 238 450.0 healthy ( 0.576 0.286 0.139 )
   2) x1: high,middle 108 230.0 mild ( 0.278 0.472 0.250 )
     4) x8 < 0.5 50 97.0 healthy ( 0.500 0.380 0.120 )
       8) x7 < 2.55 45 75.0 healthy ( 0.556 0.400 0.044 )
        16) x5 < 23.5 9 15.0 mild ( 0.111 0.667 0.222 ) *
        17) x5 > 23.5 36 46.0 healthy ( 0.667 0.333 0.000 )
          34) x1: high 31 41.0 healthy ( 0.613 0.387 0.000 )
            68) x5 < 30.5 9 11.0 mild ( 0.333 0.667 0.000 ) *
            69) x5 > 30.5 22 26.0 healthy ( 0.727 0.273 0.000 ) *
          35) x1: middle 5 0.0 healthy ( 1.000 0.000 0.000 ) *
       9) x7 > 2.55 5 5.0 severe ( 0.000 0.200 0.800 ) *
     5) x8 > 0.5 58 110.0 mild ( 0.086 0.552 0.362 )
      10) x7 < 0.3 10 22.0 healthy ( 0.400 0.300 0.300 )
        20) x5 < 33 5 6.7 healthy ( 0.600 0.000 0.400 ) *
        21) x5 > 33 5 9.5 mild ( 0.200 0.600 0.200 ) *
      11) x7 > 0.3 48 72.0 mild ( 0.021 0.604 0.375 )
        22) x5 < 26.5 26 30.0 mild ( 0.000 0.731 0.269 ) *
        23) x5 > 26.5 22 37.0 severe ( 0.045 0.455 0.500 )
          46) x6: level 15 20.0 severe ( 0.000 0.400 0.600 ) *
          47) x6: norm 7 13.0 mild ( 0.143 0.571 0.286 ) *
   3) x1: low 130 150.0 healthy ( 0.823 0.131 0.046 ) *
> plot(r)
> text(r, pretty=0)

Part (b)  [4 marks]

Using the pruned tree shown in Part (a), find the predicted class label for the following observation:

  x1 x2 x3 x4 x5   x6  x7  x8
high NA md NA NA norm 2.4 0.4

Part (c)  [4 marks]

Based on the following cross-validation study, how many leaf nodes does the best pruned tree have? Explain your answer.

> r = tree(class ~ ., data=data, mindev=0.001)
> prune.tree(r, method="misclass")
$size
[1] 29 21 17 11  5  3  2  1

$dev
[1]  48  48  50  56  65  74  80 101

$k
[1] -Inf  0.0  0.5  1.0  1.5  4.5  6.0 21.0

$method
[1] "misclass"

attr(,"class")
[1] "prune"         "tree.sequence"

> rowMeans(simplify2array(mclapply(1:20, function(...) cv.tree(r, method="misclass")$dev, mc.cores=20)))
[1]  92.35  92.40  91.65  88.65  85.55  85.80  88.40 105.60

Part (d)  [4 marks]

A Random Forest model has also been built for this data set, with some of the results given below. Can this model be improved if more trees are constructed? Explain your answer.

> (r = randomForest(class ~ ., data=data, ntree=500, mtry=3))

Call:
 randomForest(formula = class ~ ., data = data, ntree = 500, mtry = 3)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of  error rate: 38.24%
Confusion matrix:
        healthy mild severe class.error
healthy     122   11      4     0.10949
mild         28   23     17     0.66176
severe        9   22      2     0.93939
> plot(r, lwd=3, ylim=c(0,1), col=1:4, lty=1)
> legend("topright", leg=colnames(r$err.rate), lwd=3, col=1:4, bg=NA)

Maximum marks: 16
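Part (d) is answered by reading the error-rate traces in the plot; the same information can also be inspected directly from the standard randomForest components, e.g.:

# r$err.rate is a 500 x 4 matrix: the OOB error rate plus one column per
# class, recomputed after each additional tree.  If these curves have
# flattened out well before 500 trees, adding more trees will not help.
tail(r$err.rate)
matplot(r$err.rate, type = "l", lwd = 3, ylim = c(0, 1), col = 1:4, lty = 1)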

7    Part (a)  [4 marks]

The following graph shows a clustering result produced by a linkage method, with the two clusters indicated using two different colors. Out of the complete, single, average and centroid linkage methods, which one was most likely used? Explain your answer.

Part (b)  [4 marks]

Given a matrix x storing the points as its rows, write down the code, based on the hclust function, that can be used to reproduce the clustering result and the graph in Part (a).

Maximum marks: 8
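Since the graph itself is not reproduced here, the linkage method below is only a placeholder; an answer to Part (b) would take roughly this shape:

h <- hclust(dist(x), method = "single")   # replace "single" with the
                                          # method identified in Part (a)
cl <- cutree(h, k = 2)                    # cut the dendrogram into two clusters
plot(x, col = cl)                         # points coloured by cluster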

8    Part (a)  [4 marks]

The classification plot of a support vector machine fitted to a data set is shown below. Is this an over-fitted or under-fitted model? Explain your answer.

> dim(data)
[1] 100   3
> r = svm(class ~ ., data=data, scale=FALSE, kernel="radial", cost=1, gamma=1)
> plot(r, data)

Part (b)  [4 marks]

To obtain a more appropriately fitted support vector machine, how would you adjust the values of the parameters cost and gamma? Explain your answer.

Part (c)  [4 marks]

If the argument kernel="radial" in Part (a) is replaced with kernel="linear", give a rough estimate of the training misclassification rate, based on the points in the plot shown in Part (a). Explain your answer. (There are no duplicated values in the predictor variables.)

Maximum marks: 12
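For part (b), one common way to search for better values of cost and gamma is a cross-validated grid search with e1071's tune(); the grid below is only illustrative:

library(e1071)
# cross-validated grid search over cost and gamma
tuned <- tune(svm, class ~ ., data = data, kernel = "radial",
              ranges = list(cost = 10^(-1:2), gamma = 10^(-3:0)))
tuned$best.parameters    # cost/gamma combination with the lowest CV error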

9    Part (a)  [4 marks]

The neural network shown below is built and trained, with the neuralnet function, from a data set in data frame data, with a response variable class and all remaining variables as predictors. Write some R code to reproduce the neural network and the plot.
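Since the network plot is not reproduced here, the hidden-layer specification below is only a placeholder; an answer would take roughly this shape:

library(neuralnet)
# hidden= must match the architecture shown in the plot;
# c(4, 2) (two hidden layers of 4 and 2 units) is a placeholder only.
r <- neuralnet(class ~ ., data = data, hidden = c(4, 2), linear.output = FALSE)
plot(r)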

Part (b)  [4 marks]

In your own words, explain the meaning of the argument batch_size in the following function call for deep learning, and find out how many times the gradient of the network is updated in each epoch.

> dim(x)
[1] 5000   32   32    3
> history <- model %>% fit(x, y, epochs=30, batch_size=128, validation_split=0.2)
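For reference on the arithmetic: validation_split=0.2 holds out 20% of the 5000 samples, so 4000 samples are used for training; at 128 samples per gradient step that gives ceiling(4000/128) = 32 gradient updates per epoch (the last batch being a partial one).

ceiling(0.8 * 5000 / 128)   # 32 updates per epoch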

Part (c)  [4 marks]

If the value of epochs in Part (b) is increased from 30 to 1000, how would you expect the two loss curves in the following plot to change? Explain your answer.

> plot(history)