
PRELIMINARY EXAM 2022

ST2195

Programming For Data Science

4 March 2022

Section A

1. Define the following objects in Python:

A = [2,3]

B = {'python', 'r', 'python'}

C = ('george', 6, A)

Briefly describe what is happening in each line of the following Python code, also stating the type of data structure involved.

A[1] = 4

print(B)

C[2]

 Marks: 10
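For reference, the objects and statements in Question 1 can be written as a runnable sketch. It assumes the string literals were quoted in the original paper (the quotation marks appear to have been lost), and the variable `third` is a hypothetical name introduced only so that the value of `C[2]` can be shown. The comments are one possible reading, not a model answer.

```python
# Objects from Question 1 (string quotes are an assumption; see the lead-in).
A = [2, 3]                      # list: ordered and mutable
B = {"python", "r", "python"}   # set: unordered, duplicate entries collapse
C = ("george", 6, A)            # tuple: ordered and immutable; its last item refers to A

A[1] = 4       # replaces the second element of the list A
print(B)       # prints the set; the display order is not guaranteed
third = C[2]   # retrieves the third element of the tuple, which is the list A itself
print(third)
```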

2. For each of the circumstances below discuss in no more than two sentences whether violin plots provide an appropriate choice.

(a)  When we want to study the empirical density of a variable.

(b)  When we want to compare frequencies of one variable across different categories of another variable.

(c)  When we want to monitor changes in the distribution of a variable across different categories of another variable.

(d)  When we want to explore the association between two continuous variables.

Marks: 8

3. For each of the statements below state whether it is correct. Also, provide justification for your answer in one sentence.

(a)  In unsupervised learning we aim to minimise the training error.

(b)  For a categorical input with 3 categories, it suffices to produce 2 dummy variables.

(c)  In a classification task with a binary target, categories being ‘negative’ and ‘positive’, the sensitivity is the probability of negative individuals being classified as negative.

(d)  Training a machine learning pipeline could involve the task of handling missing values.

Marks: 8

4. Which of the following statements are correct? There is at least one correct statement, and negative marks apply for wrong choices.

(a)  Jupyter notebooks cannot handle R code.

(b)  R scripts can only be executed with R Markdown.

(c)  Comments in code are useful for humans but not for computers.

(d)  Markdown is a programming language.

(e)  A for loop can be replaced by a while loop to perform the same operation.

(f)  A scatter plot will illustrate if a continuous variable is responsible for changes in another continuous variable.

 Marks: 6

5. Note which language (R or Python) each of the following code chunks is from:

C1. mat = rbind(c(1, 4),c(7,8))

C2.  import numpy

C3. ?lm

C4. b[-1]=1

C5. for (i in 1:4){k=1}

C6.

def triple(x):
    k=3*x
    return k

C7. f2=f^2

C8.  if (mark >= 70){print("first")}

C9.  plot(x,y)

C10. apply(df, 2, mean)

C11. df = read.csv("data.csv")

C12.  df.describe()

C13. Auto = pd.read_csv("automobileBI.csv")

C14.

for i in range(0,K):
    k[i]=f[i+4]

C15. hist(y)

C16.  plt.plot(x,y)

 Marks: 8

Section B

1. Using conditional statements, write a program using informal code (could be either R or Python or just plain words) that takes the list of numbers L=[2,4,3,6,7,11,12], checks if each of them is even or odd, counts how many of them are even, and computes the sum of all even numbers as well as the sum of all odd numbers.

 Marks: 10
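One way the task in Question 1 could be sketched in Python is shown below. The variable names `even_count`, `even_sum` and `odd_sum` are hypothetical and not part of the question. This is an illustrative outline under those assumptions, not a marking scheme.

```python
# The list given in the question.
L = [2, 4, 3, 6, 7, 11, 12]

even_count = 0  # how many numbers are even
even_sum = 0    # running sum of the even numbers
odd_sum = 0     # running sum of the odd numbers

for number in L:
    if number % 2 == 0:
        # Remainder 0 on division by 2 means the number is even.
        even_count += 1
        even_sum += number
    else:
        odd_sum += number

print(even_count, even_sum, odd_sum)  # prints: 4 24 21
```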

2. Describe what the following chunks of code are doing.

(a)

SELECT student_id, mark, department
FROM students
WHERE student_id >= 200 AND average_mark >= 40
ORDER BY average_mark

(b)

inner_join(student, university, by = "country") %>%
  filter(classification == "first")

 Marks: 10

3.  Explain in no more than 2 sentences why the following statements are wrong.

(a)  XML files are usually smaller than JSON files.

(b)  A matrix in R can contain different types of data.

(c)  Python lists and dicts (dictionaries) can contain duplicate values.

(d)  In Python, the command print(A[2,1]), where A is a 2-dimensional NumPy array, will print the first element of the second row.

(e)  mlr3 is an R platform used mainly for data visualisation.

(f)  Histograms and kernel density plots can be used to help us understand the shape of the distribution of categorical variables.

(g)  Side by side boxplots provide information about the association of two continuous variables.

(h)  The command matrix(1:30, nrow = 6) in R will create a matrix with 6 rows and 4 columns.

(i)  A pandas data frame can only hold factors and numeric variables.

(j)  In Python consider a list L that contains only real numbers. We can increase all its elements by 1 using the command L+1.

 Marks: 10

4. Match the commands C1-C5 with the outputs in O1-O5.

C1. print(type('exam'))

C2. print(type(3))

C3. paste("<class", "'str'>")

C4. paste("<class", "'int'>")

C5. K = 5; K

O1. "<class 'str'>"

O2. 5

O3. "<class 'int'>"

O4. <class 'str'>

O5. <class 'int'>

Marks: 10

5. Consider a data set consisting of several insurance claims on automobile injuries that contains the following variables:

•  claim: the amount claimed by the policyholder

•  attorney: whether an attorney was present when the claim was made (1: yes, 0: no)

•  gender: 1: female, 2: male, 3: not disclosed

•  years_driving: age minus the age at which the person obtained their driving licence

(a)  Describe what graphs you would produce to demonstrate how the presence of an attorney, gender and years of experience affect the amounts claimed by the policyholders.

(b)  There is a debate about whether gender information should be included in the procedure for pricing insurance premiums. Suppose you had such data in your possession and were asked to comment on whether gender information has predictive ability for the claims filed. What plots would you use to extract relevant information from these data?

Marks: 10

6. Rewrite the following script (in which f is a numerical vector) by replacing the while loop with a for loop in a way so that the code does exactly the same thing.

count = 0;
fsum = 0;
half_f_sum = 0.5*sum(f);
while (fsum < half_f_sum){
  count = count + 1;
  fsum = fsum + f[count];
}

Marks: 10