ST2195 Programming for Data Science

Sample exam paper

2021-2022

Section A

1. Consider the following objects in Python: i=(2,3), j=[2,3] and k={2,3}. State whether each of the following operations is possible (in Python). Justify your answers in one sentence. There is at least one correct statement, and negative marks apply for wrong choices.

(a) i[1]=4

(b) print(j+1)

(d) j[1]=4

● Marks: 6
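
Statements like those in question 1 can be checked empirically by attempting each operation and recording whether Python raises an exception. The following is a minimal sketch, not a model answer: the helper `attempt` and the three small wrapper functions are hypothetical names introduced only for illustration.

```python
# Objects from the question.
i = (2, 3)   # tuple
j = [2, 3]   # list
k = {2, 3}   # set (defined for completeness; not used by the options shown)


def attempt(label, operation):
    """Run operation() and report whether Python allowed it (hypothetical helper)."""
    try:
        operation()
    except Exception as exc:
        outcome = f"not possible ({type(exc).__name__})"
    else:
        outcome = "possible"
    print(f"{label}: {outcome}")
    return outcome


def assign_into_tuple():
    i[1] = 4        # option (a)


def add_int_to_list():
    print(j + 1)    # option (b)


def assign_into_list():
    j[1] = 4        # option (d)


# Map each option label to the observed outcome.
results = {
    "a": attempt("(a) i[1]=4", assign_into_tuple),
    "b": attempt("(b) print(j+1)", add_int_to_list),
    "d": attempt("(d) j[1]=4", assign_into_list),
}
```

Running the sketch prints one line per option; the exception name it reports is a useful starting point for the one-sentence justification.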

2. In which of the circumstances below do ridgeline plots provide the most appropriate choice? Provide justification for your answer in no more than two sentences.

(a) When we want to study the empirical density of a variable.

(b) When we want to compare frequencies of one variable across different categories of another variable.

(d) When we want to explore the association between two continuous variables

● Marks: 6

3. Which of the statements below is correct? Provide justification for your answer.

(a) When training a machine learning pipeline the main aim is to achieve a high training error.

(b) When training a machine learning pipeline the main aim is to achieve a moderate training error.

(d) When training a machine learning pipeline a low training error may not be the primary aim.

● Marks: 6

4. Which of the following statements are correct? There is at least one correct statement, and negative marks apply for wrong choices.

(a) An IDE is an alternative operating system to Microsoft Windows.

(b) An IDE typically provides a source-code editor

(d) There are only 4 source-code editors for R and 3 for Python.

(e) A source-code editor for Python cannot be used for R

(f) An IDE is necessary for writing code.

● Marks: 6

5. Which of the following statements are correct? There is at least one correct statement, and negative marks apply for wrong choices.

(a) Jupyter notebooks cannot handle Python code.

(b) R Markdown is an authoring framework that combines Markdown with R.

(c) R Markdown files cannot be opened without installing R first.

(d) Jupyter Notebooks are open-source web-browser based applications.

(e) Jupyter notebooks were named after the first names of its creators, Julia and Peter.

(f) R Markdown files can be converted into a variety of formats including HTML, PDF, and Microsoft Word documents.

● Marks: 6

6. State which language (R or Python) each of the following code chunks comes from:

C1. vec = c(1, 4, 7)

C2. paste("Hello", "world")

C3. import numpy

C4. library("mlr")

C5. phrase = "Hello world"; print(phrase.lower())

C6. vec = (1, 4, 7)

C7. list("a", 5, 1:3)

C8. ["a", 5, (1, 2, 3)]

C9.

if (mark >= 50)
    print("pass")

C10.

if mark >= 50:
    print("pass")

C11. plot(1:5, 2:6)

C12. df = pandas.read_csv(fdate + '.csv')

C13. df = read.csv(paste0(fdate, ".csv"))

C14. head(fd)

C15. df.head()

C16. plt.subplots()

C17. ggplot(df, aes(x = x)) + geom_histogram()

C18. write.table(df, file = "df.csv")

C19. apply(df, 2, sum)

C20. df[-c(1, 3, 4), ]

● Marks: 10

Section B

1. For each of the following statements about R, state if they are always correct or not. Provide justification for your answer of no more than two sentences.

A list is also a data frame.

A data frame is also a list.

Data frames can contain lists.

● Marks: 10

2. For each of the following statements about R, state if they are always correct or not. Provide justification for your answer of no more than two sentences.

(a) The rows of a table in a relational database represent records.

(b) An attribute in a relational database is a tuple of rows.

(d) The SQL query

SELECT employee_id, salary, department

FROM employee

WHERE employee_id >= 102 AND salary >= 100

ORDER BY salary

returns all available records and attributes from the table employee that have employee_id greater than or equal to 102 and salary greater than or equal to 100, ordered by increasing salary.

(e) The following R code chunk

inner_join(employee, company, by = "sector") %>%

filter(department == "HR")

finds all records in tables employee and company that have matching values of sector, and returns only those records where department is "HR".

● Marks: 10
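
For statement (d), one way to examine what the query actually returns is to run it against a small table and inspect both the rows and the column names. The sketch below uses Python's built-in sqlite3 module with an in-memory database. It is an illustration under stated assumptions: the extra `name` column and every row of data are made up, since the question does not describe the contents of the employee table.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: the `name` column is an assumption added so the
# table has more attributes than the query selects.
conn.execute(
    "CREATE TABLE employee ("
    "employee_id INTEGER, name TEXT, salary REAL, department TEXT)"
)

# Made-up rows, for illustration only.
rows = [
    (101, "Ann", 150.0, "HR"),
    (102, "Bob", 90.0, "IT"),
    (103, "Cy", 300.0, "HR"),
    (104, "Di", 120.0, "IT"),
]
conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?)", rows)

# The query from statement (d), unchanged.
cursor = conn.execute(
    """
    SELECT employee_id, salary, department
    FROM employee
    WHERE employee_id >= 102 AND salary >= 100
    ORDER BY salary
    """
)

# Column names of the result set, then the rows themselves.
columns = [description[0] for description in cursor.description]
result = cursor.fetchall()

print(columns)
for record in result:
    print(record)

conn.close()
```

Comparing the printed column names with the table's full schema, and the printed rows with the inserted data, shows which attributes and records the query keeps and in what order.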

3. Explain, in no more than 2 sentences, why the following statements are wrong.

(a) Git is a repository hosting service for GitHub.

(b) A Git repository cannot be accessed without an internet connection.

(d) Structured data are stored in a local hard drive, while unstructured data are in the cloud.

(e) CSV files are special instances of XML files.

(f) yz returns y modulo x.

(g) A dictionary in Python is a collection that is ordered, unchangeable and indexed.

(h)

with elements 1, 2, 3.

(i) A data frame in R can only hold factors and numeric variables.

(j) Mutable objects in Python are objects whose value changes depending on the operations performed on them.

(k) ggplot2 is an R system for data wrangling.

● Marks: 10

4. Match the commands C1-C4 with the output in O1-O4.

C1. git status

C2. print(type(2.3))

C3. paste("type", "'float'")

C4. git checkout master

O1. "type 'float'"

O2. <type 'float'>

O3. Switched to branch 'master'

O4. On branch master

● Marks: 10

5. Consider a data set consisting of the following variables on several customers of a bank:

● balance: credit card outstanding balance

● cleared: whether this balance was cleared in time

● student: whether the person is a university student

● income: the income of the person

(a) Describe what graphs you would produce to demonstrate how the variables balance and income affect the likelihood of the person clearing the credit card outstanding balance in time.

(b) Suppose that when you look at frequencies, students tend to clear their outstanding credit card balances in time less often than the rest of the population. But if you focus on people with high outstanding credit card balances, students are more likely to pay in time than the rest of the population. Describe why this could be the case and what graphics you would use to depict it.

● Marks: 10

6. Suppose we are interested in predicting a continuous variable y based on several features X, and we have a dataset with several missing values in the X features. We are comparing two machine learning models, namely ridge regression and random forests. Provide brief answers to the following:

(a) Provide the type of the machine learning model.

(b) Discuss the process of training the machine learning models indicating how the missing values will be handled.

● Marks: 10