关键词 > DataC100/C200

Data C100/C200, Final Spring 2022

发布时间：2023-04-27

Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

Data C100/C200, Final

Spring 2022

1 Coo. . . You Want Pigeon Milk in That? [15 pts]

Brewster, a coffee barista, has opened a new caf called “The Roost.” You are tasked with helping him manage his daily sales. You are provided with a pandas DataFrame roost of today’s cus- tomer transactions, the first 6 rows of which are shown below. Note that customers can only buy one drink at a time.

(a) [3 Pts] “The Roost” sells three drinks, but only two are listed in the roost DataFrame. It seems Brewster one-hot encoded the column representing the drink the customer bought.

However, he accidentally dropped one of the columns. Update roost with the missing col- umn representing the third drink, “Black Coffee with Pigeon Milk.”

After running your code, the first 6 rows of the updated roost DataFrame should be:

roost['Black Coffee with Pigeon Milk'] =

(b) [3 Pts] Suppose we want to build a linear model that predicts the column you just added

(“Black Coffee with Pigeon Milk”):

yˆ = θ 1 x1 + θ2 x2 + θ3 x3

Let features x1 be Price, x2 be Black Coffee, x3 be Iced Tea, and yˆ be your pre- diction of Black Coffee with Pigeon Milk. What would be the optimal parameter estimates θ = [θ1 , θ2 , θ3]T such that the model predicts yˆ exactly equal to the “Black Coffee with Pigeon Milk” column?

⃝ A. [1/2, 1/20, 1/20]T

⃝ B. [1/20, − 1/2, − 1/2]T

⃝ C. [1/2, − 1/2, − 1/2]T

⃝ D. [1/20, 1/20, 1/20]T

⃝ E. [1/2, 1/20, − 1/2]T

⃝ F. [1/10, − 1/2, 1/10]T

⃝ G. [1/2, 1/10, 1/10]T

(c) [4 Pts] Suppose Brewster wants to transform the one-hot encoded data into a new DataFrame that has just three columns: the customer name Customer, drink price Price, and drink name Drink. The resulting DataFrame roost transactions should as follows:

Complete the following skeleton code. (Hint: Documentation for melt is on the following page.)

df = roost .melt([__ (i)__ ])

df = __ (ii)__ .drop(columns=['value'])

df = df .rename(columns={'variable':'Drink'}) roost_transactions = df .reset_index(drop=True)

(i): ________________________________________________________

(ii): ________________________________________________________

Here is a refresher on the Pandas melt method, paraphrased from Project 2A:

(d) [5 Pts] Finally, Brewster needs your help to construct a pivot table from roost transactions that looks similar to the below table. The resulting pivot table should contain a column for each type of drink, where each record now represents a unique customer and how much they spent on each type of item.

roost_transactions .pivot_table(__D__ ) .fillna(0)

Fill in the blank for D:

2 Excellent Purchase! (...Purchase!) [15 pts]

(a) [6 Pts] Isabelle and Tom Nook have opened up competing stores on an island. They’ve each collected data, and they have asked you to help determine who is the better salesperson.

Isabelle and Tom collect data in two separate tables; each table has one record for every resident on the island, as well as how much that resident spent at that store in a particular day. You are provided two tables: one for Isabelle’s store (Isabelle) and one for Tom’s store (Tom). The first five rows of each table are shown below:

工sabelle

Tom

Construct a SQL query below that returns just one row that contains the total number of wins Isabelle got. Someone gets a ‘win’ if they get a resident to spend more than the other person. For example, Apollo spent more with Isabelle than Tom, so that would represent one win for Isabelle. For the first five rows, Isabelle has a total of 3 wins, since Apollo, Maddie, and Ursula spent more money at Isabelle’s store.

Construct this SQL query using the skeleton code below. The result for the first five rows is shown to the right.

SELECT _ (i)_ AS isabelle_wins

FROM __ (ii)__ INNER JOIN Isabelle

_ (iii)_

(i): _______________________________________________________

(ii): _______________________________________________________

(iii): _______________________________________________________

ON tom .resident_id = isabelle .resident_id

AND isabelle .total_spent > tom .total_spent

SELECT SUM(isabelle .total_spent > tom .total_

FROM tom INNER JOIN isabelle

ON isabelle .customer_name = tom .customer_name

(b) [9 Pts] Isabelle and Tom also jointly own an online shopping catalogue. Residents can pur- chase items from the catalogue only if they have a bank account.

Pelly, the island’s accountant, maintains a SQL table Account that contains all owners of bank accounts and the monetary amount in each account. Isabelle and Tom also have a shared SQL table Transaction of attempted transactions from customers through their online catalogue. Both tables are shown below:

Account

Transaction

Construct a SQL query below that returns the

output table to the right, which contains a col-

umn approved that indicates if a customer

has enough money stored in their bank account

to complete all the transactions they are at-

tempting: 1 (True), 0 (False), or NULL/None

(has a bank account, but did not make any

transactions).

• All residents with bank accounts should

be in the output table.

• If a customer attempts to make a transaction but they don’t have a bank account, they should not be included in the output table (e.g., Goldie is not included).

• The input tables have no NULL/None values in them.

Construct this SQL query using the skeleton code below.

SELECT a .resident_id, a .resident_name, _ (i)_ AS approved FROM Account AS a _ (ii)_ JOIN Transaction AS t __ (iii)__

(i): _______________________________________________________

(ii): ⃝ INNER ⃝ LEFT ⃝ RIGHT ⃝ OUTER ⃝ CROSS

(iii): _______________________________________________________

3 Jolly Redd’s Visualization Trawler [8 pts]

You have just bought some visualizations from the shady visualization seller, Jolly Redd. However, there are some flaws in the visualizations he’s sold you.

While we have some clear answers in mind, we will be lenient when grading this question. As long as you correctly identify some flawed aspect of the image and explain your answer thoroughly, you will receive full credit. Answers such as ”there is no flaw” or answers about the underlying data (and not the image itself) will receive no credit. Please keep your answers concise—nothing more than a couple of sentences.

Note that the titles for these plots are given directly above the image—answers such as “bad title” or “missing title” will also receive no credit.

(a) [4 Pts]

List two aspects of this plot that are incorrect or misleading.

(b) [4 Pts] Below, “Console Type” refers to PC, Playstation and Xbox; and “Video Game Genre”

refers to RPGs, Action, and Shooter.

Describe two flaws with this plot.

4 Dr. How Do You Spell That? [8 pts]

You are working with a British researcher, Dr. Who, to file a lab report in the Data 100 Research Journal. Unfortunately, the journal only accepts American versions of words, but Dr. Who’s lab report uses British words. For example “colour” is a British spelling of the word “color” .

(a) [3 Pts] First, determine the output of the re .findall statement below.

>>> sentences = [

"Please analyse the catalysers . ",

"You can't psychoanalyse a lysed protein . ",

]

>>> pattern = r"\w\wyse"

>>> [re .findall(pattern, s) for s in sentences]

. . . [['alyse', 'alyse'], ______X______ ]

What result will be in the blank marked with an X?

⃝ A. []

⃝ B. ['alyse']

⃝ C. ['lyse']

⃝ D. ['alyse', 'lyse']

⃝ E. ['lyse', 'lyse']

(b) [5 Pts] Now you try to identify some British spelling idiosyncrasies in Dr. Who’s writing.

Write a regular expression that will find all words that contain “our”, excluding words with 0 or 1 letters before “our” . In other words, your regular expression should find “colour”, and “favourite”, but should not find “our”, or “dour”, as they have 0 and 1 letter preceding the “our”, respectively. Unlike part (a), you want the results of your findall to be the entire word, not just some of the letters, as shown in the example below.

>>> sentences = [

"Our favourite colour is blue . ",

"I am four hours away from the harbour . ",

"I am enamoured with our tour of the arbour . "

]

>>> pattern = r"__B__ "

>>> [re .findall(pattern, s) for s in sentences]

. . . [['favourite', 'colour'], ['harbour'], ['enamoured', 'arbour']] Fill in the blank for B such that the above code works correctly.

pattern = r"________________________________ "

5 Back to the Future [3 pts]

You successfully sneaked in a survey on KPop groups and a survey on cats vs dogs on this semester’s Data 100 exams! Let’s do a math problem on the result of the survey.

(a) [3 Pts] Recall the definition of a multinomial probability from lecture:

If we are drawing at random with replacement n times, from a population broken into three separate categories (where p1 + p2 + p3 = 1):

• Category 1, with proportion p1 of the individuals.

• Category 2, with proportion p2 of the individuals.

• Category 3, with proportion p3 of the individuals.

Then, the probability of drawing k1 individuals from Category 1, k2 individuals from Category 2, and k3 individuals from Category 3 (where k1 + k2 + k3 = n) is:

p1(k)1 p2(k)2 p3(k)3

From the original results of your survey, you learn that 14% of Data 100 students are BTS fans and 24% of Data 100 students are Blackpink fans and the rest are fans of neither. Suppose you randomly sample with replacement 99 students from the class. What is the probability that the students are evenly distributed between the three different groups? Please leave your answer as an expression; there is no need to fully calculate it out.

(b) [0 Pts] Which of the surveys did you prefer? Select your favorite between the two.

⃝ A. BTS/Blackpink survey from Midterm 1.

⃝ B. Cats vs Dogs Drawing survey from Midterm 2.

⃝ C. Neither.

6 Election Perfection [22 pts]

Ariel has recently joined a data science company in Phosphorus Valley to perform data analysis for their elections. To start off, Ariel’s first task is to create a linear model that performs one task: The model must perfectly predict the winners of the local election that has 2 candidates given their vote share.

Ariel was provided the results of the past 150 elections, a sample of which is shown below. As- sume for this question that the largest vote share wins the election (i.e., strictly more than 50%), and that there are never any ties (i.e., Candidate 0 and 1 will never both get 50% of the vote share).

Candidate 0

Candidate 1

Winner

0.35

0.12

0.51

0.65

0.40

0.65

0.88

0.49

0.35

0.60

Candidate 1 Candidate 1 Candidate 0 Candidate 0 Candidate 1

0.51

0.65

0.40

」

0.49

0.35

0.60

l 1(1) 」

Above, Ariel has constructed a design matrix X, where the first feature represents Candidate 0’s vote share and the second feature represents Candidate 1’s vote share in each election, and an output vector Y, which represents the election winners (1: Candidate 1; 0: Candidate 0). Help Ariel predict the election winners given the design matrix X and your linear modeling toolkit.

(a) [5 Pts] Suppose you limit the training data matrix to two arbitrary training points and con-

struct the 2x2 data matrix X2 and the corresponding election winners Y2 . Fill in the blank: If we train a linear regression model fit to the training data X2 and Y2 , we can _________ achieve zero training loss.

⃝ A. always ⃝ B. sometimes ⃝ C. never

Justify your answer:

(b) [4 Pts] Ariel decides to improve the above model by training on the entire dataset of 150 past elections. However, Ariel decides to use ordinary least squares (OLS) with a a design matrix , which is the X matrix with an added intercept feature, to compute θˆ using the normal equation: θˆ = (T )− 1 T Y. For your convenience, a sample of and Y are shown below.

l 1(1)

= 1

「...

0.35

0.12

0.51

0.65

0.40

...

」

0.49

0.35

0.60

l 1(1) 」

Which of the following are true about this modeling approach?

□ A. T Y is of size (150, 3).

□ B. T Y is a vector.

□ C. Ariel cannot achieve perfect accuracy nor zero loss on the training set using this approach.

□ D. Using L2 regularization with λ > 0 is always a better choice than OLS for the dataset above.

(c) [4 Pts] Ariel decides to switch to a logistic regression model and also reverts to the original 2-feature design matrix X, with no bias term. A sample of X and Y is shown below:

l」 l1(1) 」

0.51 0.49 0

0.65 0.35 0

0.40 0.60 1

□ A. The design matrix is not full column rank.

□ B. Logistic regression outputs the probability that Candidate 1 wins the elec- tion.

□ C. There is no uniquely optimal parameter value θˆfor logistic regression with this design matrix using binary cross-entropy loss.

□ D. Logistic regression would usually be used for this binary classification task.

(d) [5 Pts] Regardless of your previous answer, suppose that Ariel fits a logistic regression model to the data in part (c). Provide an optimal parameter θˆ such that we can achieve the maximum training accuracy possible for this data. What is the maximum possible training accuracy? Justify your answer.

Maximum possible

Optimal θˆ: _______________ training accuracy: ____________

Justification:

(e) [4 Pts] One of Ariel’s friends from Data 100 mentions that decision trees almost always

achieve perfect accuracy, at least on the training dataset! As their final modeling approach, Ariel fits a decision tree to the 2-feature design matrix X and output Y as described in part (c).

What is the height of the optimal decision tree if we use weighted node cross-entropy loss? Justify your answer. Define the ”height” of a decision tree as the maximum number of yes/no questions (a.k.a. splits) that can possibly be asked before arriving at a prediction.

Height of optimal decision tree: _______________

Justification:

7 Error Fn: Broken [21 pts]

Alex has compiled a list of broken loss functions and requests your help to diagnose drawbacks.

For each of the loss functions L below, assume a linear model yˆ = fθ (北) = θT 北. Then, the optimal parameter θˆ is a real-valued vector that minimizes the empirical risk 对 L(yi , yˆi ) with respect to θ on a n-point dataset, where datapoint i has input 北i and output yi .

(a) [3 Pts] Which of the following are true of the following cubic loss function L1 when we minimize the associated empirical risk with respect to θ?

L1 (y, yˆ) = (y − yˆ)3

□ A. There are infinitely many optimal parameters for this loss function.

□ B. Gradient descent on this loss function diverges, i.e. approaches one or more parameters of infinite magnitude.

□ C. This loss function is non-convex.

□ D. This loss function is non-differentiable on at least one point in its domain.

(b) [3 Pts] Which of the following are true of the following negative squared loss function L2 when we minimize the associated empirical risk with respect to θ?

L2 (y, yˆ) = − (y − yˆ)2

□ A. There are infinitely many optimal parameters for this loss function.

□ B. Gradient descent on this loss function diverges, i.e. approaches one or more parameters of infinite magnitude.

□ C. This loss function is non-convex.

□ D. This loss function is non-differentiable on at least one point in its domain.

(c) [3 Pts] Which of the following are true of the following loss function L3 that calculates accu- racy when we minimize the associated empirical risk with respect to θ?

L3 (y, yˆ) =

□ A. There are infinitely many optimal parameters for this loss function.

□ B. Gradient descent on this loss function diverges, i.e. approaches one or more parameters of infinite magnitude.

□ C. This loss function is non-convex.

□ D. This loss function is non-differentiable on at least one point in its domain.

Alex now provides you with the sample dataset shown to the right, with one feature x and response y.

(d) [6 Pts] Suppose we use a simple linear regression model with an intercept term, i.e., yˆ = θ0 + θ 1 x with the loss function L4 (shown right). What is the optimal θˆ = (θˆ0 , θˆ1 ) ∈ R2 that mini- mizes the empirical risk 对 L4 (yi , yˆi ) on the sample dataset? Justify your answer.

x y

36 4

25 9

16 16

L4 (y, yˆ) = |yˆ| − |y|

θˆ0 : ___________ θˆ1 : ___________