
DS 111, Summer 2022

Homework 3

1. A researcher is conducting a study to understand approval of Eric Adams, the mayor of New York City, among NYC residents. To do this, they stand in Times Square on a Saturday afternoon and invite passersby to participate in a brief survey. They ask anyone who stops to talk to them to indicate how supportive they are of the policy changes Mayor Adams has made since taking office. The possible replies are, “very”, “somewhat”, “not really”, “not at all”, and “prefer not to say/no opinion/don’t know.”

(a) How is this researcher conceptualizing “approval” of Mayor Adams?

(b) How is this researcher operationalizing “approval”?

(c) What is one strength of this measure of “approval”? Briefly explain why it’s a strength.

(d) What is one weakness of this measure of “approval”? Briefly explain why it’s a weakness.

(e) What is one possible source of random error in the resulting dataset from this study?

(f) What is one possible source of selection bias in this study?

2. Read the article “Why ‘Anonymous’ Data Sometimes Isn’t” and answer the questions that follow.

(a) Is de-anonymization only a concern when a large number of features are associated with each observation? Briefly explain your answer.

(b) Briefly discuss the trade-offs between privacy and taking an intersectional approach to research.

(c) What are some ethical concerns you might have about conducting research in the digital age? Name at least ONE principle from the Belmont Report and describe EITHER why it is a potential concern OR why it is not a potential concern.

(d) Some data may be more sensitive than others. If a data set is unlikely to result in a serious negative outcome (such as identity theft), do you think it’s okay if a small portion of records can be re-identified? Briefly discuss why or why not.

(e) Some data sets may reflect existing social biases. Briefly discuss the potential harms of such biased data as input to an algorithm. Do you think such data can be successfully “de-biased”?

3. This and all following questions will use a dataset related to the FIFA 2022 videogame. This data was scraped from https://sofifa.com by Stefano Leone. This originating website can serve as a codebook of sorts, providing more information about the specific features included in our dataset. The dataset can be found as a .csv file on Brightspace > Assignments > Homework 3 (where you found this assignment). In this question, we will load and inspect our dataset.

(a) Load the dataset fifa22.csv and display the first 5 rows. It is okay if not all of the columns are visible.

(b) What is the unit of analysis in this dataset?

(d) The “gender” column provides the gender of the player, with binary options of “M” for male and “F” for female. How many male players and how many female players are in this dataset?

(e) This dataset includes all playable characters in the videogame FIFA 2022. Do you think this dataset is representative of the real-world population of professional football/soccer players? Briefly explain why or why not.

(f) One last piece of data cleaning before we get started with analysis. Our dataset includes many missing values, particularly for female players. This means that we don’t want to drop all rows where any value is NaN, as that will disproportionally remove women from our dataset. Instead, we want to be more targeted: drop only rows where the column passing contains a missing (NaN) value. The final dataframe – which you should use for all future questions – should have 17,450 observations. Show your code for dropping these rows and display the shape of the final dataset.
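
The difference between dropping on any missing value and dropping on one column can be seen on a small example. The sketch below uses an invented four-row dataframe (the column "wage" and all values are assumptions for illustration, not taken from the assignment data) and pandas' dropna with its subset argument.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; every value here is invented.
toy = pd.DataFrame({
    "gender": ["M", "F", "M", "F"],
    "passing": [70.0, np.nan, 65.0, 80.0],
    "wage": [np.nan, np.nan, 1000.0, np.nan],
})

# Dropping rows with ANY missing value removes more rows than intended.
dropped_any = toy.dropna()

# Targeted drop: only rows where "passing" is missing are removed.
dropped_passing = toy.dropna(subset=["passing"])

print(dropped_any.shape)
print(dropped_passing.shape)
print(dropped_passing["gender"].value_counts())
```

In this toy case the blanket drop removes both rows marked "F", while the targeted drop keeps one of them.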

4. In this question we’ll evaluate associations and correlations between variables.

(a) Display a scatterplot with passing shown along the horizontal axis and rank shown on the vertical axis.

(b) Based only on looking at the scatterplot, comment on the (i) direction, (ii) strength, and (iii) linearity of the association between passing and rank.

(d) Notice that the correlation between passing and rank is the same as the correlation between rank and passing. What is the specific term from lecture to describe this feature of correlation?

(e) Consider the correlation between skill and rank. Is it stronger or weaker than the correlation between passing and rank?

(f) The fact that we are able to compare correlations between variables that measure different things is another feature of correlation. What is the specific term from lecture for this feature?
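
The two features of correlation referred to in (d) and (f) can be checked numerically. This is a minimal sketch on randomly generated data (the series x and y are invented, not columns from the assignment data), using pandas' Series.corr, which computes the Pearson correlation by default.

```python
import numpy as np
import pandas as pd

# Invented data: y depends linearly on x plus noise.
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=200))
y = 2.0 * x + pd.Series(rng.normal(size=200))

# Correlation computed in both orders.
r_xy = x.corr(y)
r_yx = y.corr(x)

# Correlation after changing the units of x (multiply and shift).
r_rescaled = (x * 100 + 5).corr(y)

print(r_xy, r_yx, r_rescaled)
```

All three printed values should agree up to floating-point rounding.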

5. In this question, we will perform multiple linear regression using the statsmodels package.

(a) Use the statsmodels package to estimate a multiple regression evaluating the effect on rank of four features: passing, attacking, defending, and skill. Display the output.

(b) How much of the variance in rank is explained by our features?

(d) Holding passing, attacking, and skill constant, a 1-unit increase in “defending” is associated with what kind of change in ranking?

6. Now that we’ve gotten to know our data a little bit, we will use SciKit Learn and a test/train split to see how well our model – using the same DV and IVs as Q5 – can predict a player’s rank.

(a) Based on the statsmodels output from Q5, do you expect that these four features (passing, attacking, defending, and skill) will do a pretty good or pretty bad job at predicting rank for out-of-sample data? Briefly explain why or why not.

(b) Create an X dataframe with just four features: passing, attacking, defending, and skill. Create a Y dataframe (or series) with just the “rank” variable. Display the first five rows of each.

(c) Create a test/train split where 25% of the data is held out for testing. Use a random seed of 123 (i.e., set the random state to this value). To show your code has worked, display the first 5 rows of the X training data.

(d) Use SKLearn to train a linear regression using only the training data. Display the intercept and coefficients for your trained model (coefficients do not need to be labeled).

(e) Compare the coefficients estimated by both regression models. How does the coefficient for “attacking” change (if at all) when it is estimated in Q5 (using statsmodels and the full dataset) vs when it is estimated in Q6 (using SKLearn and just training data)?

(f) Use your trained SKLearn regression model to predict rank values for the hold-out set of X test data. Display at least the first three predicted values (in a format of your choice).

(g) Display a scatterplot in which the horizontal axis shows the actual value of the Y test data and the vertical axis displays the predicted Y values for the X test data.

(h) Calculate and display the Root Mean Squared Error (RMSE) for this model. Provide a brief interpretation of what this means in terms of the “average error” of the model.

(i) Reflecting on any of the analyses conducted above, do you feel that this model does a good job or a bad job of predicting player rank?
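
The split, fit, predict, and RMSE steps above can be sketched end to end on invented data. The feature names "a" and "b" and the generated target are assumptions for illustration; only the 25% hold-out and the random state of 123 come from the question.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Invented data: y = 3 + 2a - 1b + noise with standard deviation 1.
rng = np.random.default_rng(2)
n = 400
X = pd.DataFrame({"a": rng.normal(size=n), "b": rng.normal(size=n)})
y = 3.0 + 2.0 * X["a"] - 1.0 * X["b"] + rng.normal(size=n)

# Hold out 25% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=123
)

# Fit on the training rows only.
reg = LinearRegression().fit(X_train, y_train)
print(reg.intercept_, reg.coef_)

# Predict for the held-out rows and compute RMSE in the units of y.
y_pred = reg.predict(X_test)
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
print(rmse)
```

Because the noise in this toy data has a standard deviation of 1, an RMSE near 1 is about the best any model could do here.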

7. In this next question, we’ll use KNN to try to classify players’ “preferred foot.”

(a) First, let’s get a better sense of the balance of classes in our data (e.g., how many observations of each class we have). Display the count of each value present in the preferred foot column.

(b) If we were to build a classifier which always guessed a player preferred their right foot, what percentage of the time would we make a correct classification? In other words, what percent of players actually do prefer their right foot?

(c) Let’s build a classifier using 10 available dimensions: shooting, passing, dribbling, defending, attacking, skill, movement, power, mentality, and goalkeeping. Create an X dataframe with just these 10 columns and display the first 5 rows.

(d) Now, rescale (or normalize) this X data so that each IV has a mean of 0 and a standard deviation of 1. Display (at least) the first three rows of normalized data. (Note: we’ll use this rescaled data again in Q8a.)

(e) We’ll want to be able to see how well our classifier performs out of sample, so now create a test-train split, setting Y to be the “preferred foot” column of the dataframe. Here, use 30% of the data for testing and set the random state to 456. Display (at least) the first 3 rows of X training data.

(f) Next, we’ll want to determine the number of neighbors k to consider for our KNN classifier. For values of k from 1 to 30 (inclusive), calculate the error of a KNN classifier. Display your results by creating a plot with considered k values along the horizontal axis and the corresponding error displayed along the vertical axis.

(g) Based on your analysis, choose a reasonable value of k. Train a KNN classifier that considers this number of neighbors and predict Y values (preferred foot) for your out of sample test data. Display (at least) the first 3 predictions for “preferred foot.”

(h) Use the actual and predicted Y values to calculate and display the confusion matrix for your model. This will display without labels, but will show the classes in alphabetical order (Left, Right; upper left corner is “Left-Left”). As with the examples in lecture, the rows will indicate the actual values and the columns will indicate the predicted values. Approximately how many players who actually prefer their left foot (“True Lefts”) were predicted to prefer their right foot?

(i) Use the actual and predicted Y values to display the full classification report. What does the recall for the classification “Left” suggest about our model?

(j) Reflecting on the analysis above, do you feel like this model does a good job or a bad job of predicting a player’s preferred foot?
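
The scale, split, scan-over-k, and evaluate sequence from this question can be sketched on invented, imbalanced two-class data. The two unnamed features, the class sizes, and the rule of picking the k with the lowest test error are all assumptions made to keep the example short; only the 30% test share, the random state of 456, and the range of k come from the question.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Invented data: 75 "Left" and 225 "Right" points that overlap heavily.
rng = np.random.default_rng(3)
n_left, n_right = 75, 225
X = np.vstack([
    rng.normal(loc=1.0, scale=1.0, size=(n_left, 2)),
    rng.normal(loc=0.0, scale=1.0, size=(n_right, 2)),
])
y = np.array(["Left"] * n_left + ["Right"] * n_right)

# Rescale each feature to mean 0 and standard deviation 1, then split.
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.30, random_state=456
)

# Error (1 - accuracy) on the test rows for k = 1..30.
ks = list(range(1, 31))
errors = []
for k in ks:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    errors.append(1.0 - knn.score(X_test, y_test))

# Simplification for this sketch: take the k with the lowest error.
best_k = ks[int(np.argmin(errors))]
knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=["Left", "Right"])
print(cm)
print(classification_report(y_test, y_pred, zero_division=0))
```

With classes this imbalanced, overall accuracy can look acceptable even when recall for the smaller class is poor, which is why the confusion matrix and classification report are worth reading alongside the error curve.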

8. Finally, we will examine natural clusters in our data using K-Means. As our X data, we’ll use the same 10 scaled features that we used for KNN (from Q7d).

(a) K-Means is a very computationally intensive algorithm; if we try to run it on our full dataset, it will kill our kernel. So, let’s start by doing some additional pre-processing in order to run our analysis on only a sample of the data. Take the scaled X values that you calculated in Q7d and convert them to a dataframe. Display the first 5 rows.

(b) Now, randomly sample the rows of this dataframe using pandas’ .sample() function. Select n = 5000 rows and set the random state to 2022. Save the sampled rows as a new dataframe and display the first 5 rows.

(c) Next, calculate and save the error (inertia) and the silhouette score for possible values of k between 2 and 20 (inclusive). Set the random state to 23 for every value of k considered. Display the resulting list of error values. Note: Be sure to run this on just the sample of 5000 observations you created for Q8b. It will not successfully run if you use the whole dataset.

(d) Plot the error (inertia) for each value of k. Your plot should show considered values of k along the horizontal axis and corresponding error/inertia along the vertical axis.

(e) Use the kneed package to estimate the bend or “elbow” in the curve. Display the suggested elbow value.

(f) Plot the Silhouette Score for each value of k. Your plot should show considered values of k along the horizontal axis and corresponding Silhouette Scores along the vertical axis.

(g) Based on the above analysis, choose a reasonable value of k and run KMeans using the scaled X data, sampled to 5k observations (i.e., from Q8b). Add a column to your X dataframe indicating the cluster assignment/label. Display the ﬁrst 5 rows of this dataframe.

(h) Now, create and display a plot which shows a player’s attacking score on the horizontal axis and their defending score on the vertical axis. Color each observation based on its assigned cluster.

(i) Reflecting on the above analysis, do you think clustering is a meaningful technique for this data? What further analyses would you be interested in running? (You don’t actually have to run anything else!)
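
The sampling, inertia, and silhouette steps in this question can be sketched on invented data with three well-separated groups. The column names "f1" and "f2", the sample size of 300, the smaller range of k, the n_init value, and the choice of k by highest silhouette score are assumptions for the example; the random states of 2022 and 23 follow the question. The elbow estimate from (e) is left out of this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Invented data: three clouds of 200 points around fixed centers.
rng = np.random.default_rng(4)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(size=(200, 2)) for c in centers])
full = pd.DataFrame(points, columns=["f1", "f2"])

# Work on a random sample of rows rather than the full dataframe.
sample = full.sample(n=300, random_state=2022)

# Inertia and silhouette score for each candidate k.
ks = list(range(2, 9))
inertias, silhouettes = [], []
for k in ks:
    km = KMeans(n_clusters=k, random_state=23, n_init=10).fit(sample)
    inertias.append(km.inertia_)
    silhouettes.append(silhouette_score(sample, km.labels_))

print(inertias)
print(silhouettes)

# Simplification for this sketch: take the k with the highest silhouette.
best_k = ks[int(np.argmax(silhouettes))]
final = KMeans(n_clusters=best_k, random_state=23, n_init=10).fit(sample)

# Attach the cluster label to a copy of the sampled rows.
clustered = sample.copy()
clustered["cluster"] = final.labels_
print(clustered.head())
```

On real data the inertia curve and the silhouette scores often disagree or show no sharp bend, so the choice of k usually needs a judgment call rather than a single rule.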