Data Visualization and Wrangling Project

Intro to Data Science – Fall 2021


Purpose: Use your data visualization and wrangling skills to explore the dataset.


Background: The Washington Post compiles a database of every fatal shooting in the United States by a police officer in the line of duty since January 1, 2015. The Post is tracking more than a dozen details about each killing — including the race of the deceased, the circumstances of the shooting, and whether the person was armed — by culling local news reports and monitoring independent databases, such as Killed by Police and Fatal Encounters. In some cases, The Post conducted additional reporting. The Post is documenting only shootings in which a police officer, while on duty, shot and killed a civilian — circumstances that most closely parallel the 2014 killing of Michael Brown in Ferguson, Mo. The Post is not tracking deaths of people in custody, fatal shootings by off-duty officers, or deaths in which police gunfire did not kill the individual. The FBI and the Centers for Disease Control and Prevention log fatal shootings by police, but officials acknowledge that their data is incomplete.


Data Set: The data can be found at https://raw.githubusercontent.com/washingtonpost/data-police-shootings/master/fatal-police-shootings-data.csv. I recommend you download it and then read it into R using the read_csv() command. The following variables can be removed from the data and should not be used in any analyses below:
id
manner_of_death
threat_level
longitude
latitude, and
is_geocoding_exact

(If you do an analysis that goes above and beyond the regular assignment, you are welcome to use these as necessary.)
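A minimal sketch of the read-and-drop step, assuming the tidyverse is installed. The two rows written to the temporary file are invented stand-ins for the downloaded file, and the header is an assumed subset of the real one, so the snippet runs without a network connection. In the project, the path of the downloaded file would be passed to read_csv() instead of toy_file (a hypothetical name).

```r
library(tidyverse)

# Write a tiny made-up CSV to a temporary file (stand-in for the downloaded data).
toy_file <- tempfile(fileext = ".csv")
writeLines(c(
  "id,name,date,manner_of_death,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera,longitude,latitude,is_geocoding_exact",
  "1,Person A,2015-01-02,shot,53,M,A,Town A,WA,TRUE,attack,Not fleeing,FALSE,-123.1,47.2,TRUE",
  "2,Person B,2015-01-03,shot,,F,W,Town B,OR,FALSE,other,Car,TRUE,-122.9,45.5,TRUE"
), toy_file)

# Read the file and drop the six variables that should not be used.
shootings <- read_csv(toy_file) %>%
  select(-c(id, manner_of_death, threat_level, longitude, latitude, is_geocoding_exact))

names(shootings)
```

The empty age field in the second row is read as NA, which is a quick way to check how missing values arrive before exploring them.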


Report: Your project report is due at 12:15pm on Wednesday, November 17th. You should submit the report on the course website as a single well-documented knitted file. There is no specific length to the report, but you should exclude unnecessary code and output. (In particular, please do not print the entire dataset or a large subset thereof). Please create one visualization, or a few related visualizations, per chunk and comment on it afterwards.


This project should be done alone. You should not consult any other people (be they live and in person or on the internet), except for Dr. Posner. For all graphs, comment on what you see and include appropriate labels and titles. You should use the tidyverse package to complete all tasks (you are welcome to use other packages as well, but all tasks must be done within R).


Tasks: Explore the data to discover patterns in shootings. You should do the following.

1) Univariate Summaries. Remember to add appropriate labels for all visualizations and describe the distributions.

a) Produce an appropriate graphical display for each of the following variables by itself, and describe the distribution of the data.

date, age, gender, race, city, state, body_camera

Note 1 - You are welcome (and encouraged) to create a visualization with a subset of the data or with the data modified in some way to get a better picture of the distribution.

Note 2 – Make sure to explore and discuss the missing values in each variable as well.

If you already completed any of parts b, c, and/or d in your work for part a, please do not leave it blank, but rather add a comment in the part below that says something like “see part a above”.

In practice, it’s a good idea to do a univariate summary for all variables, but I am confident that if you can do it for these variables, you can do it just as well for the others.

b) Using the date variable, create visualizations for the i) day of the week and ii) year of the shootings. These variables can be derived from the date variable (Hint: use lubridate’s year and wday functions along with the label=T argument for day of the week).

c) For the body_camera variable, change the labels to “Yes” instead of TRUE and “No” instead of FALSE. (You should do this moving forward for this variable, as well as for signs_of_mental_illness, if you use them again)

d) For the race variable, change the categories into White (W), Black (B), Hispanic (H), and Other (all others, including O). Make sure to leave missing values as missing values.

e) For your graph in part 1d, use geom_text to include the values above each bar.
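The derived variables and relabelling described in parts b through e can be sketched on a toy tibble as follows. This is one possible approach rather than the expected solution: the rows are invented, "A" is a hypothetical example of a race code that falls under Other, and race_group and day_of_week are names chosen here for illustration.

```r
library(tidyverse)
library(lubridate)

# Invented rows; only the column names mirror the assignment.
toy <- tibble(
  date = as.Date(c("2015-01-02", "2016-07-04", "2021-11-17", "2021-11-18", "2021-11-19")),
  race = c("W", "W", "B", "A", NA),
  body_camera = c(TRUE, FALSE, FALSE, TRUE, FALSE)
)

toy_clean <- toy %>%
  mutate(
    year = year(date),
    day_of_week = wday(date, label = TRUE),      # ordered factor of weekday labels
    body_camera = if_else(body_camera, "Yes", "No"),
    race_group = case_when(
      race == "W" ~ "White",
      race == "B" ~ "Black",
      race == "H" ~ "Hispanic",
      is.na(race) ~ NA_character_,               # keep missing values missing
      TRUE ~ "Other"
    )
  )

# Bar chart of the recoded variable with the count printed above each bar.
p <- ggplot(toy_clean, aes(x = race_group)) +
  geom_bar() +
  geom_text(aes(label = after_stat(count)), stat = "count", vjust = -0.5) +
  labs(x = "Race", y = "Number of shootings", title = "Toy example: counts by race group")
p
```

The is.na() branch has to come before the catch-all TRUE branch; otherwise missing values would be folded into Other.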


2) Bivariate and Multivariate Summaries. Remember to add appropriate labels for all visualizations and describe the distributions.

a) Produce visualizations using each of the following pairs of variables (one graph with two aesthetics).

i) body_camera and race - The goal here is to compare, across races, the percentage of shootings in which a body camera was used.

ii) age and race – explore the age distribution across different racial groups

iii) race and year – explore the distribution of race over time (by year)

b) Add gender as a facet to the graph you created in part 2.a.i. above. Describe what you see from the visualization.

c) Produce a visualization that uses three variables, one of which should be signs_of_mental_illness. Explain why you chose the variables and type of graph that you did. Describe what you see in the visualization.
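One way to show a percentage comparison across groups, and then add a facet, is a filled bar chart. The sketch below uses invented rows and assumes the race and body_camera variables have already been relabelled (race_group is a hypothetical column name); it illustrates the technique and is not the only acceptable graph type.

```r
library(tidyverse)

# Invented rows, already relabelled.
toy <- tibble(
  race_group = c("White", "White", "Black", "Black", "White", "Black"),
  body_camera = c("Yes", "No", "No", "No", "No", "Yes"),
  gender = c("M", "M", "M", "F", "F", "M")
)

p <- ggplot(toy, aes(x = race_group, fill = body_camera)) +
  geom_bar(position = "fill") +                   # each bar is scaled to 100%
  scale_y_continuous(labels = scales::percent) +
  facet_wrap(~ gender) +                          # one panel per gender
  labs(
    x = "Race", y = "Percent of shootings", fill = "Body camera",
    title = "Toy example: body camera use by race and gender"
  )
p
```

Because position = "fill" hides the group sizes, a panel built from very few cases can look as decisive as one built from many, which is worth mentioning when describing the graph.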


3) Data Wrangling. Use the shooting data that you read into R in part 1 to create a dataset where the case is a state with the following derived variables. For each one, make sure to deal with missing values appropriately. For each variable, graph the values in descending order (Hint: the reorder function is a good choice here). Note, you do not need to include all states on all graphs, but the ones that you choose to show should be listed in descending order (for example, you could show only the top 10 in order or the top 5 in order along with the bottom 5 in order).

a) The number of shootings in that state.

b) The median age of people shot in that state.

c) The proportion of victims that were White.

d) The proportion of victims who were fleeing (based on flee: count every value except “not fleeing” as fleeing, and exclude NAs).

e) Create a scatterplot of the proportion of victims that were White vs. the proportion of victims who were fleeing. In this graph, you should find a single outlier (influential point).

i) Label this outlier with the name of the state.

ii) Remove the outlier from the graph. State whether you think this is an appropriate choice.
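The state-level dataset in parts a through d can be sketched with group_by() and summarise(). The rows below are invented, the flee values are assumed spellings, and the derived column names are hypothetical. Dropping NAs from the denominator with na.rm = TRUE is one reasonable way to handle missing values, not the only one.

```r
library(tidyverse)

# Invented rows; "Not fleeing", "Car", and "Foot" are assumed flee values.
toy <- tibble(
  state = c("CA", "CA", "CA", "TX", "TX", "RI"),
  age = c(30, 40, NA, 25, 35, 50),
  race = c("W", "H", "B", "W", "W", "W"),
  flee = c("Not fleeing", "Car", NA, "Foot", "Not fleeing", "Not fleeing")
)

by_state <- toy %>%
  group_by(state) %>%
  summarise(
    n_shootings = n(),
    median_age = median(age, na.rm = TRUE),
    prop_white = mean(race == "W", na.rm = TRUE),           # NAs left out of the denominator
    prop_fleeing = mean(flee != "Not fleeing", na.rm = TRUE)
  )

# reorder() on the negated count sorts the bars in descending order.
p <- by_state %>%
  slice_max(n_shootings, n = 10) %>%
  ggplot(aes(x = reorder(state, -n_shootings), y = n_shootings)) +
  geom_col() +
  labs(x = "State", y = "Number of shootings", title = "Toy example: shootings by state")
p
```

If every value in a state is missing, mean(..., na.rm = TRUE) returns NaN for that state, so it is worth checking for such rows before graphing.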


4) Data Scraping. We are interested in scaling the data to include population and demographic information. Population per state can be found at https://simple.wikipedia.org/wiki/List_of_U.S._states_by_population. “Racial breakdown of population by state” can be found at https://en.wikipedia.org/wiki/Demography_of_the_United_States (about halfway down the page).

a) Read these tables into a data frame (or tibble) in R using the rvest package.

Note - If you can’t scrape the data using rvest, then you can copy and paste the tables into Excel and read them in that way or use some other method of your choosing. If you do not scrape them using rvest, you cannot get higher than a 3 on the rubric for this part.

b) Calculate the number of shootings per million people living in the state. Put these onto a graph in descending order. Note, you do not need to include all states on the graph. How does this graph compare to the one you created in part 3a above?

c) Create a scatterplot of proportion of victims who are White vs. proportion of the population that is White. You should have one case (point) per state. Comment on what you see.
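The scrape-then-scale workflow in this section can be sketched as below. To keep the example runnable offline, read_html() is given a small invented HTML table instead of one of the Wikipedia URLs; the state names, populations, shooting counts, and column names are all made up. On the real pages, the position and layout of the table would need to be inspected first.

```r
library(tidyverse)
library(rvest)

# Stand-in for read_html("<page URL>"): a literal HTML string with one invented table.
page <- read_html("
  <table>
    <tr><th>State</th><th>Population</th></tr>
    <tr><td>State A</td><td>2,000,000</td></tr>
    <tr><td>State B</td><td>500,000</td></tr>
  </table>")

population <- page %>%
  html_element("table") %>%        # first table on the page
  html_table() %>%
  mutate(Population = parse_number(as.character(Population)))  # strip thousands separators

# Invented state-level counts, shaped like the part 3 summary.
shootings_by_state <- tibble(State = c("State A", "State B"), n_shootings = c(10, 5))

rates <- shootings_by_state %>%
  left_join(population, by = "State") %>%
  mutate(per_million = n_shootings / Population * 1e6)

rates
```

The shooting data identifies states by two-letter abbreviation, while tables like these usually use full names, so the join key generally has to be converted first. Base R's state.abb and state.name vectors can help with that, though they do not include the District of Columbia.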