AD699: Data Mining for Business Analytics
Individual Assignment #1
To submit, you will upload two files: The .R script that you used to store your coding steps, and a PDF of your write-up. You may wish to use a reporting tool called R Markdown, but this is not a requirement.
As always, remember to take advantage of your available resources: office hours, recitation, the textbook, the video library, the Internet, etc. As the course slogan says, “Get After It!”
Throughout the assignment, you can assume that every filtering step will “flow” into the next steps.
Once you have filtered the dataset for some particular purpose, you will use the filtered version from that point forward. Every essential filtering step is written in bold letters in the prompt.
There are no style points in 699. In other words, if your code accomplishes the stated objective, then it’s completely fine to use it. You can use something from class material, the book, the web, or any other source (as long as it works!).
This prompt may not always refer to variables in exactly the same way that the dataset does. This is very realistic. When in doubt, the names() function can be helpful for seeing variable names, including their quirks (capital letters, lowercase letters, punctuation marks, etc.).
Wherever you see a question, answer it using full sentences. The code needed for each step should also be shown in your write-up. The write-up should clearly demonstrate your process and your results.
If you are working on this at the last minute, and you run into a syntax error, do not panic. Explain the purpose of the particular step you are working on. Do not assume that a syntax error in one step will prevent you from either solving or explaining a subsequent step.
Main Topics: Data Exploration & Data Visualization
Tasks:
● Data Exploration & Visualization:
1. Download the file ‘Austin_311_Public_Data.csv’ from our class Blackboard site. This dataset contains information about every city service request made in Austin, Texas for the last several years. (The PDF posted with the assignment explains the meaning of most of the variables in this dataset, and it will be helpful for a few steps here, too.)
2. Read this file into your R environment (if it takes a while for the file to load, don’t worry -- this is normal. Be patient). Be sure to use the read.csv() function to import this dataset.
a. Call the str() function on your dataset, and show the results.
b. What does this function accomplish? How many rows and how many columns does your dataframe contain?
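A minimal sketch of the import-and-inspect step, run on a tiny inline CSV so that it works without the real file. The toy rows below are invented for illustration; with the actual dataset you would pass the file path to read.csv() instead of the text argument.

```r
# Toy stand-in for the real CSV; these rows are invented for illustration.
csv_text <- "SR.Description,Zip.Code,City
Loose Dog,78701,AUSTIN
Pothole Repair,78702,AUSTIN
Loose Dog,78701,"

# read.csv() normally takes a file path; `text =` is used here only so the
# sketch runs without the actual file.
toy <- read.csv(text = csv_text)

str(toy)    # one line per column: name, type, and the first few values
dim(toy)    # number of rows, then number of columns
nrow(toy)   # rows only
ncol(toy)   # columns only
```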
3. Filter your dataset, so that it only contains records with your assigned ZIP code (a list of all ZIP code assignments can be found on Blackboard).
a. How many records does your dataframe contain now?
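One possible way to do this kind of filter, sketched on a toy dataframe. The column name Zip.Code and the value 78701 are assumptions made for illustration; check names() on your own dataframe for the real column name.

```r
# Toy dataframe; column name and ZIP values are assumptions for illustration.
toy <- data.frame(
  SR.Description = c("Loose Dog", "Pothole Repair", "Loose Dog", "Graffiti"),
  Zip.Code       = c(78701, 78702, 78701, 78703)
)

my_zip <- 78701   # hypothetical assigned ZIP code

# which() keeps only the rows where the comparison is TRUE and silently
# drops rows where the ZIP code is NA.
toy_zip <- toy[which(toy$Zip.Code == my_zip), ]

nrow(toy_zip)     # number of records remaining after the filter
```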
4. Dealing with NA data.
a. Are there any NA values in your dataframe? How do you know this? What is the total number of NAs in the dataframe?
b. What percentage of the rows in the dataframe are complete cases? What is a complete case?
c. Convert any blank cells in the dataframe into NAs.
To accomplish that conversion, you may wish to run something like this:
df[df==""] <- NA
d. How many NAs are in the dataframe now?
e. Now, what percentage of rows in the dataframe are complete cases? Why did your answers to 4b and 4e differ – what happened?
f. Generate a table that shows the number of missing values and the percentage of missing values for each variable.
g. Remove any rows that have NA values for the column City.
h. How many rows of data do you have now?
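The NA steps above could be sketched as follows on a toy dataframe. The data are invented, and the names na_before, na_after, and missing_tbl are hypothetical helpers, not part of the assignment.

```r
# Toy dataframe with one true NA and two blank cells (invented data).
toy <- data.frame(
  City   = c("AUSTIN", "", "AUSTIN", NA),
  Street = c("MAIN ST", "ELM ST", "", "OAK ST"),
  stringsAsFactors = FALSE
)

na_before  <- sum(is.na(toy))                  # total NAs before conversion
pct_before <- mean(complete.cases(toy)) * 100  # % of rows with no NA at all

toy[toy == ""] <- NA                           # blank cells become NA

na_after  <- sum(is.na(toy))                   # blanks are now counted too
pct_after <- mean(complete.cases(toy)) * 100

# Per-variable table: count and percentage of missing values.
missing_tbl <- data.frame(
  variable    = names(toy),
  n_missing   = colSums(is.na(toy)),
  pct_missing = round(colMeans(is.na(toy)) * 100, 1),
  row.names   = NULL
)
missing_tbl

# Drop the rows whose City is NA, then count what is left.
toy <- toy[!is.na(toy$City), ]
nrow(toy)
```

Because a blank string is not an NA until it is converted, the NA count can only stay the same or rise after the conversion, and the share of complete cases can only stay the same or fall.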
5. Handling dates
a. Run the str() function to see how R views the Created.Date and Close.Date variables. What data type are they?
b. Using any method, convert each of these two variables to a ‘Date’ data type, and show that their type has been successfully converted. (Hint: you may wish to explore the anydate() function from the anytime package).
c. Now, add a new variable to the dataframe called duration. Duration should be based on the difference between Close.Date and Created.Date.
d. What is your birthday?
i. How many city service requests in Austin, TX were initiated on your birthday? What was the most common SR.Description for these requests? [Note: if you built a separate dataframe to answer 5d, just think of that dataframe as its own island – you won’t use it for any subsequent steps].
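A sketch of the date steps using base R only (the anydate() route mentioned in the hint is an alternative). The timestamp format in the toy data is an assumption and may not match the real file, the birthday of March 1 is hypothetical, and "on your birthday" is interpreted here as matching month and day in any year.

```r
# Toy data; the timestamp format here is an assumption for illustration.
toy <- data.frame(
  Created.Date   = c("2021-03-01 08:15:00", "2021-03-01 17:40:00",
                     "2022-03-01 11:05:00", "2021-07-04 09:00:00"),
  Close.Date     = c("2021-03-03 10:00:00", "2021-03-01 18:00:00",
                     "2022-03-02 09:00:00", "2021-07-10 12:30:00"),
  SR.Description = c("Loose Dog", "Pothole Repair", "Loose Dog", "Graffiti")
)

# Convert to Date; the time-of-day part after the matched format is ignored.
toy$Created.Date <- as.Date(toy$Created.Date, format = "%Y-%m-%d")
toy$Close.Date   <- as.Date(toy$Close.Date,   format = "%Y-%m-%d")
class(toy$Created.Date)   # confirms the new type

# Duration in whole days between creation and closure.
toy$duration <- as.numeric(
  difftime(toy$Close.Date, toy$Created.Date, units = "days")
)

# Separate "island" dataframe: requests created on a hypothetical birthday.
bday <- toy[format(toy$Created.Date, "%m-%d") == "03-01", ]
nrow(bday)

# Most common description among those requests.
most_common <- names(sort(table(bday$SR.Description), decreasing = TRUE))[1]
most_common
```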
6. Exploring the dataset
a. Should ZIP Code be considered a numeric or categorical variable? Why?
b. What percentage of all the 311 city service requests in your dataframe came in through the Spot311 interface?
c. What percentage of all the 311 city service requests in your dataframe were made because of loose dogs?
d. Through how many unique types of methods did Austin receive service requests?
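Percentages and unique counts of this kind can be computed as sketched below. The category labels in the toy data are invented and may be spelled differently in the real dataset, so inspect the actual values (for example with table() or unique()) before filtering on a string.

```r
# Toy data; the category labels are invented for illustration.
toy <- data.frame(
  Method.Received = c("Phone", "Spot311 Interface", "Phone", "Web",
                      "Spot311 Interface"),
  SR.Description  = c("Loose Dog", "Pothole Repair", "Loose Dog", "Graffiti",
                      "Loose Dog")
)

# The mean of a logical vector is the share of TRUE values.
pct_spot  <- mean(toy$Method.Received == "Spot311 Interface") * 100
pct_loose <- mean(toy$SR.Description == "Loose Dog") * 100

# Full percentage breakdown for one variable.
round(prop.table(table(toy$Method.Received)) * 100, 1)

# Number of distinct intake methods.
n_methods <- length(unique(toy$Method.Received))
n_methods
```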
7. Remove the following column from the dataframe: Map.Page
8. Using the quarter() function from lubridate, create a new column called season. Season should be created from the Created.Date variable. Rename the quarters so that Quarter 1 becomes “Winter”, Quarter 2 becomes “Spring”, Quarter 3 becomes “Summer” and Quarter 4 becomes “Fall.”
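One way to build and relabel the quarter column, sketched on toy dates. This assumes Created.Date has already been converted to a Date, and it uses factor() for the renaming, which is only one of several options.

```r
library(lubridate)

# Toy dates, one per calendar quarter (invented for illustration).
toy <- data.frame(
  Created.Date = as.Date(c("2021-01-15", "2021-04-02",
                           "2021-08-20", "2021-12-25"))
)

# quarter() returns 1 to 4; factor() maps those integers to season labels
# and fixes their display order for later plots.
toy$season <- factor(
  quarter(toy$Created.Date),
  levels = 1:4,
  labels = c("Winter", "Spring", "Summer", "Fall")
)

table(toy$season)
```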
9. Using ggplot, construct a barplot showing the counts of city service requests during each of the four seasons. Fill your bars with any color of your choice.
a. What do you notice about your plot? Why might it look the way it does? (note: there is NO need for domain knowledge here – just take a moment to think about it, and answer with any reasonable speculation on your part).
10. Perform another filtering step. This time, filter your dataset so that only rows with the 6 most common SR.Description types remain.
a. How many rows does your dataframe contain now?
b. Using ggplot, make a barplot that depicts the counts for these six most common SR.Description types. Color your bars. Make sure that the axis labels are readable, and that the bars are ordered by size (this can be in increasing or decreasing order – either way is fine).
c. In a sentence or two, describe your plot – what does it show?
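A sketch of a "keep the top N categories, then plot them in order" pattern on toy data. The toy set has only three categories, so the hypothetical n_top is set to 2 here; the assignment uses 6. The same pattern could be reused for the later top-6 filters.

```r
library(ggplot2)

# Toy data with three request types (invented for illustration).
toy <- data.frame(
  SR.Description = c(rep("Loose Dog", 4), rep("Pothole Repair", 3),
                     rep("Graffiti", 1))
)

n_top <- 2   # hypothetical; the assignment asks for 6

# Count each type, sort from most to least common, keep the top names.
counts    <- sort(table(toy$SR.Description), decreasing = TRUE)
top_types <- names(counts)[seq_len(min(n_top, length(counts)))]

toy_top <- toy[toy$SR.Description %in% top_types, , drop = FALSE]
nrow(toy_top)

# Setting factor levels in count order makes ggplot draw the bars by size.
toy_top$SR.Description <- factor(toy_top$SR.Description, levels = top_types)

p <- ggplot(toy_top, aes(x = SR.Description)) +
  geom_bar(fill = "steelblue") +
  coord_flip() +   # horizontal bars keep long labels readable
  labs(x = "SR.Description", y = "Number of requests")
p
```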
11. Time for more filtering! Now, filter the dataset so that only rows with the 6 most common types of Method.Received remain.
a. How many rows does your dataframe contain now?
b. Using facet_wrap, build faceted barplots now. Facet on SR.Description, using your barplots to show the totals for Method.Received.
i. What do you see here? In a sentence or two, describe what this faceted barplot shows, and point out anything that is noteworthy or unusual.
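A faceted barplot of this shape might be sketched as follows. The toy values are invented, and the fill color and label rotation are arbitrary choices.

```r
library(ggplot2)

# Toy data (invented for illustration).
toy <- data.frame(
  SR.Description  = c("Loose Dog", "Loose Dog", "Loose Dog",
                      "Pothole Repair", "Pothole Repair", "Graffiti"),
  Method.Received = c("Phone", "Phone", "Web", "Web", "Phone", "Web")
)

# One panel per SR.Description; within each panel, one bar per intake method.
p <- ggplot(toy, aes(x = Method.Received)) +
  geom_bar(fill = "darkorange") +
  facet_wrap(~ SR.Description) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Method.Received", y = "Number of requests")
p

# The same totals as a table, useful for checking the plot.
panel_counts <- table(toy$SR.Description, toy$Method.Received)
panel_counts
```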
12. Now, make a histogram that depicts the distribution of the duration variable. Customize it any way that you wish to (number of bins, color, fill, axis limits, etc.)
a. Describe this histogram. What general pattern does it show?
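A minimal histogram sketch, assuming duration is stored as a number of days. The toy durations are invented, and the bin width and colors are arbitrary.

```r
library(ggplot2)

# Toy durations in days (invented for illustration).
toy <- data.frame(duration = c(0, 0, 1, 1, 1, 2, 3, 5, 8, 30))

summary(toy$duration)   # numeric summary to read alongside the plot

p <- ggplot(toy, aes(x = duration)) +
  geom_histogram(binwidth = 1, fill = "seagreen", color = "white") +
  labs(x = "Duration (days)", y = "Number of requests")
p
```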
13. Okay, so it’s time for one last filter operation. Filter the dataframe so that only the rows with the six most common streets remain.
a. Now, you will make a proportional fill barplot. Use Street as the variable to count on one axis, and use SR.Description as the “fill” variable. Inside your geom_bar() layer, write position = "fill" (with straight quotation marks) to generate a proportional fill barplot.
b. What do you see now? What stands out here as interesting or unusual? Again, no domain knowledge is required – but write a couple of sentences of reasonable speculation that might explain some of the differences that you see here.
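A sketch of a proportional fill barplot on toy data, with the matching row proportions computed separately as a cross-check. The street names and counts are invented.

```r
library(ggplot2)

# Toy data (invented for illustration).
toy <- data.frame(
  Street         = c("MAIN ST", "MAIN ST", "MAIN ST",
                     "ELM ST", "ELM ST", "ELM ST", "ELM ST"),
  SR.Description = c("Loose Dog", "Loose Dog", "Graffiti",
                     "Graffiti", "Graffiti", "Graffiti", "Loose Dog")
)

# position = "fill" stretches every bar to height 1, so the colored segments
# show each street's mix of request types rather than raw counts.
p <- ggplot(toy, aes(x = Street, fill = SR.Description)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion of requests")
p

# Numeric equivalent: proportions within each street (each row sums to 1).
props <- prop.table(table(toy$Street, toy$SR.Description), margin = 1)
round(props, 2)
```

Because every bar has the same height, this kind of plot hides how many requests each street has; a plain count table alongside it restores that context.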
** Install the leaflet package and call the library function on leaflet. **
14. Run the following line of code. Replace each question mark with a reference to your dataframe (using the name of your own dataframe) plus the column that contains the longitude or the latitude, respectively.
m <- leaflet() %>% addTiles() %>% addCircles(lng= ? , lat= ?)
m # Print the map
Show a screenshot of your results.
15. Run something similar, but this time, select something of your choice after the dollar sign. If you’re not sure what to choose, try a few things out and explore to find out what they do!
m <- leaflet() %>% addTiles() %>% addCircles(lng= ? , lat= ? ) %>%
addProviderTiles(providers$_____________)
m # Print the map
Show a screenshot of your results.
2022-11-16