AD699: Data Mining for Business Analytics

Individual Assignment #1

To submit, you will upload two files: the .R script that you used to store your coding steps, and a PDF of your write-up. You may wish to use a reporting tool called R Markdown, but this is not a requirement.

As always, remember to take advantage of your available resources: office hours, recitation, the textbook, the video library, the Internet, etc. As the course slogan says, “Get After It!”

Throughout the assignment, you can assume that every filtering step will “flow” into the next steps. Once you have filtered the dataset for some particular purpose, you will use the filtered version from that point forward. Every essential filtering step is written in bold letters in the prompt.

There are no style points in 699. In other words, if your code accomplishes the stated objective, then it’s completely fine to use it. You can use something from class material, the book, the web, or any other source (as long as it works!).

This prompt may not always refer to variables in the exact same way that the dataset does. This is very realistic. When in doubt, the names() function can be helpful for seeing variable names, including their quirks (capital letters, lowercase letters, punctuation marks, etc.)

Wherever you see a question, answer it using full sentences. The code needed for each step should also be shown in your write-up. The write-up should clearly demonstrate your process and your results.

If you are working on this at the last minute, and you run into a syntax error, do not panic. Explain the purpose of the particular step you are working on. Do not assume that a syntax error in one step will prevent you from either solving or explaining a subsequent step.

Main Topics: Data Exploration & Data Visualization

Tasks:

● Data Exploration & Visualization:

1. Download the file ‘Austin_311_Public_Data.csv’ from our class Blackboard site. This dataset contains information about every city service request made in Austin, Texas for the last several years. (The PDF posted with the assignment explains the meaning of most of the variables in this dataset, and it will be helpful for a few steps here, too.)

2. Read this file into your R environment (if it takes a while for the file to load, don’t worry -- this is normal. Be patient). Be sure to use the read.csv() function to import this dataset.

a. Call the str() function on your dataset, and show the results.

b. What does this function accomplish? How many rows and how many columns does your dataframe contain?
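As a rough sketch of step 2, the snippet below writes a tiny made-up CSV to a temporary file and reads it back with read.csv(). The column names and values are invented for illustration only; with the real file you would pass the path to ‘Austin_311_Public_Data.csv’ instead, and the columns will differ.

```r
# Toy stand-in for the real CSV (assumed column names, invented values)
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "SR.Description,Zip.Code,City",
  "Loose Dog,78701,AUSTIN",
  "Pothole Repair,78702,AUSTIN",
  "Loose Dog,78701,"
), toy_path)

df <- read.csv(toy_path)

str(df)              # one line per column: its type and first few values
n_rows <- nrow(df)   # number of rows (records)
n_cols <- ncol(df)   # number of columns (variables)
dim(df)              # rows first, then columns
```

Note that the empty City field in the last toy row is read in as a blank string rather than NA, which matters later in step 4.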

3. Filter your dataset, so that it only contains records with your assigned ZIP code (a list of all ZIP code assignments can be found on Blackboard).

a. How many records does your dataframe contain now?
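One possible way to do the ZIP code filter in base R is sketched below. The column name Zip.Code and the value of my_zip are assumptions for the toy data; check names() on your own dataframe for the actual spelling.

```r
# Toy data; Zip.Code is an assumed column name
df <- data.frame(
  SR.Description = c("Loose Dog", "Pothole Repair", "Loose Dog", "Graffiti", "Graffiti"),
  Zip.Code = c(78701, 78702, 78701, NA, 78704)
)

my_zip <- 78701  # hypothetical assigned ZIP code

# %in% returns FALSE (not NA) for missing ZIP codes, so no all-NA rows sneak in
df_zip <- df[df$Zip.Code %in% my_zip, ]

n_zip <- nrow(df_zip)  # number of records left after filtering
```

Using == instead of %in% would also work here, but rows with a missing ZIP code would then come back as rows full of NAs.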

4. Dealing with NA data.

a. Are there any NA values in your dataframe? How do you know this? What is the total number of NAs in the dataframe?

b. What percentage of the rows in the dataframe are complete cases? What is a complete case?

c. Convert any blank cells in the dataframe into NAs.

To accomplish that conversion, you may wish to run something like this:

df[df==""] <- NA

d. How many NAs are in the dataframe now?

e. Now, what percentage of rows in the dataframe are complete cases? Why did your answers to 4b and 4e differ – what happened?

f. Generate a table that shows the number of missing values and the percentage of missing values for each variable.

g. Remove any rows that have NA values for the column City.

h. How many rows of data do you have now?
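The NA steps above can be sketched on a small toy dataframe as follows. The columns Street and Council and all values are invented for illustration; only the blank-to-NA line comes from the prompt itself.

```r
# Toy data with one real NA per affected column and two blank strings
df <- data.frame(
  City    = c("AUSTIN", "", "AUSTIN", NA),
  Street  = c("MAIN ST", "ELM ST", "", "OAK ST"),
  Council = c(1, NA, 3, 4),
  stringsAsFactors = FALSE
)

anyNA(df)                                    # TRUE if at least one NA exists
na_before  <- sum(is.na(df))                 # total number of NAs
pct_before <- 100 * mean(complete.cases(df)) # % of rows with no NA at all

df[df == ""] <- NA                           # blanks become NA (from the prompt)

na_after  <- sum(is.na(df))
pct_after <- 100 * mean(complete.cases(df))

# Count and percentage of missing values for each variable
missing_tbl <- data.frame(
  variable    = names(df),
  n_missing   = colSums(is.na(df)),
  pct_missing = 100 * colMeans(is.na(df)),
  row.names   = NULL
)

df <- df[!is.na(df$City), ]                  # drop rows where City is NA
n_left <- nrow(df)
```

In this toy example the NA count rises after the conversion because the blank strings were not counted as missing beforehand.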

5. Handling dates

a. Run the str() function to see how R views the Created.Date and Close.Date variables. What data type are they?

b. Using any method, convert each of these two variables to a ‘Date’ data type, and show that their type has been successfully converted. (Hint: you may wish to explore the anydate() function from the anytime package).

c. Now, add a new variable to the dataframe called duration. Duration should be based on the difference between Close.Date and Created.Date.

d. What is your birthday?

i. How many city service requests in Austin, TX were initiated on your birthday? What was the most common SR.Description for these requests? [Note: if you built a separate dataframe to answer 5d, just think of that dataframe as its own island – you won’t use it for any subsequent steps].
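A minimal sketch of the date handling, using base R's as.Date() rather than anydate() so that it runs without extra packages. The date strings here are invented and assumed to be in year-month-day form; the real file's format may well differ, which is where anydate() might help. The birthday "03-14" is hypothetical.

```r
# Toy data; dates assumed to be stored as "YYYY-MM-DD" strings
df <- data.frame(
  Created.Date   = c("2021-03-14", "2021-03-14", "2021-07-01"),
  Close.Date     = c("2021-03-20", "2021-03-15", "2021-07-31"),
  SR.Description = c("Loose Dog", "Loose Dog", "Pothole Repair"),
  stringsAsFactors = FALSE
)

df$Created.Date <- as.Date(df$Created.Date, format = "%Y-%m-%d")
df$Close.Date   <- as.Date(df$Close.Date,   format = "%Y-%m-%d")
class(df$Created.Date)   # confirms the conversion

# Duration in days between closing and creation
df$duration <- as.numeric(difftime(df$Close.Date, df$Created.Date, units = "days"))

# Requests created on a hypothetical birthday (month and day, any year),
# kept in a separate dataframe so the main one is untouched
bday     <- df[format(df$Created.Date, "%m-%d") == "03-14", ]
n_bday   <- nrow(bday)
top_desc <- names(which.max(table(bday$SR.Description)))
```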

6. Exploring the dataset

a. Should ZIP Code be considered a numeric or categorical variable? Why?

b. What percentage of all the 311 city service requests in your dataframe came in through the Spot311 interface?

c. What percentage of all the 311 city service requests in your dataframe were made because of loose dogs?

d. Through how many unique types of methods did Austin receive service requests?
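For the percentage questions, one approach is to take the mean of a logical comparison, since TRUE counts as 1 and FALSE as 0. The labels "Spot311 Interface" and "Loose Dog" below are assumptions for the toy data; the exact strings in the real dataset should be checked with unique() or table().

```r
# Toy data; the category labels are invented and may not match the real file
df <- data.frame(
  Method.Received = c("Spot311 Interface", "Phone", "Phone", "Web", "Spot311 Interface"),
  SR.Description  = c("Loose Dog", "Graffiti", "Loose Dog", "Pothole", "Graffiti"),
  stringsAsFactors = FALSE
)

# Share of rows matching a label, expressed as a percentage
pct_spot <- 100 * mean(df$Method.Received == "Spot311 Interface")
pct_dog  <- 100 * mean(df$SR.Description == "Loose Dog")

# Number of distinct intake methods
n_methods <- length(unique(df$Method.Received))
```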

7. Remove the following column from the dataframe: Map.Page

8. Using the quarter() function from lubridate, create a new column called season. Season should be created from the Created.Date variable. Rename the quarters so that Quarter 1 becomes “Winter”, Quarter 2 becomes “Spring”, Quarter 3 becomes “Summer” and Quarter 4 becomes “Fall.”
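The quarter-to-season renaming can be sketched with factor(). To keep the example free of package dependencies, the quarter number is computed here with a base-R stand-in; with lubridate loaded, quarter(created) should give the same integers, and that is the function the assignment asks for. The dates are invented.

```r
# Toy dates, one per quarter
created <- as.Date(c("2021-01-15", "2021-04-02", "2021-08-20", "2021-12-25"))

# Base-R stand-in for lubridate::quarter(): months 1-3 -> 1, ..., 10-12 -> 4
qtr <- (as.integer(format(created, "%m")) - 1) %/% 3 + 1

# Relabel quarters 1..4 as season names
season <- factor(qtr, levels = 1:4,
                 labels = c("Winter", "Spring", "Summer", "Fall"))
table(season)
```

Storing season as a factor with explicit levels also keeps the bars in calendar order when plotting.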

9. Using ggplot, construct a barplot showing the counts of city service requests during each of the four seasons. Fill your bars with any color of your choice.

a. What do you notice about your plot? Why might it look the way it does? (note: there is NO need for domain knowledge here – just take a moment to think about it, and answer with any reasonable speculation on your part).

10. Perform another filtering step. This time, filter your dataset so that only rows with the 6 most common SR.Description types remain.

a. How many rows does your dataframe contain now?

b. Using ggplot, make a barplot that depicts the counts for these six most common SR.Description types. Color your bars. Make sure that the axis labels are readable, and that the bars are ordered by size (this can be in increasing or decreasing order – either way is fine).

c. In a sentence or two, describe your plot – what does it show?
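One way to keep only the six most common categories, and to order the bars by size, is sketched below. The category names A through G are placeholders. The plotting part only runs if ggplot2 is installed, and the fill color and flipped axes are just one choice among many.

```r
# Toy data: seven placeholder categories with counts 7, 6, 5, 4, 3, 2, 1
df <- data.frame(
  SR.Description = rep(c("A", "B", "C", "D", "E", "F", "G"), times = 7:1),
  stringsAsFactors = FALSE
)

# Names of the six most frequent categories, most common first
counts <- sort(table(df$SR.Description), decreasing = TRUE)
top6   <- head(names(counts), 6)

df_top <- df[df$SR.Description %in% top6, , drop = FALSE]
n_top  <- nrow(df_top)

# Setting the factor levels in count order makes ggplot draw the bars by size
df_top$SR.Description <- factor(df_top$SR.Description, levels = top6)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(df_top, ggplot2::aes(x = SR.Description)) +
    ggplot2::geom_bar(fill = "steelblue") +
    ggplot2::coord_flip() +   # horizontal bars keep long labels readable
    ggplot2::labs(x = "Request type", y = "Count")
}
```

The same top-six pattern should carry over to the later filters on Method.Received and Street by swapping the column name.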

11. Time for more filtering! Now, filter the dataset so that only rows with the 6 most common types of Method.Received remain.

a. How many rows does your dataframe contain now?

b. Using facet_wrap, build faceted barplots now. Facet on SR.Description, using your barplots to show the totals for Method.Received.

i. What do you see here? In a sentence or two, describe what this faceted barplot shows, and point out anything that is noteworthy or unusual.

12. Now, make a histogram that depicts the distribution of the duration variable. Customize it any way that you wish to (number of bins, color, fill, axis limits, etc.)

a. Describe this histogram. What general pattern does it show?

13. Okay, so it’s time for one last filter operation. Filter the dataframe so that only the rows with the six most common streets remain.

a. Now, you will make a proportional fill barplot. Use Street as the variable to count on one axis, and use SR.Description as the “fill” variable. Inside your geom_bar() layer, write: position="fill" to generate a proportional fill barplot.

b. What do you see now? What stands out here as interesting or unusual? Again, no domain knowledge is required – but write a couple of sentences of reasonable speculation that might explain some of the differences that you see here.
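To sanity-check what a proportional fill barplot is showing, it can help to compute the same shares as a table. The sketch below uses invented street names and request types; each row of the result sums to 1, which corresponds to each bar being stretched to full height.

```r
# Toy data: two streets with different mixes of request types
df <- data.frame(
  Street = c("MAIN ST", "MAIN ST", "MAIN ST", "MAIN ST", "ELM ST", "ELM ST"),
  SR.Description = c("Loose Dog", "Loose Dog", "Loose Dog", "Graffiti",
                     "Loose Dog", "Graffiti"),
  stringsAsFactors = FALSE
)

# Within-street proportions (margin = 1 normalizes each row)
shares <- prop.table(table(df$Street, df$SR.Description), margin = 1)
round(shares, 2)
```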

** Install the leaflet package and call the library function on leaflet. **

14. Run the following line of code. Instead of saying “dataframe”, use the name of your dataframe. Replace the question marks with a reference to the dataframe, plus the columns that contain longitude and latitude.

m <- leaflet() %>% addTiles() %>% addCircles(lng= ? , lat= ?)

m # Print the map

Show a screenshot of your results.

15. Run something similar, but this time, select something of your choice after the dollar sign. If you’re not sure what to choose, try a few things out and explore to find out what they do!

m <- leaflet() %>% addTiles() %>% addCircles(lng= ? , lat= ? ) %>%