AD699: Data Mining for Business Analytics

Individual Assignment #1

Spring 2023

Due by: Friday, 17FEB @ 11:59 p.m.

To submit, you will upload two files: the .R script that you used to store your coding steps, and a PDF of your write-up. You may wish to use a reporting tool called R Markdown, but this is not a requirement.

As always, remember to take advantage of your available resources: office hours, recitation, the textbook, the video library, the Internet, etc. As the course slogan says, “Get After It!”

Throughout the assignment, you can assume that every filtering step will “flow” into the next steps.

Once you have filtered the dataset for some particular purpose, you will use the filtered version from that point forward. Every essential filtering step is written in bold letters in the prompt.

There are no style points in 699. In other words, if your code accomplishes the stated objective, then it’s completely fine to use it. You can use something from class material, the book, the web, or any other source (as long as it works!)

This prompt may not always refer to variables in the exact same way that the dataset does. This is very realistic. When in doubt, the names() function can be helpful for seeing variable names, including their quirks (capital letters, lowercase letters, punctuation marks, etc.)
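As a small illustration of the tip above, the sketch below calls names() on a made-up data frame. The data frame and its column names are invented for this example and are not taken from the assignment dataset.

```r
# Toy data frame with deliberately inconsistent column names
# (hypothetical; these are not the real dataset's columns)
toy <- data.frame(
  WARD = c(1, 2),
  Year.Built = c(1965, 1972),
  score = c(80, 75)
)

# names() returns the column names exactly as R stores them,
# including capitalization and punctuation
col_names <- names(toy)
print(col_names)
```

Comparing this output against the wording in a prompt is a quick way to spot differences in case or punctuation before writing a filter.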

Wherever you see a question, answer it using full sentences. The code needed for each step should also be shown in your write-up. The write-up should clearly demonstrate your process and your results.

If you are working on this at the last minute, and you run into a syntax error, do not panic. Explain the purpose of the particular step you are working on. Do not assume that a syntax error in one step will prevent you from either solving or explaining a subsequent step.

Main Topics: Data Exploration & Data Visualization

Tasks:

● Data Exploration & Visualization:

1. Download the file ‘apartments_toronto.csv’ from our class Blackboard site. This dataset contains information about apartment building inspections conducted in the city of Toronto, Ontario. A separate file with a dataset description will be posted to Blackboard. You can also find a dataset description here:

https://open.toronto.ca/dataset/apartment-building-evaluation/

2. Read this file into your R environment (if it takes a while for the file to load, don’t worry -- this is normal. Be patient). Be sure to use the read.csv() function to import this dataset.

a. Call the str() function on your dataset, and show the results.

b. What does this function accomplish? How many rows and how many columns does your dataframe contain?
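For orientation only, here is a minimal sketch of str() alongside nrow() and ncol() on a tiny made-up data frame. The object name toy and its columns are assumptions for illustration; the real dataset would come from read.csv() and will look different.

```r
# Toy data frame standing in for an imported CSV (hypothetical values)
toy <- data.frame(
  id = 1:3,
  ward = c("A", "B", "A"),
  score = c(71.5, 88, 64),
  stringsAsFactors = FALSE
)

# str() prints a compact summary: dimensions, column names, types, first values
str(toy)

# nrow() and ncol() report the dimensions directly
n_rows <- nrow(toy)
n_cols <- ncol(toy)
```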

3. Filter your dataset, so that it only contains records with your assigned ward (a list of all ward name assignments can be found on Blackboard).

a. How many records does your dataframe contain now?

4. Dealing with NA data.

a. Are there any NA values in your dataframe? How do you know this? What is the total number of NAs in the dataframe?

b. Generate a table that shows the number of missing values and the percentage of missing values for each variable. Which variables have missing values? Pick any three columns with missing values – in a sentence or two for each one, explain why the column might have missing values. (Note: There is no domain knowledge needed to answer this question – any thoughtful explanation here is fine). Remember to consult the dataset description. You may wish to use a function from the naniar package for this step, but you are not required to.

c. For any column(s) whose values are more than 50% missing, remove the column(s) entirely.
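One possible way to approach the NA steps above is sketched here in base R, as a stand-in for the naniar approach the prompt mentions. The toy data frame and the object names (na_table, toy_kept) are invented for this example.

```r
# Toy data frame with some missing values (hypothetical)
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, NA, "x"),
  c = c(10, 20, 30, 40)
)

# Total number of NA cells across the whole data frame
total_na <- sum(is.na(toy))

# Per-column count and percentage of missing values
na_count <- colSums(is.na(toy))
na_pct <- 100 * colMeans(is.na(toy))
na_table <- data.frame(
  variable = names(toy),
  n_missing = unname(na_count),
  pct_missing = unname(na_pct)
)
print(na_table)

# Keep only the columns that are at most 50% missing
toy_kept <- toy[, na_pct <= 50, drop = FALSE]
```

Here column b is 75% missing, so it is the only one dropped.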

5. Handling dates

a. Which column in this dataset contains dates? Run the str() function to see how R views this variable. What data type is it seen as?

b. Using any method, convert this variable to a ‘Date’ data type, and show that its type has been successfully converted. (Note: Be careful to pay attention to the particular way the date is written – if you don’t, this will not come out the way you want it to).

c. What is your birth month? (just answer here with the month – you won’t use the day or year).

i. How many Toronto building inspections in your ward were made during your birth month?
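The date steps above might look roughly like the following sketch. The sample strings and the "%m/%d/%Y" format are assumptions made for illustration; the format string has to match however the dates are actually written in the dataset.

```r
# Hypothetical date strings written as month/day/year
raw_dates <- c("03/15/2021", "11/02/2020", "03/30/2019")
class(raw_dates)   # character, before conversion

# Convert to Date; the format string must match the text layout
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")
class(parsed)      # Date, after conversion

# Extract the month number and count the rows falling in March
month_num <- as.integer(format(parsed, "%m"))
n_march <- sum(month_num == 3)
```

If the format string does not match the text, as.Date() returns NA values, which is a useful signal that the layout was misread.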

6. Exploring the dataset

a. Should “Ward” be considered a numeric or categorical variable? Why?

b. Average storeys

1. What is the median number of confirmed storeys for the buildings in your ward?

2. What is the mean number of confirmed storeys for the buildings in your ward?

3. Write a sentence or two that could help to explain the difference (or similarity) between the two average storey values that you just found. What might explain this?

c. What percentage of all the buildings in your ward received a result of “Evaluation needs to be conducted in 3 years”?

d. What is the oldest building in your ward? What overall evaluation score did it earn? (Note: If several buildings are tied for ‘oldest’ because they were built in the same year, you can pick any of the ones from that year to answer this).
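The summary calculations in this step can be tried out on made-up numbers first. In the sketch below, the storey counts and result labels are invented, and the object names are assumptions for illustration.

```r
# Made-up storey counts, with one much taller building
storeys <- c(3, 4, 4, 5, 6, 30)

med <- median(storeys)   # middle of the sorted values
avg <- mean(storeys)     # arithmetic average, sensitive to extreme values

# Share of rows matching a given label, as a percentage
result <- c("A", "B", "A", "A")
pct_a <- 100 * mean(result == "A")
```

With these values the median is 4.5 while the mean is about 8.67, because the single 30-storey value pulls the mean upward.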

7. Using the quarter() function from lubridate, create a new column called season. Season should be created from the Evaluation_Completed_On variable. Next, rename the quarters so that Quarter 1 becomes “Winter”, Quarter 2 becomes “Spring”, Quarter 3 becomes “Summer” and Quarter 4 becomes “Fall.”
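The renaming part of this step could be sketched as below. To keep the example free of package dependencies, the quarter number is computed in base R as a stand-in; lubridate's quarter() returns the same integers 1 to 4 by default. The sample dates are invented.

```r
# Hypothetical evaluation dates, one per quarter
dates <- as.Date(c("2021-01-15", "2021-05-03", "2021-08-20", "2021-12-01"))

# Quarter number (1-4) from the month; a base R stand-in for lubridate::quarter()
q <- (as.integer(format(dates, "%m")) - 1) %/% 3 + 1

# Map quarter numbers to season labels with a factor
season <- factor(q, levels = 1:4,
                 labels = c("Winter", "Spring", "Summer", "Fall"))
print(season)
```

Using a factor with explicit levels keeps the seasons in calendar order when they are later plotted.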

8. Using ggplot, construct a barplot showing the counts of completed evaluations during each of the four seasons. Fill your bars with any color of your choice.

a. What do you notice about your plot? Why might it look the way it does? (note: there is NO need for domain knowledge here – just take a moment to think about it, and answer with any reasonable speculation on your part).

9. Again using ggplot, let’s make another barplot. This time, place the property types on the x-axis. On the y-axis, show the mean scores for graffiti ratings for each property type.

a. What does this plot show you? Write 1-2 sentences that speculate about potential reasons for the different ratings among the property types. (remember that 5 is the ‘best’, or cleanest, graffiti score).

10. Using ggplot, make a histogram that shows the distribution of the ‘SCORE’ variable. Use any number of bins, and stylize it with any color/fill values of your choice.

a. In a sentence or two, describe this histogram. What does it show about the scores?

11. Using ggplot, make a histogram that shows the distribution of the ‘GRAFFITI’ variable.

a. In a sentence or two, describe your plot – what does it show?

12. Now, generate faceted histograms. These histograms should depict the distribution of the SCORE variable, faceted on the RESULTS_OF_SCORE variable. Fill your histograms with any color of your choice, and use whatever number of bins that you wish to use.

a. What do you see here? In a few sentences, describe what these faceted histograms show. What connection can you make between the score values and the ‘results of score’ outcomes?

13. Okay, so it’s time for one more filter operation. Filter the dataframe so that only the five most common streets from your ward remain. You can approach this any way that you would like to – but you may wish to first split the SITE_ADDRESS variable into separate columns to make this easier.

a. Filter your dataframe so that it only contains properties from the five most common streets in your ward.

b. Now, build a scatterplot with this newly filtered version of your data. Place YEAR_BUILT along your x-axis. Place SCORE on your y-axis, and use different colors for each street name. What do you see here? Are there any patterns or interesting takeaways from this graph?
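One possible way to find the most common streets is sketched here on made-up addresses. It assumes each address begins with a house number followed by a space, which may not hold for every row in the real data; the street column and the n_top variable are hypothetical names, and n_top is set to 2 only to keep the toy example small.

```r
# Made-up addresses (hypothetical; assumes "number street" layout)
toy <- data.frame(
  SITE_ADDRESS = c("10 MAIN ST", "12 MAIN ST", "5 OAK AVE",
                   "7 OAK AVE", "9 OAK AVE", "1 ELM RD"),
  stringsAsFactors = FALSE
)

# Drop the leading house number to leave the street name
toy$street <- sub("^[^ ]+ +", "", toy$SITE_ADDRESS)

# Count rows per street, most common first
street_counts <- sort(table(toy$street), decreasing = TRUE)

# Keep the n_top most common streets (the assignment asks for five)
n_top <- 2
top_streets <- names(street_counts)[seq_len(n_top)]
toy_top <- toy[toy$street %in% top_streets, ]
```

Ties at the cutoff are resolved by whatever order sort() returns, so it is worth inspecting street_counts before filtering.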

** Install the leaflet package and call the library function on leaflet. **

14. Next, you will see your Toronto ward in map format, using a package known as leaflet.

Run the following line of code. For the longitude and latitude parameters for the addCircles() function, you will need to specify that you wish to use the longitude and latitude values from your dataframe:

m <- leaflet() %>% addTiles() %>% addCircles(lng= ? , lat= ?)

m # Print the map

Show a screenshot of your results.

15. Run something similar, but this time, select something of your choice after the dollar sign. If you’re not sure what to choose, try a few things out and explore to ﬁnd out what they do!

m <- leaflet() %>% addTiles() %>% addCircles(lng= ? , lat= ? ) %>%