
Introduction to Data Science 11372 & 11516

Assignment 1 - Data Wrangling

Motivation

Data Description

Copyright of Data

Tasks

  Part A - Reading (20 Marks)

  Part B - Preparing Data for Analysis (20 marks)

Deliverables

Motivation

The purpose of this assignment is to assess your skills in reading data files into a single data frame and applying cleaning and wrangling steps to get the data ready for modelling.

Data Description

The observations in the attached CSV files have been taken from the Bureau of Meteorology’s “real time” system. These observations provide some details about the weather in the Australian Capital Territory for 39 months. Most of the data are generated automatically. Some quality checking has been performed, but it is still possible for erroneous values to appear. Sometimes, when the daily maximum and minimum temperatures, rainfall or evaporation are missing, the next value given has been accumulated over several days rather than the normal one day.

There are 39 comma-separated data files provided with this assignment. These data are for the months from August 2018 to February 2022 (some months were skipped). The variables reported in each file are described in Table 1.

 

Table 1 - Column Meanings.

Copyright of Data

Copyright of Bureau of Meteorology materials resides with the Commonwealth of Australia. Apart from any fair dealing for purposes of study, research, criticism and review, as permitted under copyright legislation, no part of this product may be reproduced, re-used or redistributed for any commercial purpose whatsoever, or distributed to a third party for such purpose, without written permission from the Director of Meteorology.

Tasks

The following sections have tasks that must be attempted and reported on in your submission. You will provide an .rmd file that includes aspects of your formal reporting (e.g., an introduction, assumptions) using markdown syntax, mixed with your R code and results.

Note: 5 marks are allocated for the overall presentation of the submission, including coding style (for examples, see Google's R Style Guide (https://google.github.io/styleguide/Rguide.html)).

Part A - Reading (20 Marks)

There are 39 CSV files that need to be imported into a dataframe (or tibble) for further analysis.

1. Note that the data for the months of Feb-2018 to Feb-2020 and from Sep-2021 to Feb-2022 have a different date format compared to the data for the months Jul-2020 to Aug-2021. Update the sample code below to prepare and parse the two different date formats (2 marks)

# You should notice in the data that we have two different formats: Australian (i.e., day/month/year)
# and US (i.e., month/day/year). Please pass sample dates from the data (from any files) to the
# following two variables. You will then see that R turns both of them into the same format,
# which guarantees that the dates are put into a consistent format.

AU_Date_Format <-

 

str(as.Date("Enter some date", format=AU_Date_Format))

str(as.Date("Enter some other date", format=AU_Date_Format))

 

US_Date_Format <-

 

str(as.Date("Enter some date", format=US_Date_Format))

str(as.Date("Enter some other date", format=US_Date_Format))
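As a minimal illustration of the idea behind this task, the sketch below parses one hypothetical date written in both styles using base R's `as.Date`. The sample strings and the four-digit-year format strings are assumptions made for the example; the actual files may use two-digit years or other separators, in which case the format strings must change.

```r
# Hypothetical sample dates; not taken from the assignment files.
au_format <- "%d/%m/%Y"  # day/month/year
us_format <- "%m/%d/%Y"  # month/day/year

au_date <- as.Date("25/08/2018", format = au_format)
us_date <- as.Date("08/25/2018", format = us_format)

# Both values print in ISO 8601 form (YYYY-MM-DD) and compare equal.
print(au_date)
print(us_date)
same_day <- identical(au_date, us_date)

# A string that does not fit the format yields NA rather than an error.
wrong_format <- as.Date("25/08/2018", format = us_format)
is.na(wrong_format)
```

Note that a date such as 03/04/2019 parses without complaint under either format, so a single ambiguous value cannot reveal which convention a file uses.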

2. Correct the following code to loop through your working directory and concatenate the files into one dataframe (10 marks)

setwd("change this to the path of your working directory")

files <- list.files(".","data/*.csv")

act_weather_data <- data_frame

 

for (i in 1:length(files)) {

data <- read_csv(files, show_col_types = FALSE)

assertthat::assert_that(nrow(problems(data)) == 0,

msg="There is still problem/s, which you need to fix first")

 

temp <- tryCatch (

expr    = {  parse_date(data$Date, AU_Date_Format)},

warning = function (e) { parse_date(data$Date, AU_Date_Format)})

 

 

Date <- format(temp, AU_Date_Format)

 

act_weather_data <- rbind(act_weather_data, data)

}

 

# Clean up any temporary variables no longer needed

rm(file, data, temp, i)

Note: In the above code, we provide you with a way to combine the files into a single data frame; however, you will need to fix the syntax errors in this code to get it working. Alternatively, you can write your own code from scratch to combine the files into a single data frame.
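For readers who take the "from scratch" route, one possible shape of such a routine is sketched below in base R. It is not the template's method: the temporary folder, file names, columns and the helper `read_one` are all hypothetical, and the format strings assume four-digit years. The sketch decides the date format per file, on the assumption that every monthly file contains at least one day greater than 12 and no genuinely missing dates.

```r
# Build two small hypothetical CSV files so the sketch is self-contained.
data_dir <- tempfile("weather_demo")
dir.create(data_dir)
write.csv(data.frame(Date = c("13/08/2018", "14/08/2018"), Rainfall = c(0, 1.2)),
          file.path(data_dir, "201808.csv"), row.names = FALSE)
write.csv(data.frame(Date = c("07/13/2020", "07/14/2020"), Rainfall = c(3.4, 0)),
          file.path(data_dir, "202007.csv"), row.names = FALSE)

# list.files() expects a regular expression, not a shell-style glob.
files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)

read_one <- function(path) {
  monthly <- read.csv(path, stringsAsFactors = FALSE)
  # Try day/month/year first; if any date fails, re-parse the whole
  # file as month/day/year.
  parsed <- as.Date(monthly$Date, format = "%d/%m/%Y")
  if (anyNA(parsed)) {
    parsed <- as.Date(monthly$Date, format = "%m/%d/%Y")
  }
  # Store every date as text in one consistent day/month/year form.
  monthly$Date <- format(parsed, "%d/%m/%Y")
  monthly
}

combined <- do.call(rbind, lapply(files, read_one))
dim(combined)
```

Reading each file through one helper and binding the results once avoids growing a data frame inside a loop, and it keeps the format decision next to the code that reads the file.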

3. Run the following code to demonstrate that the loading of the data has been executed correctly. Provide an explanation in words of what each line does (8 marks):

dim(act_weather_data)

str(act_weather_data)

 

act_weather_data %>% group_by(Date) %>% summarise(count=n()) %>%

summarise(max = max(count))

 

act_weather_data %>% summarise_all(funs(sum(is.na(.)))) %>%

gather() %>% filter(value > 0)
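The two pipelines above check for repeated dates and for columns containing missing values. Since `funs()` is deprecated in recent dplyr releases, the same two checks can also be sketched in base R, as below. The data frame `toy` and its columns are hypothetical stand-ins for `act_weather_data`.

```r
# Hypothetical toy data standing in for act_weather_data.
toy <- data.frame(
  Date     = c("01/08/2018", "02/08/2018", "03/08/2018"),
  Rainfall = c(0, NA, 1.2),
  Sunshine = c(NA, NA, NA),
  MaxTemp  = c(12.1, 13.4, 11.8),
  stringsAsFactors = FALSE
)

# Largest number of rows sharing one Date; 1 means no date is duplicated.
max_rows_per_date <- max(table(toy$Date))

# Missing values per column, keeping only columns with at least one NA.
na_counts <- colSums(is.na(toy))
na_counts[na_counts > 0]
```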

Part B - Preparing Data for Analysis (20 marks)

Write code to do the following tasks:

1. Remove the variables that have no data at all (i.e. all the records in these variables are NAs). (3 marks)

2. Drop the variables that have little data (i.e. NA values make up more than 90% of the records in these variables). (2 marks)

3. Change the column names to have no spaces between the words, replacing each space with the underscore character (_). (2 marks)

4. Change the type of the column called Date from character to the Date data type. (2 marks)

5. Add two new columns for the month and year of the data in each file; you may extract the contents of these columns from the Date column. Please note that the data are collected for 39 months across 5 years (2018 to 2022). (4 marks)

6. Change the type of the Month and Year columns from character to ordinal, with levels equal to the number of months in a year (i.e. 12) and the number of years (5). (3 marks)

7. For all the numeric columns, replace the remaining NAs, if any exist, with the median of the values in the column. (4 marks)

8. Save the weather data to a single file for submission (2 marks)
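The steps listed above can be illustrated on a small toy data frame. The sketch below uses base R only; the column names, values, day/month/year format, year range 2018 to 2022 and output path are all assumptions for the example rather than details of the real data. Tasks 1 and 2 are handled here with a single 90% threshold, since a column that is entirely NA also exceeds it.

```r
# Hypothetical toy data; names and values are illustrative only.
weather <- data.frame(
  Date = sprintf("%02d/08/2018", 1:20),
  `Maximum temperature` = c(12.5, NA, 14.1, 13.0, rep(12.0, 16)),
  `All missing` = NA,
  `Mostly missing` = c(1.0, rep(NA, 19)),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Tasks 1 and 2: drop columns whose share of NAs exceeds 90%.
na_share <- colMeans(is.na(weather))
weather <- weather[, na_share <= 0.9, drop = FALSE]

# Task 3: replace spaces in column names with underscores.
names(weather) <- gsub(" ", "_", names(weather), fixed = TRUE)

# Task 4: character to Date (day/month/year is assumed here).
weather$Date <- as.Date(weather$Date, format = "%d/%m/%Y")

# Tasks 5 and 6: month and year as ordered factors.
weather$Month <- factor(as.integer(format(weather$Date, "%m")),
                        levels = 1:12, ordered = TRUE)
weather$Year <- factor(as.integer(format(weather$Date, "%Y")),
                       levels = 2018:2022, ordered = TRUE)

# Task 7: fill remaining NAs in numeric columns with the column median.
numeric_cols <- names(weather)[sapply(weather, is.numeric)]
for (col in numeric_cols) {
  col_median <- median(weather[[col]], na.rm = TRUE)
  weather[[col]][is.na(weather[[col]])] <- col_median
}

# Task 8: write the result to one CSV file (a temporary path in this sketch).
out_path <- file.path(tempdir(), "act_weather_demo.csv")
write.csv(weather, out_path, row.names = FALSE)
```

Declaring all 12 months and all 5 years as levels, even when some do not occur in the data, keeps the ordering stable when skipped months are plotted or tabulated later.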

Deliverables

You are required to submit a compressed (e.g. ZIP) file to Canvas with the following files:

1. A single .rmd file with the markdown report and code for Tasks: Part A and Part B,

2. An HTML or PDF document generated by knitting your .rmd file,

3. The act_weather_part_A.csv file created at the end of Part A, and

4. The act_weather_part_B.csv file created at the end of Part B.

Please use the following structure to name the submitted zip file:

[studentID_lastname_assignment1.zip]

Replace studentID with your university ID and lastname with your last name.