FIT5196-S1-2021


Assessment 3

This is an individual assessment and is worth 30% of your total mark for FIT5196.

Due date: Please check, Assessment 3: Data Integration and Reshaping

In this assessment, you must write Python code to integrate several datasets into one single schema and find and fix possible problems in the data. The input and output of this assessment are shown below:

Table 1. The inputs and outputs of the task

  Inputs: <Student_ID>.zip, Vic_suburb_boundary.zip, gtfs.zip
  Outputs: <Student_ID>_A3_solution.csv
  Jupyter-Notebook & pdf: <Student_ID>_ass3.ipynb, <Student_ID>_ass3.pdf

Note: Submit a single zip file containing the CSV, IPYNB, and PDF files.

The PDF file should be generated from your Jupyter notebook file (after clearing all cell outputs), and it will be used for plagiarism checks via Turnitin.

Each of you is given seven (7) datasets in various formats, and the data is about housing information in Victoria, Australia. You can find your dataset here. In this assignment, you need to perform the following tasks.


Task 1: Data Integration (60%)

In this task, you must integrate the input datasets (i.e., seven datasets including hospitals, schools, recreational activity areas, real estate files (one XML and one CSV), Vic_suburb_boundary, and gtfs) into one dataset with the following schema.

Table 2. Description of the final schema

  Column: Description
  Property_id: A unique id for the property.
  lat: The property latitude.
  lng: The property longitude.
  addr_street: The property address.
  suburb (21%): The property suburb.
  price: The property price.
  property_type: The type of the property.
  year: The year in which the property was sold.
  bedrooms: Number of bedrooms.
  bathrooms: Number of bathrooms.
  parking_space: The number of parking spaces on the property.
  School_id (5%): The closest school to the property.
  Distance_to_school (1%): The distance from the closest school to the property.
  Train_station_id (10%): The closest train station to the property.
  Distance_to_train_station (1%): The distance from the closest train station to the property.
  travel_min_to_CBD (25%): The average travel time (minutes) from the closest train station to the “Southern Cross Station” station on weekdays (i.e., Monday-Friday) for trips departing between 7 and 9 am. For example, if three (3) trips depart from the closest train station to the Southern Cross station on weekdays between 7 and 9 am, and they take 6, 7, and 8 minutes respectively, then the value of this column for the property should be (6+7+8)/3.
  Transfer_flag (25%): A Boolean attribute indicating whether there is a direct trip to the Southern Cross station from the closest station between 7 and 9 am on weekdays. This flag is zero (0) if there is a direct trip (i.e., no transfer between trains is required to get from the closest train station to the Southern Cross station) and one (1) otherwise.
  Hospital_id (5%): The closest hospital to the property.
  Distance_to_hospital (1%): The distance from the closest hospital to the property.
  Recreation_centre_id (5%): The closest recreation activity centre to the property.
  Distance_to_Recreation_centre (1%): The distance from the closest recreation activity centre to the property.
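The travel_min_to_CBD and Transfer_flag definitions in Table 2 can be sketched as follows. This is only an illustration under stated assumptions, not the required method. The `Trip` record, its field names, and the sample times are hypothetical stand-ins for whatever you extract from the gtfs files. The sketch also assumes that the weekday filter has already been applied, that times are zero-padded "HH:MM:SS" strings, that the average is taken over every trip in the window, and that a station with no trips in the window gets no travel time and a flag of 1. The specification does not settle the last two points, so check them against your own reading of the task.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import List, Optional, Tuple


@dataclass
class Trip:
    """Hypothetical record: one weekday trip from the closest station to Southern Cross."""
    depart: str   # departure time at the closest station, zero-padded "HH:MM:SS"
    arrive: str   # arrival time at Southern Cross, zero-padded "HH:MM:SS"
    direct: bool  # True if no transfer between trains is needed


def minutes_between(depart: str, arrive: str) -> float:
    """Travel time in minutes between two same-day "HH:MM:SS" times."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(arrive, fmt) - datetime.strptime(depart, fmt)
    return delta.total_seconds() / 60


def summarise(trips: List[Trip]) -> Tuple[Optional[float], int]:
    """Return (travel_min_to_CBD, Transfer_flag) for one station's trips."""
    # Keep only trips departing between 7 and 9 am; zero-padded strings sort correctly.
    window = [t for t in trips if "07:00:00" <= t.depart <= "09:00:00"]
    if not window:
        # Assumption: no trip in the window means no travel time and flag = 1.
        return None, 1
    travel = mean(minutes_between(t.depart, t.arrive) for t in window)
    # Flag is 0 when at least one direct trip exists, 1 otherwise.
    flag = 0 if any(t.direct for t in window) else 1
    return travel, flag


# The worked example from Table 2: three trips taking 6, 7, and 8 minutes.
sample = [
    Trip("07:10:00", "07:16:00", True),
    Trip("07:40:00", "07:47:00", True),
    Trip("08:30:00", "08:38:00", True),
    Trip("09:30:00", "09:50:00", False),  # outside the window, ignored
]
travel_min, transfer_flag = summarise(sample)
```

With the sample above, the three trips inside the window average (6+7+8)/3 minutes and the flag is 0 because a direct trip exists.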


Task 2: Data Reshaping (20%)

In this task, you need to study the effect of different normalization/transformation methods (i.e., standardization, min-max normalization, log, power, and Box-Cox transformation) on the “price”, “Distance_to_school”, “travel_min_to_CBD”, and “Distance_to_Recreation_centre” attributes.
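For reference, the five methods named above can be written out in a few lines each. The sketch below uses only the Python standard library so that the formulas stay visible. The function names, the default exponent, and the sample values are illustrative assumptions, and the Box-Cox parameter is supplied by hand here (in practice it is usually estimated from the data, for example by `scipy.stats.boxcox`). The log and Box-Cox transformations require strictly positive inputs.

```python
import math
from statistics import mean, pstdev
from typing import List


def standardise(xs: List[float]) -> List[float]:
    """Z-score: subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]


def min_max(xs: List[float]) -> List[float]:
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def log_transform(xs: List[float]) -> List[float]:
    """Natural log; every value must be strictly positive."""
    return [math.log(x) for x in xs]


def power_transform(xs: List[float], exponent: float = 0.5) -> List[float]:
    """Raise each value to a fixed power (0.5 is a square root, an arbitrary default)."""
    return [x ** exponent for x in xs]


def box_cox(xs: List[float], lam: float) -> List[float]:
    """Box-Cox with a given lambda: (x**lam - 1) / lam, or log(x) when lam is 0."""
    if lam == 0:
        return [math.log(x) for x in xs]
    return [(x ** lam - 1) / lam for x in xs]


# Hypothetical prices, used only to exercise the functions.
prices = [400_000.0, 550_000.0, 620_000.0, 900_000.0, 2_500_000.0]
scaled = min_max(prices)
z_scores = standardise(prices)
logged = log_transform(prices)
```

Standardization and min-max normalization are linear rescalings, so they change the scale of an attribute but not the shape of its distribution. The log, power, and Box-Cox transformations are non-linear and do change the shape.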

Further, observe and explain their effect, assuming we want to develop a linear model to predict the “price” using “Distance_to_school”, “travel_min_to_CBD”, and “Distance_to_Recreation_centre” attributes. The linear regression assumptions that you need to study in this task are Normality and Linearity.
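One rough way to put numbers on the two assumptions is to compare skewness (a proxy for normality) and the Pearson correlation (a proxy for linearity) before and after a transformation. The sketch below is an assumption-laden illustration, not a prescribed procedure. The helper names and the synthetic data are made up, and histograms, Q-Q plots, and scatter plots are the more usual tools for this kind of analysis.

```python
import math
from statistics import mean, pstdev
from typing import List


def skewness(xs: List[float]) -> float:
    """Population skewness; values near 0 suggest a roughly symmetric distribution."""
    mu, sigma = mean(xs), pstdev(xs)
    return mean(((x - mu) / sigma) ** 3 for x in xs)


def pearson_r(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; values near +1 or -1 suggest a linear relationship."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Synthetic data (not from the assignment): price decays exponentially with distance.
distance = [1.0, 2.0, 3.0, 4.0, 5.0]
price = [math.exp(14 - 0.3 * d) for d in distance]
log_price = [math.log(p) for p in price]

raw_skew, log_skew = skewness(price), skewness(log_price)
raw_r, log_r = pearson_r(distance, price), pearson_r(distance, log_price)
```

In this synthetic case the log transformation removes the right skew of the price and makes its relationship with distance exactly linear. Real data will be noisier, so the comparison should be read alongside the plots.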


Task 3: Documentation (20%)

The main focus of the documentation is the quality of your explanation for Task 2. As in the previous assignments, your notebook file should be properly formatted, with appropriate sections and subsections.

Notes:

1. The output CSV file must have the same columns as specified in the schema. Please note that output files that are not in the correct format, as defined in the integrated schema, won’t be marked.

2. If you decide not to calculate one of the required columns, you must still include that column in your final data frame, with ‘Null’ as the value in every row. Please note that output files that are not in the correct format, as specified in the integrated schema, won’t be marked.

3. No external data is allowed to calculate the values of the integrated schema. For example, to calculate the suburb, you can only use the provided shapefiles.

4. The radius of the earth is still 6371 km!

5. In Table 2, the numbers in the format (a%) after some of the column names are the marks allocated to those columns. For example, column “suburb” carries 21% of the total output mark of task 1.

6. For the Transfer_flag column, incorrect answers attract negative marks. For example, if 50% of your Transfer_flag values are correct and the other 50% are incorrect, then your score for that column is zero (0).
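Note 4 fixes the Earth's radius at 6371 km, which suggests a great-circle distance for the Distance_to_* columns. The sketch below uses the haversine formula as one common choice. The specification does not name a formula, so treat this as an assumption. The function names, the `places` dictionary layout, and the sample coordinates and ids are hypothetical.

```python
import math
from typing import Dict, Tuple

EARTH_RADIUS_KM = 6371.0  # radius stated in note 4


def haversine_km(lat1: float, lng1: float, lat2: float, lng2: float) -> float:
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lng2 - lng1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def closest(lat: float, lng: float,
            places: Dict[str, Tuple[float, float]]) -> Tuple[str, float]:
    """Return (id, distance_km) of the place nearest to (lat, lng).

    `places` maps a facility id to its (lat, lng); this layout is illustrative.
    """
    return min(
        ((pid, haversine_km(lat, lng, plat, plng))
         for pid, (plat, plng) in places.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical facilities and property location, for illustration only.
schools = {
    "school_A": (-37.80, 144.95),
    "school_B": (-37.90, 145.10),
}
school_id, distance_to_school = closest(-37.81, 144.96, schools)
```

The same `closest` helper could serve the school, train station, hospital, and recreation centre columns, since each pairs an id with a distance. A linear scan like this is simple but slow for many properties, so a spatial index may be worth considering on the full dataset.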