
Mid-term assessment

Data Wrangling (Preprocessing)

Assessment type:

Written report (PDF document) using R Markdown

Due date:

19th Sep at 17:00, Melbourne time

Weighting:

35%

Word limit:

Maximum 25 pages

Feedback mode:

Feedback will be provided using Canvas marking tools and general text comments.

You will work on this assessment individually.

Overview

This assessment allows you to apply the data preprocessing knowledge and skills learned in Modules 1-5.

To complete this assessment, imagine yourself in the role of data analyst at one of the following brands, or another brand that interests you:

Woolworths, a major Australian grocery chain (background)

Bunnings, a major hardware retailer (background)

Uber, a market leader in the ride-sharing sector (background)

A successful assessment requires you to generate and preprocess realistic synthetic data relevant to this brand. You will then create a report using R Markdown that explains the steps you took to complete each task and presents some summary analysis of the data.

It is important that the data you create appears to be realistic. Real data can be random and messy, with missing values and outliers. Therefore, you will need to insert some messiness into the data you generated. In addition, as data are often drawn from several sources, you will need to generate and combine at least two synthetic datasets.

Assessment criteria and weighting

Please see the marking rubric for the assessment criteria and their weighting.

Assessment process

This assessment is divided into four parts:

•     Generating synthetic data sets

•     Combining and pre-processing data sets

•     Creating your report using R Markdown

•     Recording a 3-5 minute video presentation

Assessment Instructions

Use the given R Markdown template to create the report. In your report, you must explain what you do in each step below, including the R code used and its output.

Step 1. Generate the data. Ensure you meet the following minimum requirements:

•     At least two synthetic data sets should be created, each with at least 100 rows, and 5-10 variables.

•     The synthetic data sets should include a common variable(s).

•     Each synthetic data set should include multiple data types (numeric, character, factor, etc).

•     At least one of the synthetic data sets must include correlated random data.

•     In each synthetic data set, at least 1 column must contain missing values (in approximately 5% of rows).
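The requirements above can be sketched in base R as follows. This is only an illustration under stated assumptions: the column names (`customer_id`, `member_tier`, `total_spend`, and so on), the seed, and the distributions are hypothetical choices, not part of the brief. Correlation is introduced here by making `total_spend` a linear function of `visits` plus noise, which is one of several possible approaches.

```r
set.seed(2024)  # fixed seed so the synthetic data are reproducible

n <- 120  # at least 100 rows per data set

# Data set 1: hypothetical customer profiles (5 variables, mixed types)
customers <- data.frame(
  customer_id = sprintf("C%03d", 1:n),                               # common key
  age         = sample(18:80, n, replace = TRUE),                    # integer
  state       = sample(c("VIC", "NSW", "QLD"), n, replace = TRUE),   # character
  member_tier = factor(sample(c("Bronze", "Silver", "Gold"), n, replace = TRUE)),
  joined      = as.Date("2020-01-01") + sample(0:1460, n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Data set 2: hypothetical spending summary with correlated columns
visits <- rpois(n, lambda = 12)
spend  <- round(25 + 8 * visits + rnorm(n, sd = 15), 2)  # linear in visits plus noise
purchases <- data.frame(
  customer_id  = sample(customers$customer_id),  # same keys, shuffled order
  visits       = visits,
  total_spend  = spend,
  pays_by_card = sample(c(TRUE, FALSE), n, replace = TRUE),  # logical
  satisfaction = sample(1:5, n, replace = TRUE),             # coded rating
  stringsAsFactors = FALSE
)

# Insert about 5% missing values into one column of each data set
customers$age[sample(n, round(0.05 * n))] <- NA
purchases$total_spend[sample(n, round(0.05 * n))] <- NA

head(customers)
head(purchases)

# Check that the two columns are in fact correlated
cor(purchases$visits, purchases$total_spend, use = "complete.obs")
```

Keeping `satisfaction` as a plain integer code is deliberate in this sketch: it leaves a variable that later needs conversion to a labelled, ordered factor.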

Step 2. Merge your synthetic data sets. Your resulting data set should include:

•     Multiple data types (numeric, character, factor, date etc.)

•     Variables suitable for data type conversions (e.g., from character/numeric to factor)

•     At least one factor variable that needs to be labelled and/or ordered.
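As a minimal sketch of the merge step, the example below joins two small hypothetical data frames on a shared key using base R's `merge()`. The data, the column names, and the choice of a left join (`all.x = TRUE`) are assumptions made for illustration; the join functions covered in Module 4 may differ.

```r
# Two small hypothetical data sets sharing the key `customer_id`
customers <- data.frame(
  customer_id = c("C001", "C002", "C003", "C004"),
  state       = c("VIC", "NSW", "VIC", "QLD"),
  tier_code   = c(1, 3, 2, 1),   # numeric code, to be converted to a factor later
  stringsAsFactors = FALSE
)
purchases <- data.frame(
  customer_id = c("C001", "C002", "C003", "C005"),
  total_spend = c(120.5, NA, 310.0, 75.2),
  last_visit  = c("2024-08-01", "2024-08-15", "2024-07-30", "2024-08-20"),
  stringsAsFactors = FALSE
)

# Left join: keep every customer; unmatched purchase columns become NA
combined <- merge(customers, purchases, by = "customer_id", all.x = TRUE)
combined
```

A left join keeps customer C004, who has no purchase record, and drops C005, who has no customer record. Whether that is the right behaviour depends on which data set you treat as the primary one.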

Step 3. Check the structure of combined data set and perform all necessary data type conversions.
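The following sketch shows one way to inspect a data frame and convert variable types in base R. The data frame `combined` and its columns are hypothetical, and the tier labels are an assumed coding scheme.

```r
# Hypothetical combined data set with types that need fixing
combined <- data.frame(
  customer_id = c("C001", "C002", "C003", "C004"),
  state       = c("VIC", "NSW", "VIC", "QLD"),
  tier_code   = c(1, 3, 2, 1),
  last_visit  = c("2024-08-01", "2024-08-15", "2024-07-30", "2024-08-20"),
  stringsAsFactors = FALSE
)

# Inspect dimensions and variable types before converting
dim(combined)
str(combined)

# Character -> factor (unordered categories)
combined$state <- factor(combined$state)

# Numeric code -> labelled, ordered factor
combined$tier <- factor(combined$tier_code,
                        levels  = c(1, 2, 3),
                        labels  = c("Bronze", "Silver", "Gold"),
                        ordered = TRUE)

# Character -> Date
combined$last_visit <- as.Date(combined$last_visit, format = "%Y-%m-%d")

# Confirm the conversions
str(combined)
levels(combined$tier)
```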

Step 4. Group by one of the categorical variables, and then generate summary statistics for one of the numeric variables. The summary statistics should include the mean, median, first and third quartiles, and the standard deviation.
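A grouped summary of this kind might look like the base R sketch below. The data frame `sales`, its columns, and the helper `summarise_spend()` are hypothetical names introduced for illustration. Missing values are excluded with `na.rm = TRUE`, and quartiles use R's default quantile method.

```r
# Hypothetical data: spend per customer, grouped by state
sales <- data.frame(
  state       = c("VIC", "VIC", "VIC", "VIC", "NSW", "NSW", "NSW", "NSW"),
  total_spend = c(100, 120, 140, 160, 80, 90, NA, 130),
  stringsAsFactors = FALSE
)

# Helper returning the five required statistics for one group
summarise_spend <- function(x) {
  c(
    mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    q1     = unname(quantile(x, 0.25, na.rm = TRUE)),
    q3     = unname(quantile(x, 0.75, na.rm = TRUE)),
    sd     = sd(x, na.rm = TRUE)
  )
}

# Split the numeric variable by group, summarise each piece, stack the results
by_state <- do.call(
  rbind,
  lapply(split(sales$total_spend, sales$state), summarise_spend)
)
by_state
```

The result is a matrix with one row per group and one column per statistic, which prints compactly in a knitted report.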

Step 5. Scan all variables for missing values. Use a suitable technique to deal with the missing values.
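One possible approach to scanning and handling missing values is sketched below in base R. The data frame `df` is hypothetical, and median imputation for numeric columns with mode imputation for the categorical column is only one of several reasonable techniques; the choice should be justified for your own data.

```r
# Hypothetical data with missing values in several columns
df <- data.frame(
  state       = c("VIC", "NSW", NA, "QLD", "VIC", "VIC"),
  age         = c(34, NA, 52, 41, 29, NA),
  total_spend = c(120.5, 80, 310, NA, 95.2, 150),
  stringsAsFactors = FALSE
)

# Scan: number of missing values per column, and the affected rows
colSums(is.na(df))
df[!complete.cases(df), ]

# Numeric columns: replace NA with the column median
for (col in c("age", "total_spend")) {
  df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
}

# Categorical column: replace NA with the most frequent category
mode_state <- names(which.max(table(df$state)))
df$state[is.na(df$state)] <- mode_state

# Re-scan to confirm nothing is missing
colSums(is.na(df))
```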

Step 6. Record a 3-5 minute video presentation that steps through the main points of your report. You may consider covering the steps you took to:

•     Create the data sets

•     Check the structure of the data sets and the types of variables

•     Combine/merge the data sets

•     Perform data type conversions

•     Generate summary statistics

•     Scan the data and deal with missing values

Important Note:

●    You must provide the R code with its output and explain everything that you do in each step. Failure to do this will result in a reduction in the mark. Check the report sections below and the marking rubric for more information.

Create the report using R Markdown

The assessment report must be completed using the R Markdown template provided here:

R Markdown Template

Note that this is an R Markdown notebook template. Information for using the R Markdown package can be found here. The R Markdown template must be updated with your name(s) and student number(s). You must use the headings and chunks provided in the template. You can add more chunks if required. Your report will be composed of the following sections.

Sections of the report:

Student details [YAML input]: Add the student's full name and student number.

1. Data generation [Plain text & R code & Output]: In this section, you must provide the R code with its output (e.g., the head of each data set) and explain everything that you do to generate your data sets.

2. Merge your synthetic data sets [Plain text & R code & Output]: Describe combining your data sets using a proper function from Module 4. Provide the R code with its output and explain everything that you do in this step.

3. Check the structure of combined data [Plain text & R code & Output]: Check the structure of the combined data and perform all necessary data type conversions using skills from Module 3. Provide the R code with its output and explain everything that you do in this step.

4. Generate summary statistics [Plain text & R code & Output]: Group by one of the categorical variables, and then generate summary statistics for one of the numeric variables. The summary statistics should include the mean, median, first and third quartiles, and the standard deviation. Provide the R code with its output and explain everything that you do in this step.

5. Scan variables for missing values [Plain text & R code & Output]: Scan all variables for missing values and use a suitable technique to deal with the missing values using skills from Module 5. Provide the R code with its output and explain everything that you do in this step.

Submission Format

●     Upload the report as one single file (PDF) via the assessment page in CANVAS.

●    The easiest way to produce a PDF file from the RMarkdown is to Run all R chunks, then Preview your notebook in HTML (by clicking Preview) → Open in Browser (Chrome) → Right-click on the report in Chrome → Click Print and Select the Destination Option to Save as PDF.

●    After creating your PDF file, check that your code and outputs are visible.

Referencing guidelines

Use RMIT Harvard referencing style for this assessment. You must acknowledge all the sources of information you have used in your assessments. Refer to the RMIT Easy Cite referencing tool to see examples and tips on how to reference in the appropriate style. You can also refer to the library referencing page for more tools such as EndNote, referencing tutorials, and referencing guides for printing.

Collaboration

You are permitted to discuss and collaborate on the assessment with other groups. However, the write-up of the report must be with your own allocated group effort. Assignments will be submitted through Turnitin, so if you’ve copied from other groups, it will be detected. It is your responsibility to ensure you do not copy or do not allow another group to copy your work. If plagiarism is detected, both groups will be responsible. It is good practice to never share assessment files with others. You should ensure you understand your responsibilities by reading the RMIT University website on academic integrity. Ignorance is no excuse.

Academic integrity and plagiarism

Academic integrity is about the honest presentation of your academic work. It means acknowledging the work of others while developing your own insights, knowledge, and ideas.

You should take extreme care that you have:

●    Acknowledged words, data, diagrams, models, frameworks, and/or ideas of others you have quoted (i.e., directly copied), summarised, paraphrased, discussed, or mentioned in your assessment through the appropriate referencing methods.

●     Provided a reference list of the publication details so your reader can locate the source if necessary. This includes material taken from internet sites.

If you do not acknowledge the sources of your material, you may be accused of plagiarism because you have passed off the work and ideas of another person, without appropriate referencing, as if they were your own.

RMIT University treats plagiarism as a very serious offense constituting misconduct. Plagiarism covers a variety of inappropriate behaviours, including:

●      Failure to properly document a source.

●      Copyright material from the internet or databases.

●      Collusion between students.

For further information on our policies and procedures, please refer to the University website.

Assessment Declaration

When you submit work electronically, you agree to the Assessment Declaration.

Extensions and Special Consideration

This course follows the RMIT University Assessment policy for extensions and special consideration. Information is available here. Ensure you understand these guidelines before applying.

Extensions will only be granted in accordance with the RMIT University Extension and Special Consideration Policy. No exceptions. Assessments submitted late will be penalized (see below for further details).

Late Submission of Assessment

Late submissions, without an approved extension or special consideration, will incur a penalty of 10% of the total mark per day for up to 5 days late (so the maximum late penalty is 50%). Submissions more than 5 days late are not accepted.
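The penalty arithmetic can be illustrated with a small R function. This is a sketch under assumptions: `late_penalty()` is a hypothetical helper, it takes a whole number of days (the policy text above does not say how part-days are counted), and it returns the penalty as a percentage of the total mark. Refer to the official policy for the authoritative rule.

```r
# Hypothetical helper: penalty (in % of the total mark) for a late submission.
# Assumes `days_late` is a whole number of days; part-days are not covered here.
late_penalty <- function(days_late) {
  stopifnot(is.numeric(days_late), length(days_late) == 1, days_late >= 0)
  if (days_late > 5) {
    return(NA_real_)  # more than 5 days late: submission is not accepted
  }
  10 * days_late      # 10% per day, so at most 50% at 5 days
}

late_penalty(0)  # on time: no penalty
late_penalty(3)  # three days late
late_penalty(5)  # maximum penalty
late_penalty(6)  # not accepted, returned as NA
```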

Penalty for Exceeding Maximum Number of 25 Pages

A penalty of 5% of the total mark will be applied for each extra page.

Assessment Marking Rubric

Criteria

To meet all requirements and receive full marks, you must satisfy the following criteria for each part:

Create realistic-looking synthetic data sets (20%)

•     Realistic-looking data sets supported by coding (both random and correlated data included).

•     Data sets met the minimum requirements.

•     Complete and clear description of synthetic data sets were provided.

•     R codes and the outputs were given.

Merge your synthetic data sets (15%)

•     Synthetic data sets successfully merged.

•     Clear and comprehensive explanation of coding provided.

•     R codes and the outputs were given.

Check structure of combined data (15%)

•     Complete inspection of data set and variables was given including:

•     Dimensions of the data frame were given.

•     Types of variables (i.e., character, numeric, integer, factor, and logical) were given.

•     If variables were not in the correct data type, proper type conversions were applied.

•     Levels of factor variables were checked, and they were renamed/rearranged if required.

•     R codes and the outputs were given.

Generate summary statistics (15%)

•     Summary statistics were grouped by a categorical variable and correctly generated.

•     Observations/insights provided into summary statistics.

•     R codes and the outputs were given.

Scan variables for missing values (15%)

•     A reliable method was used to scan data for, and deal with missing values.

•     Clear and comprehensive explanation of methods provided.

•     R codes and the outputs were given.

Overall quality of report and video presentation (3-5 min) (20%)

•     Professional looking report knitted to PDF Document.

•     R Markdown template is used.

•     Comprehensive and concise report with coherent structure, content and formatting.