Data Wrangling
Mid-term assessment
Data Wrangling (Preprocessing)
Assessment type: Written report (PDF document) using R Markdown
Due date: 19th Sep at 17:00, Melbourne time
Weighting: 35%
Word limit: Maximum 25 pages
Feedback mode: Feedback will be provided using Canvas marking tools and general text comments.
You will work on this assessment individually.
Overview
This assessment allows you to apply the data preprocessing knowledge and skills learned in Modules 1-5.
To complete this assessment, imagine yourself in the role of data analyst at one of the following brands, or another brand that interests you:
Woolworths, a major Australian grocery chain (background)
Bunnings, a major hardware retailer (background)
Uber, a market leader in the ride-sharing sector (background)
A successful assessment requires you to generate and preprocess realistic synthetic data relevant to this brand. You will then create a report using R Markdown that explains the steps you took to complete each task and presents a summary analysis of the data.
It is important that the data you create appear realistic. Real data can be random and messy, with missing values and outliers. Therefore, you will need to insert some messiness into the data you generate. In addition, as data are often drawn from several sources, you will need to generate and combine at least two synthetic data sets.
Assessment criteria and weighting
Please see the marking rubric for the assessment criteria and their weighting.
Assessment process
This assessment is divided into four parts:
• Generating synthetic data sets
• Combining and pre-processing data sets
• Creating your report using R Markdown
• Recording a 3-5 minute video presentation
Assessment Instructions
Use the given R Markdown template to create the report. In your report, you must explain what you do in each step below, including R codes used and their outputs.
Step 1. Generate the data. Ensure you meet the following minimum requirements:
• At least two synthetic data sets should be created, each with at least 100 rows, and 5-10 variables.
• The synthetic data sets should include a common variable(s).
• Each synthetic data set should include multiple data types (numeric, character, factor, etc).
• At least one of the synthetic data sets must include correlated random data.
• In each synthetic data set, at least one column must contain missing values (in approximately 5% of rows).
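The Step 1 requirements can be sketched in base R as follows. This is only an illustration under stated assumptions, not a model answer: the table names (`customers`, `survey`), the column names, the distributions and all parameter values are hypothetical and are not prescribed by the assessment.

```r
set.seed(123)  # fixed seed so the synthetic data are reproducible
n <- 200       # at least 100 rows are required

# Data set 1: hypothetical customer table (6 variables)
customers <- data.frame(
  customer_id = sprintf("C%03d", 1:n),  # common key shared with data set 2
  age         = round(runif(n, min = 18, max = 80)),
  state       = sample(c("VIC", "NSW", "QLD"), n, replace = TRUE),
  member      = sample(c("yes", "no"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Correlated random data: spend is a linear function of visits plus noise
customers$visits <- rpois(n, lambda = 8)
customers$spend  <- round(20 + 12 * customers$visits + rnorm(n, sd = 15), 2)

# Data set 2: hypothetical survey table (5 variables, mixed types)
survey <- data.frame(
  customer_id  = sample(customers$customer_id),  # same ids, shuffled order
  satisfaction = sample(1:5, n, replace = TRUE),
  channel      = factor(sample(c("online", "in-store"), n, replace = TRUE)),
  recommend    = sample(c(TRUE, FALSE), n, replace = TRUE),
  survey_date  = as.Date("2023-01-01") + sample(0:180, n, replace = TRUE)
)

# Insert missing values into roughly 5% of rows in one column per data set
customers$spend[sample(n, size = round(0.05 * n))] <- NA
survey$satisfaction[sample(n, size = round(0.05 * n))] <- NA

head(customers)
head(survey)

# Check that the intended correlation is present, ignoring the NA rows
cor(customers$visits, customers$spend, use = "complete.obs")
```

Building one variable from another plus random noise is one simple way to obtain correlated data; sampling from a multivariate normal distribution is an alternative.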
Step 2. Merge your synthetic data sets. Your resulting data set should include:
• Multiple data types (numeric, character, factor, date etc.)
• Variables suitable for the data type conversions (e.g., from character/numeric to factor)
• At least one factor variable that needs to be labelled and/or ordered.
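As a minimal sketch of Step 2, the example below joins two tiny hypothetical tables on a shared key using base R's `merge()`. The table and column names are assumptions made for illustration; the join functions taught in Module 4 may differ, and any of them that joins on the common variable would serve the same purpose.

```r
# Two tiny hypothetical data sets sharing the key column customer_id
customers <- data.frame(
  customer_id = c("C001", "C002", "C003", "C004"),
  state       = c("VIC", "NSW", "VIC", "QLD"),
  spend       = c(120.5, NA, 87.0, 240.3)
)
survey <- data.frame(
  customer_id  = c("C001", "C002", "C004", "C005"),
  satisfaction = c(4, 2, NA, 5),
  survey_date  = c("2023-03-01", "2023-03-04", "2023-04-12", "2023-05-20")
)

# Left join: keep every customer and attach survey answers where they exist.
# all.x = TRUE keeps unmatched rows of the first table (C003 gets NA values);
# rows found only in the second table (C005) are dropped.
combined <- merge(customers, survey, by = "customer_id", all.x = TRUE)
combined
```

The choice of join type matters: an inner join (the default of `merge()`) would silently drop customer C003, while a left join keeps that row and introduces additional missing values that must be handled later.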
Step 3. Check the structure of the combined data set and perform all necessary data type conversions.
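A minimal sketch of Step 3 is shown below, using a small hypothetical combined data set. The column names and the five satisfaction labels are assumptions for illustration only; the conversions your own data need will depend on how you generated them.

```r
# Hypothetical combined data set in which every column starts as
# character or numeric
combined <- data.frame(
  customer_id  = c("C001", "C002", "C003", "C004"),
  state        = c("VIC", "NSW", "VIC", "QLD"),
  satisfaction = c(4, 2, 5, 1),
  survey_date  = c("2023-03-01", "2023-03-04", "2023-04-12", "2023-05-20"),
  stringsAsFactors = FALSE
)

dim(combined)  # number of rows and columns
str(combined)  # variable types before conversion

# Character to factor (unordered categories)
combined$state <- factor(combined$state)

# Character to Date (ISO yyyy-mm-dd strings are parsed by default)
combined$survey_date <- as.Date(combined$survey_date)

# Numeric codes to a labelled, ordered factor
combined$satisfaction <- factor(
  combined$satisfaction,
  levels  = 1:5,
  labels  = c("very low", "low", "neutral", "high", "very high"),
  ordered = TRUE
)

str(combined)                  # variable types after conversion
levels(combined$satisfaction)  # confirm the level order
```

Declaring the factor as ordered lets comparisons such as `combined$satisfaction >= "high"` work as expected.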
Step 4. Group by one of the categorical variables, and then generate summary statistics for one of the numeric variables. The summary statistics should include the mean, median, first and third quartiles, and the standard deviation.
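The grouped summary in Step 4 could look like the sketch below, written in base R with a hypothetical `sales` table; the helper name `summarise_numeric`, the variables and the gamma distribution are assumptions for illustration. A `group_by()` and `summarise()` pipeline from dplyr would be an equally valid approach.

```r
set.seed(42)

# Hypothetical data: one categorical and one numeric variable
sales <- data.frame(
  state = rep(c("VIC", "NSW", "QLD"), each = 40),
  spend = round(rgamma(120, shape = 4, scale = 30), 2)
)

# Helper returning the five required statistics for one numeric vector.
# na.rm = TRUE means missing values are ignored rather than propagated.
summarise_numeric <- function(x) {
  c(
    mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    q1     = unname(quantile(x, 0.25, na.rm = TRUE)),
    q3     = unname(quantile(x, 0.75, na.rm = TRUE)),
    sd     = sd(x, na.rm = TRUE)
  )
}

# Split spend by state, summarise each group, then put groups in rows
stats <- t(sapply(split(sales$spend, sales$state), summarise_numeric))
round(stats, 2)
```

The result is a matrix with one row per group and one column per statistic, which makes it easy to compare groups side by side when writing up your observations.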
Step 5. Scan all variables for missing values. Use a suitable technique to deal with the missing values.
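One possible approach to Step 5 is sketched below on a small hypothetical data frame. Median and mode imputation are used here purely as an example of "a suitable technique"; they are assumptions of this sketch, and other methods (for instance removing incomplete rows or mean imputation) may suit your data better and should be justified in your report.

```r
# Hypothetical data frame with missing values in every column
df <- data.frame(
  state        = c("VIC", "NSW", NA, "QLD", "VIC", "VIC"),
  spend        = c(120.5, NA, 87.0, 240.3, NA, 95.1),
  satisfaction = c(4, 2, 5, NA, 3, 4),
  stringsAsFactors = FALSE
)

# Scan: number of missing values per variable, and the affected rows
na_counts <- colSums(is.na(df))
na_counts
df[!complete.cases(df), ]

# Numeric variables: replace NA with the median of the observed values
df$spend[is.na(df$spend)] <- median(df$spend, na.rm = TRUE)
df$satisfaction[is.na(df$satisfaction)] <-
  median(df$satisfaction, na.rm = TRUE)

# Categorical variable: replace NA with the most frequent category.
# table() ignores NA by default; which.max() picks the first level on ties.
state_mode <- names(which.max(table(df$state)))
df$state[is.na(df$state)] <- state_mode

colSums(is.na(df))  # confirm that no missing values remain
```

The median is less sensitive to outliers than the mean, which is why it is often preferred for skewed variables such as spending.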
Step 6. Record a 3-5 minute video presentation that steps through the main points of your report. Some points which you may consider covering are steps taken to:
• Create data sets
• Check the structure of data sets and types of variables
• Combine/merge data sets
• Perform data type conversion
• Generate summary statistics
• Scan data and deal with the missing values
Important Note:
● You must provide the R codes with outputs and explain everything that you do in each step. Failure to do this will result in a reduced mark. Check the report sections below and the marking rubric for more information.
Create the report using R Markdown
The assessment report must be completed using the R Markdown template provided here:
Note that this is an R Markdown notebook template. Information for using the R Markdown package can be found here. The R Markdown template must be updated with your name(s) and student number(s). You must use the headings and chunks provided in the template. You can add more chunks if required. Your report will be composed of the following sections.
Sections of the report:
Student details [YAML input]: Add the student's full name and student number.
1. Data generation [Plain text & R code & Output]: In this section, you must provide the R codes with outputs (i.e., head of data set) and explain everything that you do to generate your datasets.
2. Merge your synthetic data sets [Plain text & R code & Output]: Describe combining your data sets using a proper function from Module 4. Provide the R codes with outputs and explain everything that you do in this step.
3. Check the structure of combined data [Plain text & R code & Output]: Check the structure of combined data and perform all necessary data type conversions using skills from Module 3. Provide the R codes with outputs and explain everything that you do in this step.
4. Generate summary statistics [Plain text & R code & Output]: Group by one of the categorical variables, and then generate summary statistics for one of the numeric variables. The summary statistics should include the mean, median, first and third quartiles, and the standard deviation. Provide the R codes with outputs and explain everything that you do in this step.
5. Scan variables for missing values [Plain text & R code & Output]: Scan all variables for missing values and use a suitable technique to deal with the missing values using skills from Module 5. Provide the R codes with outputs and explain everything that you do in this step.
Submission Format
● Upload the report as one single file (PDF) via the assessment page in CANVAS.
● The easiest way to produce a PDF file from the RMarkdown is to Run all R chunks, then Preview your notebook in HTML (by clicking Preview) → Open in Browser (Chrome) → Right-click on the report in Chrome → Click Print and Select the Destination Option to Save as PDF.
● After creating your PDF file, check that your codes and outputs are visible.
Referencing guidelines
Use RMIT Harvard referencing style for this assessment. You must acknowledge all the sources of information you have used in your assessments. Refer to the RMIT Easy Cite referencing tool to see examples and tips on how to reference in the appropriate style. You can also refer to the library referencing page for more tools such as EndNote, referencing tutorials, and referencing guides for printing.
Collaboration
You are permitted to discuss and collaborate on the assessment with other groups. However, the write-up of the report must be your own allocated group's effort. Assignments will be submitted through Turnitin, so if you’ve copied from other groups, it will be detected. It is your responsibility to ensure that you do not copy another group's work and do not allow another group to copy yours. If plagiarism is detected, both groups will be responsible. It is good practice to never share assessment files with others. You should ensure you understand your responsibilities by reading the RMIT University website on academic integrity. Ignorance is no excuse.
Academic integrity and plagiarism
Academic integrity is about the honest presentation of your academic work. It means acknowledging the work of others while developing your own insights, knowledge, and ideas.
You should take extreme care that you have:
● Acknowledged words, data, diagrams, models, frameworks, and/or ideas of others you have quoted (i.e., directly copied), summarised, paraphrased, discussed, or mentioned in your assessment through the appropriate referencing methods.
● Provided a reference list of the publication details so your reader can locate the source if necessary. This includes material taken from internet sites.
If you do not acknowledge the sources of your material, you may be accused of plagiarism because you have passed off the work and ideas of another person, without appropriate referencing, as if they were your own.
RMIT University treats plagiarism as a very serious offense constituting misconduct. Plagiarism covers a variety of inappropriate behaviours, including:
● Failure to properly document a source.
● Copyright material from the internet or databases.
● Collusion between students.
For further information on our policies and procedures, please refer to the University website.
Assessment Declaration
When you submit work electronically, you agree to the Assessment Declaration.
Extensions and Special Consideration
This course follows the RMIT University Assessment policy for extensions and special consideration. Information is available here. Ensure you understand these guidelines before applying.
Extensions will only be granted in accordance with the RMIT University Extension and Special Consideration Policy. No exceptions. Assessments submitted late will be penalized (see below for further details).
Late Submission of Assessment
Late submissions, without an approved extension or special consideration, will incur a penalty of 10% of the total mark per day for up to 5 days late (so the maximum late penalty is 50%). Submissions more than 5 days late are not accepted.
Penalty for Exceeding Maximum Number of 25 Pages
A penalty of 5% of the total mark will be applied for each extra page.
Assessment Marking Rubric
Criteria
To meet all requirements and get the full point, you must complete the following criteria for each part:
Create realistic-looking synthetic data sets (20%)
• Realistic-looking data sets supported by coding (both random and correlated data included).
• Data sets met the minimum requirements.
• Complete and clear description of synthetic data sets were provided.
• R codes and the outputs were given.
Merge your synthetic data sets (15%)
• Synthetic data sets successfully merged.
• Clear and comprehensive explanation of coding provided.
• R codes and the outputs were given.
Check structure of combined data (15%)
Complete inspection of data set and variables was given including:
• Dimensions of the data frame were given.
• Types of variables (i.e., character, numeric, integer, factor, and logical) were given.
• If variables were not in the correct data type, proper type conversions were applied.
• Levels of factor variables were checked, and they were renamed/rearranged if required.
• R codes and the outputs were given.
Generate summary statistics (15%)
• Summary statistics were grouped by a categorical variable and correctly generated.
• Observations/insights provided into summary statistics.
• R codes and the outputs were given.
Scan variables for missing values (15%)
• A reliable method was used to scan data for, and deal with missing values.
• Clear and comprehensive explanation of methods provided.
• R codes and the outputs were given.
Overall quality of report and video presentation (3-5 min) (20%)
• Professional looking report knitted to PDF Document.
• R Markdown template is used.
• Comprehensive and concise report with coherent structure, content and formatting.
2023-09-18