DTS002TC Essentials of Big Data

Coursework 2 (Individual Assessment)

Due: 5 pm China time (UTC+8 Beijing) on Fri 26th May 2023

Weight: 50%

Maximum score: 100 marks (100 individual marks)

Assessed learning outcomes:

E. Demonstrate the ability to write code to obtain numerical solutions to mathematical problems.

F. Demonstrate the ability to display computational results in tabulated or graphical forms.

Assessment tasks:

Overview

Employee attrition analysis is a typical big data prediction and analysis problem. Use the MATLAB programming techniques you have learnt to complete the specified data analysis and display tasks.

Submission format instructions

You need to submit three documents, EmployeeAttrition.m, Attrition_result.csv and ID_report.pdf, via Learning Mall Online to the correct drop box.

EmployeeAttrition.m: includes the MATLAB code that can be run directly.

Attrition_result.csv: blind test output.

ID_report.pdf: includes the MATLAB code, outputs, and figures in your report.

Data introduction

You need to download the raw data sets named Attrition_train.csv and Attrition_test.csv from LMO.

The data set includes 13 columns of employee data. The meaning of the data in each column is as follows:

Column Index   Value name                Instance
1              Age                       41
2              Department                Sales
3              DistanceFromHome          1
4              Education                 2
5              EducationField            Life Sciences
6              EnvironmentSatisfaction   2
7              JobSatisfaction           4
8              MaritalStatus             Single
9              MonthlyIncome             5993
10             NumCompaniesWorked        8
11             WorkLifeBalance           1
12             YearsAtCompany            6
13             Attrition                 Yes
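As an illustration of how the 13 columns above might be read into MATLAB, here is a minimal sketch. It writes a tiny stand-in CSV so that it runs without the LMO files. The file name attrition_demo.csv is hypothetical, the first data row is the example instance from the table, and the second row is invented. It assumes the real files have a header row with these column names, which the brief does not confirm.

```matlab
% Build a two-row stand-in CSV (row 1: instance from the table, row 2: invented).
header = "Age,Department,DistanceFromHome,Education,EducationField," + ...
         "EnvironmentSatisfaction,JobSatisfaction,MaritalStatus," + ...
         "MonthlyIncome,NumCompaniesWorked,WorkLifeBalance,YearsAtCompany,Attrition";
rows = ["41,Sales,1,2,Life Sciences,2,4,Single,5993,8,1,6,Yes"
        "35,Research,8,3,Medical,3,2,Married,4200,1,3,7,No"];
demoFile = fullfile(tempdir, "attrition_demo.csv");   % hypothetical file name
fid = fopen(demoFile, "w");
fprintf(fid, "%s\n", join([header; rows], newline));
fclose(fid);

% Read the CSV into a table; the header row supplies the variable names.
data = readtable(demoFile, "Delimiter", ",", "TextType", "string", ...
                 "ReadVariableNames", true);

% Text columns are easier to group on once they are categorical.
data.Department = categorical(data.Department);
data.Attrition  = categorical(data.Attrition);

% Individual columns can then be assigned to their own variables.
age        = data.Age;
department = data.Department;
attrition  = data.Attrition;
disp(data(:, ["Age", "Department", "Attrition"]))
```

For the real data, the same readtable call would be pointed at Attrition_train.csv in the current folder.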

Task 1 Data analysis (40 marks)

In big data analysis, we first load a large amount of historical data and then conduct simple data analysis through basic statistics and comparison. Please complete the following steps with big data statistics and analysis thinking, and use MATLAB to view the results of each step.

1. Data preparation: read in the Attrition_train.csv file from the current folder and assign each column's data to a variable. (5 marks)

2. Salary analysis: calculate the average MonthlyIncome of all the employees of the company. (5 marks)

3. Seniority analysis: calculate the maximum value of YearsAtCompany for each Education level. (5 marks)

4. Department population analysis: calculate the number of persons in each department. (5 marks)

5. Department seniority analysis: calculate the average WorkLifeBalance of each department. (5 marks)

6. Department commuting analysis: calculate the average DistanceFromHome of each department. (5 marks)

7. Departmental attrition status: calculate the attrition rate of each department. (5 marks)

8. Result plot: use the bar function to plot the results of steps 4-7 in a 2 × 2 subplot grid and save the image as “fig.png”. (5 marks)
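The grouped statistics and the 2 × 2 bar plot in the steps above could be sketched as follows. This is only one possible approach, using findgroups and splitapply. Every value in the toy data is invented so that the block runs on its own, and the figure is written to the temporary folder here, whereas the brief asks for fig.png in your working folder.

```matlab
% Toy stand-in for Attrition_train.csv; every value here is invented.
Department       = categorical(["Sales"; "Sales"; "HR"; "HR"; "R&D"; "R&D"]);
Education        = [2; 3; 2; 1; 3; 3];
WorkLifeBalance  = [1; 3; 2; 4; 3; 3];
DistanceFromHome = [1; 9; 4; 8; 2; 3];
MonthlyIncome    = [5993; 4000; 3000; 3500; 6000; 5000];
YearsAtCompany   = [6; 2; 10; 1; 4; 7];
Attrition        = categorical(["Yes"; "No"; "No"; "No"; "Yes"; "Yes"]);

% Step 2: company-wide average income.
avgIncome = mean(MonthlyIncome);

% Step 3: maximum YearsAtCompany for each Education level.
[eduGroup, eduLevel] = findgroups(Education);
maxYearsByEdu = splitapply(@max, YearsAtCompany, eduGroup);

% Steps 4-7: one statistic per department (groups follow category order).
[deptGroup, deptName] = findgroups(Department);
headcount     = splitapply(@numel, Department, deptGroup);
avgBalance    = splitapply(@mean, WorkLifeBalance, deptGroup);
avgDistance   = splitapply(@mean, DistanceFromHome, deptGroup);
attritionRate = splitapply(@(a) mean(a == "Yes"), Attrition, deptGroup);

% Step 8: 2-by-2 grid of bar charts, saved as a PNG file.
fig    = figure("Visible", "off");
stats  = {headcount, avgBalance, avgDistance, attritionRate};
titles = ["Headcount", "Avg WorkLifeBalance", "Avg DistanceFromHome", "Attrition rate"];
for k = 1:4
    subplot(2, 2, k);
    bar(deptName, stats{k});
    title(titles(k));
end
figFile = fullfile(tempdir, "fig.png");   % temp folder used only for this sketch
saveas(fig, figFile);
close(fig);
```

Here the attrition rate is taken to be the fraction of rows in each department whose Attrition value is "Yes", which is an interpretation rather than a definition given in the brief.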


Task 2 Main reason prediction (30 marks)

In human resources analysis, the traditional model aims to find out the biggest reason for personnel attrition. For example, some researchers believe that salary is the main reason for attrition, while others believe that the work environment may be the main reason. Please complete the following steps to find the possible main reason for the attrition of the company.

1. Data preparation: select 80% of the Age data, in sequence, as the training input data and the remaining 20% as the test input data. Similarly, select 80% of the Attrition data as the training output data and the remaining 20% as the test output data. (5 marks)

2. Model training: use the training input data and training output data to train a model with a linear regression algorithm. (5 marks)

3. Model test: use the test input data and test output data to verify the accuracy of the linear regression model. (5 marks)

4. Multi-dimensional analysis: construct a linear regression model and verify its accuracy for DistanceFromHome (5 marks) and for MonthlyIncome (5 marks).

5. Compare the model accuracy for Age, DistanceFromHome, and MonthlyIncome, and find the most relevant factor. (5 marks)
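A minimal sketch of the split, fit, and test steps above, for a single predictor, might look like this. The data are invented, "in sequence" is assumed to mean the first 80% of rows, Attrition is assumed to be coded as 1 for Yes and 0 for No, and predictions are thresholded at 0.5 to obtain an accuracy. The brief does not define how accuracy should be measured for a regression output, so that last choice in particular is an assumption.

```matlab
% Invented toy data: one predictor and a Yes/No outcome coded as 1/0.
age       = [25; 31; 45; 52; 29; 38; 47; 23; 55; 34];
attrition = [ 1;  1;  0;  0;  1;  0;  0;  1;  0;  1];

% Step 1: sequential 80/20 split (first 80% for training, rest for testing).
nTrain = floor(0.8 * numel(age));
xTrain = age(1:nTrain);        yTrain = attrition(1:nTrain);
xTest  = age(nTrain+1:end);    yTest  = attrition(nTrain+1:end);

% Step 2: least-squares fit of y = b(1) + b(2)*x with the backslash operator.
b = [ones(nTrain, 1), xTrain] \ yTrain;

% Step 3: predict on the test rows and threshold at 0.5 (an assumption).
yHat     = [ones(numel(xTest), 1), xTest] * b;
yPred    = double(yHat >= 0.5);
accuracy = mean(yPred == yTest);
fprintf("slope = %.4f, test accuracy = %.2f\n", b(2), accuracy);
```

Repeating the same three steps with DistanceFromHome or MonthlyIncome in place of age gives one accuracy per predictor, which can then be compared.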

Task 3 Multi-dimensional classification (30 marks)

Machine learning algorithms can analyse many dimensions of the data jointly, which often leads to better results than simple statistics. Please complete the following steps to establish a machine learning model for employee attrition and output the test results.

1. Data preparation: randomly select 80% of the data, using all 12 input dimensions, as the training input data and the remaining 20% as the test input data. (5 marks)

2. Model training: use the training input data with the KNN algorithm to train the classification model. (5 marks)

3. Model test: use the test input data to verify the trained classification model and check the accuracy rate. (5 marks)

4. Model tuning: traverse the K of the KNN algorithm from 2 to 10 to find the K with the best accuracy. (5 marks)

5. Test data load: read the data in the given Attrition_test.csv file. (5 marks)

6. Blind test output: use the best KNN classification model to predict the attrition of each employee and save the final results in the Attrition_result.csv file. (5 marks)
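The random split, K search, and result export above could be sketched as below. The sketch assumes the Statistics and Machine Learning Toolbox is available (for fitcknn and predict), and it uses invented two-feature numeric data in place of the 12 real input columns. The standardisation option, the two "blind" rows, and the single Attrition output column are all assumptions, since the brief does not show the layout of Attrition_result.csv. Text columns such as Department would also need some numeric encoding or other treatment before use, which is not shown here.

```matlab
rng(0);                                   % reproducible random split

% Invented two-feature toy data with two well-separated classes.
X = [randn(20, 2); randn(20, 2) + 4];
y = categorical([repmat("No", 20, 1); repmat("Yes", 20, 1)]);

% Step 1: random 80/20 split via a random permutation of the row indices.
n        = size(X, 1);
idx      = randperm(n);
nTrain   = round(0.8 * n);
trainIdx = idx(1:nTrain);
testIdx  = idx(nTrain+1:end);

% Steps 2-4: train a KNN model for K = 2..10 and keep the most accurate one.
bestK = NaN;  bestAcc = -Inf;
for k = 2:10
    mdl = fitcknn(X(trainIdx, :), y(trainIdx), ...
                  "NumNeighbors", k, "Standardize", true);
    acc = mean(predict(mdl, X(testIdx, :)) == y(testIdx));
    if acc > bestAcc                      % ties keep the smallest K
        bestAcc = acc;  bestK = k;  bestMdl = mdl;
    end
end
fprintf("best K = %d, test accuracy = %.2f\n", bestK, bestAcc);

% Steps 5-6: predict unseen rows and save them to a CSV file.
XBlind  = [0 0; 4 4];                     % stand-in for Attrition_test.csv rows
result  = table(predict(bestMdl, XBlind), 'VariableNames', {'Attrition'});
outFile = fullfile(tempdir, "Attrition_result.csv");   % temp folder for this sketch
writetable(result, outFile);
```

With a single random split, the best K can change from one seed to the next, so the selected value is only as stable as the test set is representative.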

Sample Output

Sample of Attrition_result.csv

Marking Criteria

The following criteria will be used to assess the Coursework 2 (individual) assignment.

Outstanding:

Correct output, correct variable type usage, good naming rules, good memory control, strong semantics and readability.

Appropriate:

Correct output, correct variable type usage, good naming rules, poor memory control, poor semantics and readability.

Needs improvement:

Correct output, good naming rules, wrong variable type usage, poor memory control, poor semantics and readability.

Hard to understand:

Correct output, poor naming rules, wrong variable type usage, poor memory control, poor semantics and readability.

No submission or missing section:

No submission or missing section, including code and report.

Steps   Basis of marking                            Marks
1       Code quality and implementation results     5
        · Outstanding: 5
        · Appropriate: 4
        · Needs improvement: 3
        · Hard to understand: 2
        · No submission or missing section: 0
2       Code quality and implementation results     5
        · Outstanding: 5
        · Appropriate: 4
        · Needs improvement: 3
        · Hard to understand: 2
        · No submission or missing section: 0
3       Code quality and implementation results     5
        · Outstanding: 5
        · Appropriate: 4
        · Needs improvement: 3
        · Hard to understand: 2
        · No submission or missing section: 0
4       Code quality and implementation results     5
        · Outstanding: 5
        · Appropriate: 4
        · Needs improvement: 3
        · Hard to understand: 2
        · No submission or missing section: 0
5       Code quality and implementation results     5
        · Outstanding: 5
        · Appropriate: 4
        · Needs improvement: 3
        · Hard to understand: 2
        · No submission or missing section: 0
6       Code quality and implementation results     5
        · Outstanding: 5
        · Appropriate: 4
        · Needs improvement: 3
        · Hard to understand: 2
        · No submission or missing section: 0
7       Code quality and implementation results
        · Outstanding: 5
        · Appropriate: 4
        · Needs improvement: 3
        · Hard to understand: 2
        · No submission or missing section: 0