Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

STQD6414

Data Mining

Assignment 1

You are provided Set1_assignment1_HRdata.csv dataset. The dataset contains the details of all the employees in a company.

1. It seems like column deptid_gender has been merged wrongly. Split this column into deptid and gender.

2. Combine birthdate, birthmonth, and birthyear into a single column called

birthdate. Is there any missing value in this newly created column?

3. Convert the column you created in 2 to a proper date format. Oops! The conversion introduces some missing values. Why does this happen?

4. Remove the rows with the missing birthdate values from the data.

5. Check also the format of other date columns.

6. Create a new column called age using the column you created in 2. Calculate age as of 31 Oct 2022. Make sure the unit is in years (1 year = 365.25 days).

7. Create a new column called status that consists of either "active" or "inactive" using the exitdate. Note that for an active employee, the exitdate is NA.

8. Create a new column called lengthofservice using the enterdate and exitdate. This column measures how many years an employee has worked for the company. Note that for an active employee, calculate the lengthofservice as  of 31 Oct 2022.

9. Discretise lengthofservice to three groups of equal frequency and store the output as a new column called losgroup. Make sure your column is in the form of ordered intervals: [0,13.5], (13.5,25.5], (25.5,46.6].

10. Now use the Set1_assignment1_FINdata.csv to add each employee's salary to the table. Note that the salaries reported in the dataset are different for different years,   so make sure to filter for the latest year.

11. Subset your data by retaining only the active employees and the following columns: staffid, deptid, status, gender, namefirst, namelast, losgroup, and salary.

12. For the data frame created in 11, plot relative frequency histograms of the salary

with the following breaks:

(a) breaks=seq(0,70000,by=10000)

(b) breaks=seq(0,70000,by=5000)

Which histogram describes the variable best? Explain.

13. For the data frame created in 11, plot boxplots of the salary for each interval in the

losgroup. Describe your plot.