Python, R and data structures  Group Coursework 3

Date set: 29/01/23

Date to be submitted: 20/02/23

Introduction

Again, apologies for not getting this up before Christmas due to illness, but hopefully the three weeks at the start of term make it easier to carry out the work than having to find some time during revision.

This coursework is slightly different to the ones you have done in Python as we will assume the user is happy to type in functions. The main goal instead is to make it easier for them to achieve what they require by creating user-defined functions rather than having to use R syntax.

I am also assuming that your functions in part iii) will interact with the data frame stored on the hard drive/SSD/cloud as this will make the form of the functions easier for the user to type in (as they will not need to pass over a data frame and receive back a data frame).

There are lots of mini tasks in this coursework so although there are more parts some of the solutions can be only a few lines and there is less need to think about various data structures compared to the Python coursework.

Question

A company runs a death benefit scheme for its employees. The scheme is free to join but to be part of it the employee must take a brief medical check-up where their height (in cm), weight (in kg) and smoking status are taken. While they are part of the scheme, they can always take part in further medical checks, but this is not compulsory. However, if a check-up does occur, the new weight and smoking status are recorded and, where necessary, the data is updated.

The scheme only allows employees to join on 1st January in any year.  When they join, their current age on the 1st January is stored in this particular benefits system with the year they joined the scheme.

New entrants to the scheme can be aged between 21 and 55 inclusive. It is assumed that employees will have a BMI (which is calculated as weight (kg) / [height (m)]²) between 15 and 45. Note that height is recorded in cm, so it must be divided by 100 before it is squared. Note also that BMI is not included in the data file that you receive initially.

The smoking status of an employee is one of: Never smoked, Smoker, Vaper, Ex-smoker. Note that if a person was a Vaper and quits then they would be classed as an Ex-smoker.

The data for 20 new entrants to the scheme as of 1/1/23 has been taken and the initial data can be found on Moodle in the file dataraw.csv.

The code writing/file management can be seen as being in three parts and I expect them to be submitted as three separate files of code.

i) Data verification using data held in dataraw.csv

While the file on Moodle (dataraw.csv) is small, it is assumed to be only a sample. The columns included in the file are the employee number, the surname of the employee, the year they joined the scheme (which is 2023 for all the employees in this file), the age of the employee when they joined the scheme, the height (in cm), the weight (in kg) and the smoker status of the employee.

Your first task is to write code that verifies that the data meets the criteria stated above: the employee number is unique, the calculated BMI is between 15 and 45, the age is between 21 and 55 and the smoker status is a valid status. Of course, you can do this by sight with this small dataset but the idea (i.e. what you get marks for) is to write code that will allow the user to easily identify entries that have suspect data and report back to the main data controllers with the queries they have.

I am assuming that the most sensible way of doing this is to read the file into a suitable data frame and then generate a small error report that details the problem records and why there is a suspected problem. Again, you want this to be as easy to use as possible for the user and the output should be easy to read and understandable. In particular, when running R, results can be generated but hard to find within the code that has been run, so think about when and how you will print the problem records you have identified. Note that the user, i.e. me, is expected to change the file location in your code but it should be clear where this is.
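As an illustration only, the checks described above could be gathered into a single report along the following lines. This is a minimal sketch, not a model answer: the column names (emp_no, age, height, weight, smoker), the helper name find_problems and the four made-up rows are all assumptions, and the real headers in dataraw.csv may differ.

```r
# Hypothetical column names; the real headers in dataraw.csv may differ.
valid_status <- c("Never smoked", "Smoker", "Vaper", "Ex-smoker")

find_problems <- function(members) {
  bmi <- members$weight / (members$height / 100)^2   # height is stored in cm
  # One logical vector per check: TRUE marks a suspect record.
  checks <- list(
    "Duplicate employee number"  = duplicated(members$emp_no) |
                                   duplicated(members$emp_no, fromLast = TRUE),
    "Age outside 21-55"          = is.na(members$age) | members$age < 21 | members$age > 55,
    "BMI outside 15-45"          = is.na(bmi) | bmi < 15 | bmi > 45,
    "Unrecognised smoker status" = !(members$smoker %in% valid_status)
  )
  # Build one report row per (record, failed check) pair.
  report <- do.call(rbind, lapply(names(checks), function(reason) {
    bad <- which(checks[[reason]])
    data.frame(row = bad, emp_no = members$emp_no[bad],
               problem = rep(reason, length(bad)), stringsAsFactors = FALSE)
  }))
  report[order(report$row), ]
}

# Small made-up sample standing in for read.csv("dataraw.csv").
sample_members <- data.frame(
  emp_no = c("E1", "E2", "E2", "E4"),
  age    = c(30, 60, 40, 25),
  height = c(175, 180, 165, 170),
  weight = c(70, 80, 60, 150),
  smoker = c("Smoker", "Vaper", "Sometimes", "Never smoked"),
  stringsAsFactors = FALSE
)
problems <- find_problems(sample_members)
print(problems, row.names = FALSE)   # printed last so it is easy to find in the console
```

Printing the report as the final statement of the script is one way of making sure the problem records are not buried among earlier output.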

ii) Post data verification using the data held in dataclean.csv to build and save the initial data frame

It is assumed that you have fed back the queries you found in part i), the main data people have fixed the problems and have now supplied you with a cleaned-up data file (dataclean.csv).  With this file you need to build your initial working data frame.  To do this you need to add:

a)   A column that contains the current BMI of the employee should be added. In addition, having BMI as only a numeric value is not as useful as also holding the category that this defines. One set of definitions currently in use is:

•    Less than 18.5 – Underweight

•    Between 18.5 and 25 (though not including 25) – healthy weight

•    Between 25 and 30 (though not including 30) – overweight

•    Between 30 and 40 (though not including 40) – obese

•    40 and over – severely obese

Add an additional column for the BMI category and add the correct category to each policyholder. Assuming that you are now using the dataclean.csv file, you should find that all your calculated BMIs here are within the acceptable range defined above.

b)   To be able to keep track of the employees who are members of the scheme (see part iii below) the data frame needs the following five columns to be added:

•    Current Age

•    Age at withdrawal

•    Year of withdrawal

•    Age at death

•    Year of death

For the first of these columns, the current age will be the age of the employee when they joined the scheme. For the other four columns, as all the employees at the moment are current employees in the scheme, these columns should contain only the values NA.

c)   Once the data frame has been completed it should be saved as a csv file. Again, the user is expected to change the file name in the code, but make it clear where this is. It is up to them to make sure they do not write over an existing file i.e. this is not your worry and doesn’t need to be tested.
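Steps a) to c) could be sketched roughly as follows. This is a hedged illustration under stated assumptions: the column names, the helper bmi_category, the three made-up rows and the output location are all hypothetical, and the category labels follow the list in a).

```r
# Made-up rows standing in for read.csv("dataclean.csv"); column names are hypothetical.
members <- data.frame(
  emp_no      = c("E1", "E2", "E3"),
  year_joined = 2023,
  age_joined  = c(34, 51, 27),
  height      = c(170, 182, 160),   # cm
  weight      = c(50, 85, 105),     # kg
  stringsAsFactors = FALSE
)

# a) BMI and its category. right = FALSE makes every interval [lower, upper),
#    so a BMI of exactly 25 falls in "Overweight", matching the definitions above.
bmi_category <- function(bmi) {
  cut(bmi,
      breaks = c(-Inf, 18.5, 25, 30, 40, Inf),
      labels = c("Underweight", "Healthy weight", "Overweight",
                 "Obese", "Severely obese"),
      right  = FALSE)
}
members$bmi          <- members$weight / (members$height / 100)^2
members$bmi_category <- bmi_category(members$bmi)

# b) Tracking columns: current age starts as age at joining, the rest are NA.
members$current_age     <- members$age_joined
members$age_withdrawal  <- NA
members$year_withdrawal <- NA
members$age_death       <- NA
members$year_death      <- NA

# c) Save. The user edits this one line to choose the output file.
out_file <- file.path(tempdir(), "members.csv")   # hypothetical location
write.csv(members, out_file, row.names = FALSE)

reloaded <- read.csv(out_file, stringsAsFactors = FALSE)
```

One point worth knowing: a column that contains only NA is read back by read.csv as a logical column. It is converted to numeric as soon as a number is assigned into it.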

iii) Maintenance of the file/data

The company will want to maintain and analyse the data as time progresses. To help with this they want you to write some code that allows them to use user-defined functions to carry out the following tasks rather than them having to use built-in R functions and the standard syntax. (Remember to give your functions sensible names!) Unlike our Python courseworks, we will assume the user will enter all the necessary information in the arguments of the relevant function (if arguments are needed). The following functions are required:

a)   A function that will show the current state of the data frame.

b)   A function that will show the employee numbers and names of the employees that are currently alive in the data frame.

c)    A function that is run at the start of the new year that ages the employees who are still alive by adding one to their current age.

d)   After the start of the new year function (c) has run, but before any of the following functions have run, the user can add new members. The function will require the data: employee number, employee surname, age they joined, height and weight. Before adding the entered values to the data frame, your code should make sure that the same checks as applied in part i) are carried out i.e. that the employee number is unique, that the age is between 21 and 55 inclusive and that the calculated BMI is between 15 and 45. If any of these checks fail the user should be informed and the data isn’t added. Otherwise, the missing data e.g. BMI should be calculated/added and the new employee should be attached. Note that you can calculate the current year with data in your data frame (ignoring the possibility that all members of the data frame are no longer active). Once any of the below functions (e-j) have run, the user cannot add any new members until the new year function above has been run again.

e)   A function that allows the user to record the deaths of any employees. This function will be able to take one or more employee numbers. If any of the employee numbers do not exist, then the user should be told; if the employee number relates to an employee who is already recorded as dead or withdrawn then the user should be told. For valid employee numbers i.e., ones relating to employees who are currently alive and part of the scheme, their record will be altered so that the current age is now recorded as the age at death and the current year is also calculated and recorded as the year of death. The age in the current age column is now switched to NA to indicate that the employee is no longer an active member of the scheme.

f)   A function that allows the user to record the withdrawals of any employees. This function will be able to take one or more employee numbers. If any of the employee numbers do not exist, then the user should be told; if the employee number relates to an employee who is already recorded as dead or withdrawn then the user should be told. For valid employee numbers i.e., ones relating to employees who are currently alive and part of the scheme, their record will be altered so that the current age is now recorded as the age at withdrawal and the current year is also calculated and recorded as the year of withdrawal. The age in the current age column is now switched to NA to indicate that the employee is no longer an active member of the scheme.

g)   A function that allows the weight of an employee to be updated. This will also update the BMI, make the standard check that it is between 15 and 45, and update the BMI category.

h)   A function that allows the smoking status of an employee to be updated. This function will only update the record to one of the recognised statuses listed above, and an employee cannot be moved to Never smoked.

i)    A function that summarises the ages of the population by giving the mean, median and standard deviation of the age of current policyholders, age at death of deceased policyholders and age at withdrawal of withdrawn policyholders. This is either for all policyholders or the user can ask for it to be broken down by smoking status or BMI grouping (they are only allowed to pick one of these at any time).

j)    A function that saves data frames that are subsets of the main data frame. These data frames will be split by either smoker status or BMI grouping. So for BMI five data frames may be produced (though some may be null) and for smoker status there will be four data frames saved in separate files (you choose the file names).
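To make task i) more concrete, one possible shape for the age summary is sketched below. It is only an illustration: the function names age_stats and age_summary, the column names and the demo rows are assumptions, and for simplicity the data frame is passed in here rather than read from the saved file.

```r
# Summary statistics for one age column, ignoring NA (members not in that state).
age_stats <- function(x) {
  x <- x[!is.na(x)]
  c(n = length(x), mean = mean(x), median = median(x), sd = sd(x))
}

# by = NULL summarises everyone; otherwise by names ONE grouping column.
age_summary <- function(members, by = NULL) {
  cols <- c("current_age", "age_death", "age_withdrawal")   # hypothetical names
  groups <- if (is.null(by)) list(All = members) else split(members, members[[by]])
  # One small table per group: rows are the age columns, columns are the statistics.
  lapply(groups, function(g) t(sapply(g[cols], age_stats)))
}

# Made-up data: two active smokers, one deceased vaper, one withdrawn vaper.
demo <- data.frame(
  smoker         = c("Smoker", "Smoker", "Vaper", "Vaper"),
  current_age    = c(30, 50, NA, NA),
  age_death      = c(NA, NA, 60, NA),
  age_withdrawal = c(NA, NA, NA, 45),
  stringsAsFactors = FALSE
)
print(age_summary(demo))
print(age_summary(demo, by = "smoker"))
```

Groups with no members in a given state show NaN or NA for the statistics, and the standard deviation of a single value is NA, so a real implementation may want to report those cases in friendlier words.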

Rules

For all three parts, I will load your data into the standard R console and run all the code for that part i.e. I will run part i) on its own, then load in part ii) and run that, etc.

For part i), all I expect to need to do is to change the file location in the code before running it and I expect I will get a summary of the data that looks to be in error from the file with the errors clearly laid out.

For part ii), all I expect to need to do is to change the file location for the files to be read in and written out, but the code should just run and create and save the data frame.

For part iii) I again expect to change the file names. Note that as there are a few output files in task j) above you may want to have a directory path (e.g. “U:\\data\\”) and a file name separately so all I need to do is change the directory path if I like the file names you have chosen. When I run the code, I do not expect much to happen as it should nearly all be driven by functions that I will need to call with the correct parameters.
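A minimal sketch of keeping the directory path apart from the file names is given below. The names data_dir and save_subsets, the file-name pattern and the demo rows are all assumptions made for illustration.

```r
# The only line the user needs to edit, e.g. data_dir <- "U:/data".
data_dir <- tempdir()

# Writes one csv per level of the chosen column and returns the paths used.
save_subsets <- function(members, by_col) {
  groups <- split(members, members[[by_col]])
  paths <- character(0)
  for (level in names(groups)) {
    safe <- gsub("[^A-Za-z0-9]+", "_", level)   # make the level safe for a file name
    path <- file.path(data_dir, paste0("members_", by_col, "_", safe, ".csv"))
    write.csv(groups[[level]], path, row.names = FALSE)
    paths[level] <- path
  }
  invisible(paths)
}

# Made-up data with a hypothetical smoker column.
demo <- data.frame(emp_no = c("E1", "E2", "E3"),
                   smoker = c("Smoker", "Ex-smoker", "Smoker"),
                   stringsAsFactors = FALSE)
written <- save_subsets(demo, "smoker")
print(basename(written))
```

As written, split only produces groups that actually occur in the data, so a status with no members gets no file. Converting the column to a factor with all levels listed would be one way to control that behaviour.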

To help the user run your code you also need to supply a brief guide to your functions as a Word/pdf document explaining to the user how your code is run. This is not an explanation of how the code works (as this will be done by comments in your code!). For example, for task iii)a) you may tell the user that entering member.summary() will show the current status of all members. For task iii)i) you may have a function called age.summary(x) and you will explain that entering this function will allow them to see the mean, median and standard deviation of the ages of the policyholders by category x, where x can be ‘All’, ‘Smoker status’ or ‘BMI’.

As noted above, I am assuming that your functions in part iii) will interact with the saved data frame so that the user can type in their functions as simply as possible i.e., for the mortality function you may have something like:

deaths(E110234, E110237)

Rather than:

member.records<-deaths(member.records, E110234, E110237)
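One way for a function to work on the saved file, so that the user never passes a data frame in or out, is sketched below. Everything named here is an assumption for illustration: the file location data_file, the helper record_exit, the column names and the demo rows. The current year is inferred from an active member as year joined plus the years aged since joining.

```r
data_file <- file.path(tempdir(), "members.csv")   # hypothetical location, edited by the user

record_exit <- function(emp_nos, type = c("death", "withdrawal"), path = data_file) {
  type <- match.arg(type)
  members <- read.csv(path, stringsAsFactors = FALSE)
  active <- !is.na(members$current_age)
  # Current year from the first active member (assumes at least one is active).
  i <- which(active)[1]
  this_year <- members$year_joined[i] + members$current_age[i] - members$age_joined[i]
  age_col  <- paste0("age_", type)
  year_col <- paste0("year_", type)
  for (id in emp_nos) {
    row <- match(id, members$emp_no)
    if (is.na(row)) {
      message("Employee number ", id, " does not exist.")
    } else if (!active[row]) {
      message("Employee ", id, " is already recorded as dead or withdrawn.")
    } else {
      members[row, age_col]    <- members$current_age[row]
      members[row, year_col]   <- this_year
      members$current_age[row] <- NA   # no longer an active member
      active[row] <- FALSE
    }
  }
  write.csv(members, path, row.names = FALSE)   # the saved file is the single source of truth
  invisible(members)
}

# Made-up starting file: three members who joined in 2023 and have aged one year.
write.csv(data.frame(emp_no = c("E1", "E2", "E3"), year_joined = 2023,
                     age_joined = c(30, 45, 52), current_age = c(31, 46, 53),
                     age_death = NA, year_death = NA,
                     age_withdrawal = NA, year_withdrawal = NA),
          data_file, row.names = FALSE)

record_exit(c("E2", "E9"), type = "death")        # E9 does not exist
record_exit(c("E2", "E3"), type = "withdrawal")   # E2 is already recorded as dead
```

Thin wrappers with friendlier names could then call this helper with the type fixed. Note that in this sketch the employee numbers are passed as quoted strings.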

Submissions

As noted above, there will be four files to submit – R code for parts i), ii) and iii) and a user guide for part iii) functions.

While R does not crash as much as Python, you should still be checking the data that the user types in and informing them why it is incorrect e.g. wrong type of data or the employee number doesn’t exist, rather than just getting some R errors or nothing happening.

Marking

Coding – 60% - code working correctly, sensible error messages, easily understood results being produced, etc.

Comments in the code and sensible function names for user experience – 20%

User guide – 10% - explain function use, parameters to be entered and results that will be obtained

Use of GitHub – 10%