
ELEC9741 - Electrical Engineering Data Science

Part II

Exam, 2023

Instructions

1.    Due Tuesday (15th August) 10pm, Moodle

2.    Typed only. No handwritten submissions.

3.    Submit via Moodle

4.    Computer output: No discussion in the answer = no marks

5.    Analytical results: No working = no marks

6.    If using third-party code/toolboxes/libraries, state this explicitly

7.    When explaining anything, always do so with appropriate equations (wherever possible)

8.    Open book but NO DISCUSSION/CONSULTATION with anyone and NO COPYING

9.    The exam has a 15-minute oral assessment component. During the oral assessment you will be required to run and explain your code and answer questions about your submission to this exam. The oral assessment will be scheduled after the submission (the 16th and 17th of August). Instructions for scheduling a timeslot for the oral assessment will be sent via email.

1. Data Modelling (10 marks)

A firm produces solid metal cylinders and needs to manufacture them with extremely tight tolerances in terms of both dimensions (diameter and height) and weight. The firm has 10 factories of equal capacity running around the clock in 10 different cities. However, the machines in one of the factories have been adversely affected by the environment and consequently produce cylinders of slightly different dimensions (these are referred to as ‘affected cylinders’ in the equations below). The quality assurance team have an automatic measurement and classification system at the only warehouse belonging to the firm (supplied by all the factories) that inspects each cylinder to decide whether it can be shipped or should be recycled for not meeting specifications. The measured data can be modelled as:

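$$x_h = s_h + \eta$$

$$x_d = s_d + \eta$$

$$x_w = s_w + \epsilon$$

where $x_h$, $x_d$ and $x_w$ are the measured height, diameter and weight of a cylinder; $s_h \sim N(s \mid \mu_h, \sigma_h^2)$, $s_d \sim N(s \mid \mu_d, \sigma_d^2)$ and $s_w \sim N(s \mid \mu_w, \sigma_w^2)$ are the true height, diameter and weight of the cylinder being measured; $\eta \sim N(x \mid 0, \sigma_\eta^2)$ is the measurement noise in the length measuring device and $\epsilon \sim N(x \mid 0, \sigma_\epsilon^2)$ is the measurement noise in the weight measuring device. Note that for the normal distributions, $\sigma$ denotes the standard deviation and $\mu$ the mean. They are given as $\sigma_h = 0.5$ mm, $\sigma_d = 0.25$ mm, $\sigma_w = 5$ g, $\sigma_\eta = 0.1$ mm, $\sigma_\epsilon = 1$ g, and

$$\mu_h = \begin{cases} 100\ \text{mm}, & \text{standard cylinder} \\ 102\ \text{mm}, & \text{affected cylinder} \end{cases}$$

$$\mu_d = \begin{cases} 40\ \text{mm}, & \text{standard cylinder} \\ 41\ \text{mm}, & \text{affected cylinder} \end{cases}$$

$$\mu_w = \begin{cases} 1000\ \text{g}, & \text{standard cylinder} \\ 1010\ \text{g}, & \text{affected cylinder} \end{cases}$$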

a.    Comment on the shape of the decision boundary. Explain your answer with appropriate equations describing the underlying models about data distribution and the implications on the decision boundary. [2 marks]

b.    The quality assurance team does not know how many factories are producing affected cylinders, but they suspect it might be one or two. Give the equations for the optimal decision surface for both assumptions (one factory is producing affected cylinders, two factories are producing affected cylinders) and, based on this, provide a classification rule that decides between the two classes by comparing a decision function $f(\mathbf{x})$ with 0:

$$\hat{y} = \begin{cases} \text{standard cylinder}, & \text{for one sign of } f(\mathbf{x}) \\ \text{affected cylinder}, & \text{for the opposite sign of } f(\mathbf{x}) \end{cases}$$

where $\mathbf{x} = [x_h, x_d, x_w]^T$ is the vector of measured height, diameter and weight (i.e., give an expression for $f(\mathbf{x})$). [3 marks]

c.    Implement a suitable simulation (of the measured data) in MATLAB and use it to demonstrate that your classification rule works. [5 marks]
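The measurement model in this question can be sampled directly. The following is a minimal sketch in Python (the exam itself asks for MATLAB), not a model answer: it only generates labelled measurements and does not implement any classification rule. It assumes that one of the 10 equal-capacity factories is affected, so a cylinder is affected with probability 0.1, and that the height and diameter measurements receive separate noise draws with the same standard deviation; the question does not state whether the noise is shared. The names `simulate_cylinder`, `MEANS` and `p_affected` are illustrative.

```python
import random

# Standard deviations from the problem statement (mm, mm, g, mm, g).
SIGMA_H, SIGMA_D, SIGMA_W = 0.5, 0.25, 5.0
SIGMA_ETA, SIGMA_EPS = 0.1, 1.0

# Class-conditional means: (height mm, diameter mm, weight g).
MEANS = {
    "standard": (100.0, 40.0, 1000.0),
    "affected": (102.0, 41.0, 1010.0),
}


def simulate_cylinder(rng, p_affected=0.1):
    """Draw one (label, [x_h, x_d, x_w]) pair from the measurement model."""
    label = "affected" if rng.random() < p_affected else "standard"
    mu_h, mu_d, mu_w = MEANS[label]
    # True height, diameter and weight of the cylinder.
    s_h = rng.gauss(mu_h, SIGMA_H)
    s_d = rng.gauss(mu_d, SIGMA_D)
    s_w = rng.gauss(mu_w, SIGMA_W)
    # Assumption: separate noise draws for the two length measurements.
    x_h = s_h + rng.gauss(0.0, SIGMA_ETA)
    x_d = s_d + rng.gauss(0.0, SIGMA_ETA)
    x_w = s_w + rng.gauss(0.0, SIGMA_EPS)
    return label, [x_h, x_d, x_w]


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [simulate_cylinder(rng) for _ in range(20000)]
affected_fraction = sum(lbl == "affected" for lbl, _ in samples) / len(samples)
standard_heights = [x[0] for lbl, x in samples if lbl == "standard"]
mean_standard_height = sum(standard_heights) / len(standard_heights)
```

Setting `p_affected=0.2` gives the two-affected-factories assumption from part (b).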

2. Machine Learning Pipeline (25 marks)

You have been provided with the health records of 58,976 patients who were admitted to critical care units (CCU). Each patient's record is represented by 16 numerical values and stored in a CSV file called "MedicalRecords.csv" (you can download this from the course webpage). The first line of the file contains the header, while the subsequent lines hold individual patient records separated by commas.

Among the 16 numerical values, "LOSdays" represents the number of days each patient stayed in the CCU, from admission to discharge. The other 15 values represent the daily average counts of various medical events, such as callouts, diagnoses, procedures, CPT events, input events, labs, microbiology labs, notes, output events, medical prescriptions, procedural events, transfers between care units, chart events, and the summary of all the daily averages.

Part 1: Regression Problem

Your task is to design a machine-learning pipeline for regression to predict the number of days a patient is expected to stay in the CCU (LOSdays) based on the other 15 values.

a.    Design and implement the pipeline. You may use the following steps as a guideline:

Prepare your data for the model training. Define your target and features and split the data into training and testing subsets.

Choose an appropriate regression model to train: You may select any regression model, such as Linear Regression, Support Vector Regression, Random Forest Regression, etc. You can utilize MATLAB built-in functions for these models. Use the training subset to train the model.

Choose appropriate evaluation metric(s) for the task and evaluate the performance of your trained model on the test subset. Note that you are expected to implement them with your own code. The use of MATLAB built-in functions for evaluation metrics is not allowed. [10 marks]

b.    Compute the mean, median, standard deviation, minimum, and maximum of all features. What do you observe? [1 mark]

c.    Based on the statistics computed in (b), is a feature normalisation step needed? Justify your answer. [1 mark]

d.    Train and evaluate the model with and without a feature normalisation step. Analyse the influence on performance. [1 mark]


e.    Compute the correlation between each input feature and the target output (LOSdays). What do you observe? Note: use your own code to compute the correlation. Using MATLAB’s built-in functions is not allowed. [1 mark]

f.     Based on the correlation values computed in (e), reduce the dimension of features (number of features) to 10, 4, and 2. Retrain the model for each feature size and compare the performance. What do you observe? [1 mark]
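Parts (a), (e) and (f) ask for evaluation metrics and correlations written without built-in helpers. As a language-neutral illustration of the underlying definitions, the sketch below uses plain Python (the exam itself asks for MATLAB). MAE, RMSE and R² are one possible choice of regression metrics and are not prescribed by the question. The correlation is assumed to be the Pearson coefficient, and ranking by its absolute value is one plausible reading of (f). The function names and the toy columns `a`, `b`, `c` are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 computed from their definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mae": mae,
        "rmse": math.sqrt(ss_res / n),
        "r2": 1.0 - ss_res / ss_tot,             # undefined if y_true is constant
    }


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Note: a constant column gives zero variance and a division by zero.
    return cov / math.sqrt(var_x * var_y)


def rank_features(columns, target):
    """Sort feature names by decreasing |correlation| with the target."""
    scores = {name: pearson_correlation(col, target)
              for name, col in columns.items()}
    return sorted(scores, key=lambda name: abs(scores[name]), reverse=True)


# Toy data only; these are not values from MedicalRecords.csv.
target = [1.0, 2.0, 3.0, 4.0, 5.0]
columns = {
    "a": [2.0, 4.0, 6.0, 8.0, 10.0],
    "b": [5.0, 3.0, 4.0, 1.0, 2.0],
    "c": [1.0, -1.0, 1.0, -1.0, 1.0],
}
ranking = rank_features(columns, target)
metrics = regression_metrics([2.0, 4.0, 6.0, 8.0], [3.0, 4.0, 5.0, 8.0])
```

Keeping the first 10, 4 or 2 names of `ranking` would give the reduced feature sets of (f) under this reading.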

Part 2: Classification Problem

In this part, your objective is to design a machine-learning pipeline for classification to predict whether a patient will have a long, medium, or short stay based on the 15 numerical values used in Part 1.

The duration classifications are defined as follows:

Short period: When the patient is expected to stay for less than 6 days.

Medium period: When the patient is expected to stay for 6 days or more but less than 12 days.

Long period: When the patient is expected to stay for 12 days or more.
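The three duration classes above amount to a thresholding of LOSdays. A minimal sketch in Python (the exam itself asks for MATLAB), with the hypothetical function name `stay_class` and toy input values:

```python
def stay_class(los_days):
    """Map a length of stay in days to 'short', 'medium' or 'long'."""
    if los_days < 6:
        return "short"      # less than 6 days
    if los_days < 12:
        return "medium"     # 6 days or more, but less than 12 days
    return "long"           # 12 days or more


# Toy values chosen to sit on either side of both thresholds.
labels = [stay_class(d) for d in [0.5, 5.99, 6, 11.9, 12, 30]]
```

The boundary values 6 and 12 fall into the medium and long classes respectively, matching the "or more" wording of the definitions.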

g.   Prepare your data for the classification task and replace the regression model from Part 1 with a classification model, utilizing any suitable method like Logistic Regression, Support Vector Machine classifier, Random Forest classifier, etc. You can use MATLAB's built-in functions for training the classification model. [5 marks]

h.   Justify whether the same training/testing split used in Part 1 is valid for training the classification model. If not, re-split the data appropriately. [1 mark]

i.    Determine whether the evaluation metric used in Part 1 is still valid for assessing the classification model. If not, suggest and implement other evaluation metrics. [1 mark]

j.    Perform binary classification of patients who are expected to stay only one day or less and patients who are expected to stay more than one day. Re-split the data into training/testing, retrain the model, and compute the model's performance. [1 mark]

k.    Compute the confusion matrix of the test set for the binary model in (j). How many samples are there in each class? Analyse the observations. [1 mark]

l.     For the system in (j), compute accuracy, balanced accuracy, and F1. Which metric would you use to report the system performance in this case? Justify your answer. [1 mark]
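The quantities named in (k) and (l) all follow from the four cells of a binary confusion matrix. The sketch below shows the standard definitions in plain Python (the exam itself asks for MATLAB). Which class is treated as positive is an assumption made here for illustration (label 1); the question does not fix it. The function names and the toy labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, balanced accuracy and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall_pos = tp / (tp + fn)      # sensitivity; undefined with no positives
    recall_neg = tn / (tn + fp)      # specificity; undefined with no negatives
    balanced_accuracy = 0.5 * (recall_pos + recall_neg)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return {"accuracy": accuracy,
            "balanced_accuracy": balanced_accuracy,
            "f1": f1}


# Toy, deliberately imbalanced labels (2 positives, 8 negatives).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
scores = binary_metrics(*counts)
```

On this toy example the three scores differ noticeably, which is the kind of gap that class imbalance can produce.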