CDS503 – Machine Learning 2021
First Semester Examination
2020/2021 Academic Session
CDS503 – Machine Learning
1. (a). Questions 1(a)(i) to (iii) are based on Table 1.
Table 1 shows a set of training data on bank loan application approvals, based on each applicant’s average monthly saving and outstanding mortgage loan. A “+” sign means “loan granted”, while a “-” sign means “loan not granted”.
Table 1
Average monthly saving (RM) | Outstanding mortgage loan (in ‘000 RM) | Loan granted
250 | 150 | -
300 | 60 | -
450 | 175 | +
500 | 135 | +
500 | 230 | -
600 | 150 | +
600 | 90 | -
700 | 185 | +
700 | 230 | -
780 | 140 | +
950 | 55 | -
970 | 153 | -
(52/100)
(i). Based on the dataset in Table 1, construct Class C in a 2D graph showing the positive and negative instances. Identify the values of S1, S2, M1 and M2, where
(S1 <= average monthly saving <= S2) AND (M1 <= outstanding mortgage loan <= M2).
(ii). Figure 1 shows the formalized supervised learning setting: instance space X, target function y = f(x), and output space Y. The hypothesis space is the set that is searched.
Figure 1
Construct a hypothesis space Class H that also shows the potential areas of false positives (FP) and false negatives (FN). Explain the error E(h|X) due to false positives and false negatives in this bank loan application approval setting.
(iii). Construct a hypothesis space Class S that may create the potential for overfitting. Describe how overfitting happens in the context of this bank loan application approval setting.
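As an illustration for parts (i) and (ii), the sketch below takes one possible reading of Class C: the tightest axis-aligned rectangle that encloses every positive instance in Table 1. The names `TABLE_1`, `tightest_rectangle` and `count_errors` are assumptions made for this example, and other rectangles that separate the data would be equally valid answers.

```python
# Rows of Table 1: (average monthly saving in RM, mortgage loan in '000 RM, label).
TABLE_1 = [
    (250, 150, "-"), (300, 60, "-"), (450, 175, "+"), (500, 135, "+"),
    (500, 230, "-"), (600, 150, "+"), (600, 90, "-"), (700, 185, "+"),
    (700, 230, "-"), (780, 140, "+"), (950, 55, "-"), (970, 153, "-"),
]


def tightest_rectangle(rows):
    """Return (S1, S2, M1, M2) of the smallest rectangle covering all positives."""
    positives = [(s, m) for s, m, label in rows if label == "+"]
    savings = [s for s, _ in positives]
    loans = [m for _, m in positives]
    return min(savings), max(savings), min(loans), max(loans)


def count_errors(rect, rows):
    """Count false positives and false negatives of a rectangle hypothesis."""
    s1, s2, m1, m2 = rect
    fp = fn = 0
    for s, m, label in rows:
        predicted_positive = s1 <= s <= s2 and m1 <= m <= m2
        if predicted_positive and label == "-":
            fp += 1  # loan granted by the hypothesis but not in the data
        elif not predicted_positive and label == "+":
            fn += 1  # loan refused by the hypothesis but granted in the data
    return fp, fn


rect = tightest_rectangle(TABLE_1)
print("S1, S2, M1, M2 =", rect)
print("FP, FN on the training data =", count_errors(rect, TABLE_1))
```

Widening the rectangle tends to introduce false positives, while shrinking it tends to introduce false negatives, which is the trade-off part (ii) asks about.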
(b). Table 2 shows the training data used to classify the sport type of an athlete, based on the athlete’s height, maximal voluntary contraction (MVC) and maximal oxygen uptake (Max O2).
Table 2
Name | Height | Maximal Voluntary Contraction (MVC) | Maximal Oxygen Uptake (Max O2) | Sport type
Osman | tall | moderate | moderate | Running
Thevan | medium | low | high | Badminton
Seng Huat | tall | high | moderate | Running
Ramu | medium | moderate | low | Running
Ahmad | short | low | high | Badminton
Muhamad | tall | low | high | Badminton
Chin Huat | medium | high | moderate | Running
(48/100)
2. (a).
(i). By using K-Nearest Neighbour (KNN) algorithm, compute the sport type suitable for the person with medium height and medium Maximal Voluntary Contraction (MVC) and high Maximal Oxygen Uptake (Max O2).
The value of k is 3 and proximity metric used is Euclidean Distance.
Show your workings.
(ii). Determine the sport type of the same person with k = 7.
Compare your result with that of 1(b)(i) and state your conclusion.
(iii). By using Naïve Bayes algorithm, compute the sport type suitable for the person with medium height and high Maximal Voluntary Contraction (MVC) and low Maximal Oxygen Uptake (Max O2).
Show your workings.
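The KNN and Naïve Bayes computations on Table 2 can be sketched as follows. Several choices here are assumptions, not part of the question: the ordinal encoding in `LEVEL` (short/low = 1, medium/moderate = 2, tall/high = 3), reading the query's "medium" MVC as "moderate", and the optional Laplace smoothing parameter `alpha` (0 means no smoothing). With this encoding several athletes are equidistant from the KNN query, so the k = 3 result depends on how ties are broken; the sketch simply keeps table order.

```python
from collections import Counter
from math import dist

# Assumed ordinal encoding of the categorical values in Table 2.
LEVEL = {"short": 1, "medium": 2, "tall": 3, "low": 1, "moderate": 2, "high": 3}

# Rows of Table 2: (name, height, MVC, Max O2, sport type).
TABLE_2 = [
    ("Osman", "tall", "moderate", "moderate", "Running"),
    ("Thevan", "medium", "low", "high", "Badminton"),
    ("Seng Huat", "tall", "high", "moderate", "Running"),
    ("Ramu", "medium", "moderate", "low", "Running"),
    ("Ahmad", "short", "low", "high", "Badminton"),
    ("Muhamad", "tall", "low", "high", "Badminton"),
    ("Chin Huat", "medium", "high", "moderate", "Running"),
]


def encode(features):
    """Map categorical feature values to the assumed numeric levels."""
    return [LEVEL[value] for value in features]


def knn_predict(query, k):
    """Majority vote among the k rows closest to the query (Euclidean distance)."""
    ranked = sorted(TABLE_2, key=lambda row: dist(encode(row[1:4]), encode(query)))
    votes = Counter(row[4] for row in ranked[:k])
    return votes.most_common(1)[0][0]


def naive_bayes_scores(query, alpha=0.0, n_values=3):
    """Return prior times likelihood for each sport type (unnormalised)."""
    class_counts = Counter(row[4] for row in TABLE_2)
    scores = {}
    for label, count in class_counts.items():
        score = count / len(TABLE_2)  # class prior
        for i, value in enumerate(query):
            matches = sum(
                1 for row in TABLE_2 if row[4] == label and row[1 + i] == value
            )
            score *= (matches + alpha) / (count + alpha * n_values)
        scores[label] = score
    return scores


print("KNN, k = 7:", knn_predict(("medium", "moderate", "high"), 7))
print("Naive Bayes scores:", naive_bayes_scores(("medium", "high", "low")))
```

Without smoothing, any class that never shows a queried feature value receives a score of zero, which is worth noting when comparing the two classes by hand.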
Figure 2
(32/100)
(i). Based on the Decision Tree (DT) shown in Figure 2, explain why the feature ‘Age’ is the root of the DT. Your justification should refer to information gain and the degree of purity.
(ii). List the labels involved in this DT.
(b). Email spam is annoying: it fills up our inboxes and makes it hard to find genuine emails. Spam filters are used to protect an email server from being overloaded with non-essential emails.
Consider the following information:
• 2% of the emails in the inbox that are filtered based on a specific keyword are considered spam.
• 85% of spam emails contain the keyword “you win”.
• 8.5% of non-spam emails also contain the keyword “you win”.
An email in the inbox is found to contain the keyword “you win” when filtered. Using Bayes’ theorem, calculate the probability that this email is spam. Show your working.
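The Bayes' theorem calculation can be checked with a short sketch. It assumes the 2% figure is the prior probability of spam and the two keyword percentages are the conditional probabilities of seeing “you win”; the function name `posterior_spam` is made up for this example.

```python
def posterior_spam(prior_spam, p_kw_given_spam, p_kw_given_not_spam):
    """P(spam | keyword) by Bayes' theorem with a two-class evidence term."""
    evidence = (
        prior_spam * p_kw_given_spam
        + (1 - prior_spam) * p_kw_given_not_spam
    )
    return prior_spam * p_kw_given_spam / evidence


p = posterior_spam(prior_spam=0.02, p_kw_given_spam=0.85, p_kw_given_not_spam=0.085)
print(f"P(spam | 'you win') = {p:.4f}")
```

The denominator is the total probability of the keyword appearing, summed over the spam and non-spam cases.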
(32/100)
(c). Suppose you are using a Support Vector Machine (SVM) classifier for the two-class classification problem shown in Figure 3. The circled points in the data represent the support vectors.
Figure 3
(36/100)
(i). Determine whether the decision boundary will change if you remove any one of the circled points.
(ii). Determine whether the decision boundary will change if you remove any one of the non-circled points.
(iii). Explain the cost parameter in SVM and how it affects the smoothness of the decision boundary.
3. Answer all questions
(a) Given the following dataset in Table 3.
(36/100)
(i) Compute the parameters (coefficients) w0 and w1 of the linear regression model using the least squares method.
Table 3
Y | X
1.5 | 2.0
2.1 | 2.4
1.9 | 2.5
2.8 | 2.8
2.1 | 2.9
2.0 | 3.0
2.6 | 2.9
2.2 | 3.2
2.7 | 3.3
3.1 | 3.6
(ii) Compute the predicted Y value given X = 4.5.
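The closed-form least squares estimates for parts (i) and (ii) can be sketched as below, using w1 = Sxy / Sxx and w0 = mean(Y) - w1 * mean(X). The sketch assumes the values in Table 3 are read as consecutive (Y, X) pairs, and the names `TABLE_3` and `fit_least_squares` are illustrative only.

```python
# (Y, X) pairs from Table 3, assuming each Y value is followed by its X value.
TABLE_3 = [
    (1.5, 2.0), (2.1, 2.4), (1.9, 2.5), (2.8, 2.8), (2.1, 2.9),
    (2.0, 3.0), (2.6, 2.9), (2.2, 3.2), (2.7, 3.3), (3.1, 3.6),
]


def fit_least_squares(pairs):
    """Return (w0, w1) minimising the squared error of Y = w0 + w1 * X."""
    n = len(pairs)
    mean_y = sum(y for y, _ in pairs) / n
    mean_x = sum(x for _, x in pairs) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for y, x in pairs)
    s_xx = sum((x - mean_x) ** 2 for _, x in pairs)
    w1 = s_xy / s_xx
    w0 = mean_y - w1 * mean_x
    return w0, w1


w0, w1 = fit_least_squares(TABLE_3)
prediction = w0 + w1 * 4.5
print(f"w0 = {w0:.4f}, w1 = {w1:.4f}, predicted Y at X = 4.5: {prediction:.4f}")
```

Note that X = 4.5 lies outside the range of X values in the table, so the prediction is an extrapolation of the fitted line.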
(b) Given the two-dimensional dataset shown in Figure 4, suppose the initial centroids are (3,2) and (5,5). Compute the new centroids of the clusters after the K-means method is applied for one iteration.
Figure 4
(32/100)
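One K-means iteration consists of an assignment step followed by a centroid update. The sketch below shows both steps for the centroids (3,2) and (5,5). The points of Figure 4 are not reproduced in this text, so `HYPOTHETICAL_POINTS` is placeholder data invented purely for illustration and should be replaced with the actual points from the figure.

```python
from math import dist


def kmeans_step(points, centroids):
    """Assign each point to its nearest centroid, then recompute the centroids."""
    clusters = [[] for _ in centroids]
    for point in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
        clusters[nearest].append(point)
    new_centroids = []
    for members, old in zip(clusters, centroids):
        if members:
            # Mean of each coordinate over the points assigned to this cluster.
            new_centroids.append(
                tuple(sum(coord) / len(members) for coord in zip(*members))
            )
        else:
            new_centroids.append(old)  # keep an empty cluster's centroid unchanged
    return new_centroids


# Placeholder data only: NOT the points from Figure 4.
HYPOTHETICAL_POINTS = [(2, 1), (3, 3), (4, 2), (5, 6), (6, 5), (5, 4)]
new_centroids = kmeans_step(HYPOTHETICAL_POINTS, [(3, 2), (5, 5)])
print(new_centroids)
```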
(c) Consider performing the hierarchical agglomerative clustering algorithm on the set of data points shown in Figure 5. Assuming we stop when only two clusters remain, state the linkage method that ensures two balanced clusters are formed (each with two data points). Explain your answer.
Figure 5
(32/100)
4. Answer all questions
(a) Given a two-dimensional dataset, we want to represent the data in only one dimension. There are two methods available: Principal Component Analysis and Linear Discriminant Analysis. State which method should be used to reduce the dataset. Explain your answer.
[Figure: two-dimensional dataset]
(24/100)
(b) You are tasked with building a classification model for a prediction problem. The dataset is large, with a low number of noisy samples. However, a single model is found to perform very poorly. It is therefore decided that an ensemble learning method, either bagging or boosting, will be used to build a better classification model for the prediction problem. Choose one method and explain your reason for choosing it over the other.
(40/100)
(c) A data scientist builds an ensemble classifier using the stacking method. The ensemble classifier consists of three decision trees as the base models and a support vector machine as the meta-model. The ensemble classifier is evaluated on a test set, and its accuracy is observed to be lower than expected. Suggest an improvement that could be made to increase the accuracy of the classifier.
(36/100)
2022-08-08