TTTC6424 FUNDAMENTAL OF DATA SCIENCE SEMESTER 1 ACADEMIC SESSION 2022-2023
FINAL EXAMINATION
SEMESTER 1 ACADEMIC SESSION 2022-2023
TTTC6424
FUNDAMENTAL OF DATA SCIENCE
QUESTION 1 (15 marks)
Given the following business problem, answer the following questions.
A long-distance telecommunications network processes massive numbers of transactions. Records of errors on the network are stored with the objective of mining these data to find trends and patterns that characterize faulty behavior. A worldwide network is an interconnected structure of complex devices with a huge number of different paths and circuits. Many sophisticated devices interact to achieve very high reliability, but errors occur, and a record of these errors is made. Because of extensive redundancy and alternative paths, very few of these errors have an immediate effect on the transmission of data or voice communications. An acute problem is usually easy to determine. However, transient or chronic problems are difficult to pin down. For chronic faults, there is no immediate indication of a circuit problem. For transient faults, at the time a circuit is tested and measured, there may be no evidence of a problem. The goal is to examine a large sample of recorded problems to see whether there are patterns that are predictive of chronic and intermittent faulty behavior. Data are recorded in a coded alphanumeric format. Errors are logged with identifying information such as the type of error, the time, and the connecting circuit.
a) Propose TWO (2) analytics business questions that could help solve the above business problem. (5 marks)
b) Describe the predictive model development process using CRISP-DM, specific to answering ONE (1) of the business questions proposed in 1(a). (10 marks)
QUESTION 2 (20 marks)
Given the training data set Y in Table 1, answer the following questions.
Table 1 Dataset Y
A B C Class
15 1 A C1
20 3 B C2
25 2 A C1
30 4 A C1
35 2 B C2
25 4 A C1
15 2 B C2
20 3 B C2
a) Find the best information gain (Gain) value for attribute A. (5 marks)
b) Find the best information gain (Gain) value for attribute B. (5 marks)
c) Construct a decision tree for data set Y. (5 marks)
d) Given the testing set in Table 2, what is the percentage of correct classifications using the decision tree developed in c)? (5 marks)
Table 2 Testing Set
A | B | C | Class
10 | 2 | A | C2
20 | 1 | B | C1
30 | 3 | A | C2
40 | 2 | B | C2
15 | 1 | B | C1
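The entropy and information gain calculations asked for in Question 2 can be checked with a short script. The following is a minimal sketch, not a model answer: it assumes base-2 Shannon entropy, treats attributes A and B as numeric with binary splits of the form `value <= t` at midpoints between adjacent distinct values, and treats attribute C as categorical with a multiway split. The names `ROWS`, `entropy`, `gain`, `categorical_gain`, and `best_threshold` are illustrative and do not come from the examination paper.

```python
from collections import Counter
from math import log2

# Training data transcribed from Table 1 (columns A, B, C, Class).
ROWS = [
    (15, 1, "A", "C1"), (20, 3, "B", "C2"), (25, 2, "A", "C1"), (30, 4, "A", "C1"),
    (35, 2, "B", "C2"), (25, 4, "A", "C1"), (15, 2, "B", "C2"), (20, 3, "B", "C2"),
]


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain(groups, labels):
    """Information gain of partitioning `labels` into `groups` (lists of labels)."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - remainder


def categorical_gain(rows, col):
    """Gain of a multiway split on a categorical column (assumed split style)."""
    labels = [r[-1] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[col], []).append(r[-1])
    return gain(list(groups.values()), labels)


def best_threshold(rows, col):
    """Best binary split (value <= t) on a numeric column; returns (t, gain)."""
    labels = [r[-1] for r in rows]
    values = sorted({r[col] for r in rows})
    best = (None, -1.0)
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # midpoint between adjacent distinct values
        left = [r[-1] for r in rows if r[col] <= t]
        right = [r[-1] for r in rows if r[col] > t]
        g = gain([left, right], labels)
        if g > best[1]:
            best = (t, g)
    return best


if __name__ == "__main__":
    print("Best split on A:", best_threshold(ROWS, 0))
    print("Best split on B:", best_threshold(ROWS, 1))
    print("Gain of C:", categorical_gain(ROWS, 2))
```

If the course uses a different splitting convention (for example, one branch per distinct numeric value), only `best_threshold` would need to change; `entropy` and `gain` stay the same.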
QUESTION 3 (20 marks)
Table 1 is a database of items bought in several transactions at a grocery store. Assume minimum support, min_sup = 60% and minimum confidence, min_conf = 80%.
Table 1 Transaction Database
Transaction ID | Date | Items
T1 | 30/10/2022 | K, A, D, B
T2 | 30/10/2022 | D, A, C, E, B
T3 | 30/10/2022 | C, A, B, E
T4 | 31/10/2022 | B, A, D
a) Using the Apriori algorithm, find all frequent itemsets. Show all the steps of this process. For each iteration, you must show the candidate and frequent itemsets. Present your work in a manner similar to the example shown in class. (16 marks)
b) List all strong association rules with their support and confidence values. (4 marks)
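The candidate generation, pruning, and rule extraction steps asked for in Question 3 can be sketched in code for checking hand-worked iterations. This is a minimal sketch under stated assumptions, not the layout shown in class: support is computed as a fraction of all transactions, confidence of a rule X -> Y as support(X ∪ Y) / support(X), and the names `TRANSACTIONS`, `support`, `apriori`, and `strong_rules` are illustrative only.

```python
from itertools import combinations

# Item sets transcribed from the transaction database (T1 to T4).
TRANSACTIONS = [
    {"K", "A", "D", "B"},
    {"D", "A", "C", "E", "B"},
    {"C", "A", "B", "E"},
    {"B", "A", "D"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def apriori(transactions, min_sup):
    """Return {frozenset: support} for every frequent itemset."""
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while candidates:
        # Count support for the current candidates and keep the frequent ones.
        level = {c: support(c, transactions) for c in candidates}
        level = {c: s for c, s in level.items() if s >= min_sup}
        frequent.update(level)
        # Join step: unions of frequent k-itemsets that have exactly k+1 items.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep candidates whose every (k-1)-subset is frequent.
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


def strong_rules(frequent, min_conf):
    """Return (lhs, rhs, support, confidence) for rules meeting min_conf."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                conf = sup / frequent[lhs]  # subsets of frequent sets are frequent
                if conf >= min_conf:
                    rules.append((lhs, itemset - lhs, sup, conf))
    return rules


if __name__ == "__main__":
    freq = apriori(TRANSACTIONS, min_sup=0.6)
    for lhs, rhs, sup, conf in strong_rules(freq, min_conf=0.8):
        print(sorted(lhs), "->", sorted(rhs), f"sup={sup:.2f}", f"conf={conf:.2f}")
```

The script reports only the final frequent itemsets and rules; the per-iteration candidate and frequent tables required by part a) still need to be written out by hand.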
QUESTION 4 (5 marks)
For each of the following tasks, specify which data mining task best matches the task's objectives (examples of tasks are classification, clustering, association, and regression). State what the instances and the attributes are in each case. The training and test data are not stated explicitly; you will have to infer the possible training and test data from the statement.
a) Discovering the subspecies within a parrot species based on the characteristics of individual parrots, such as size, shape, speed, color, and life expectancy. (5 marks)
Task:
Justification of the task:
Instances:
Attributes:
2023-05-25