COMP20008 Semester 2 Exam 2021
Question 1
2 pts
A medical practice uses a unique consultation code to represent each patient consultation. The code is made up of the following:
- The type of each consultation, represented by a sequence of two or three capital characters from the set A, B, C, D or E
- Two asterisk symbols
- The patient ID, represented by a six digit number
- Another two asterisk symbols
- A four digit number indicating how many times the patient has visited the practice.

For example, the following are all valid consultation codes:
ACE**234123**0001
DE**123456**1024
ABC**000111**9999

Write a regular expression that will capture all valid consultation codes. Ensure that the six digit patient ID can be extracted as the first capture group in the expression.
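The code format described above can be checked mechanically. The following is a minimal sketch of one candidate pattern, not an official answer; the names `CODE_PATTERN` and `patient_id` are hypothetical, and whole-string matching is an assumption about how the pattern would be applied.

```python
import re

# One candidate pattern (an illustration, not an official answer):
#   [A-E]{2,3}   two or three capital letters from A-E (deliberately not captured)
#   \*\*         two literal asterisks
#   ([0-9]{6})   six-digit patient ID, the first capture group
#   \*\*         two more literal asterisks
#   [0-9]{4}     four-digit visit count
CODE_PATTERN = re.compile(r"[A-E]{2,3}\*\*([0-9]{6})\*\*[0-9]{4}")


def patient_id(code):
    """Return the patient ID if the whole string is a valid code, else None."""
    match = CODE_PATTERN.fullmatch(code)
    return match.group(1) if match else None


for example in ["ACE**234123**0001", "DE**123456**1024", "ABC**000111**9999"]:
    print(example, "->", patient_id(example))
```

Leaving the consultation type outside any parentheses is what keeps the patient ID as group 1; the asterisks are escaped because `*` is otherwise a quantifier.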
Question 2
3 pts
The following regular expression is designed to be a sentence tokenizer.

[\s]*([^.]*)\.

a) Explain how the expression works to tokenize sentences. (1 mark)
b) Suggest two strings for which the expression may not fulfill its intended purpose and explain why. (2 marks)
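To see what the expression does in practice, it can be run over a few sample strings. This is a small sketch using Python's `re.findall`; the sample sentences and the helper name `tokenize` are assumptions chosen for illustration, not part of the exam.

```python
import re

# The expression from the question, unchanged.
TOKENIZER = re.compile(r"[\s]*([^.]*)\.")


def tokenize(text):
    """Return the contents of capture group 1 for every match in the text."""
    return TOKENIZER.findall(text)


# Leading whitespace is skipped, then everything up to the next full stop is captured.
print(tokenize("I like tea. You like coffee."))

# Full stops that do not end a sentence (abbreviations, decimals) also split the text.
print(tokenize("Dr. Smith paid $3.50. He left."))

# Sentences ending in other punctuation are never matched at all.
print(tokenize("Is it late? Yes!"))
```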
Question 3
2 pts
Calculate the Sorensen-Dice similarity between the following words using character tri-grams including padding:

drive
drove

Enter the similarity as a numeric value in the box below.
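The calculation can be sketched in a few lines of Python. The padding convention is an assumption here: the question does not say how many padding characters to use, so the hypothetical `pad_width` parameter defaults to n - 1 characters of `#` on each side and can be changed. The similarity used is 2|A ∩ B| / (|A| + |B|) over the two sets of tri-grams.

```python
def char_ngrams(word, n=3, pad="#", pad_width=None):
    """Return the set of character n-grams of a word after padding both ends."""
    if pad_width is None:
        pad_width = n - 1  # assumed convention; the exam does not specify it
    padded = pad * pad_width + word + pad * pad_width
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice_similarity(a, b, **ngram_options):
    """Sorensen-Dice similarity: 2 * |A & B| / (|A| + |B|) over n-gram sets."""
    grams_a = char_ngrams(a, **ngram_options)
    grams_b = char_ngrams(b, **ngram_options)
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


print(sorted(char_ngrams("drive")))
print(sorted(char_ngrams("drove")))
print(dice_similarity("drive", "drove"))               # two padding characters per side
print(dice_similarity("drive", "drove", pad_width=1))  # one padding character per side
```

The two calls give different values, so the answer depends on which padding convention the subject's lecture material uses.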
Question 4
2 pts
Match each of the following histograms to one boxplot letter:
Question 5
2 pts
A data scientist is given a JSON file containing the results of football fixtures, similar to the one you encountered in Assignment 1. He wishes to extract data from the file but does not have a library available to read JSON files. As a result, he uses an online 'JSON to CSV converter' tool to produce a CSV file, but his program is not able to parse the resultant file as he expects.
a) Explain the most likely reason why this might be the case. (1 mark)
b) Suggest another format he could use to represent the data and explain why it would be more suitable than CSV. (1 mark)
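One plausible difficulty is that nested JSON values have no natural place in a flat CSV row. The sketch below illustrates this under stated assumptions: the fixture record is hypothetical (the Assignment 1 file is not reproduced here), and `naive_csv_row` only imitates what a converter might do by writing nested values as strings.

```python
import csv
import io
import json

# Hypothetical fixture record with nested values; not taken from the real file.
fixture = {
    "home": "Melbourne",
    "away": "Sydney",
    "score": {"home": 2, "away": 1},
    "scorers": ["Lee, J.", "Ng", "Ali"],
}


def naive_csv_row(record):
    """Write one record as a CSV row, serialising nested values as strings."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(
        [value if isinstance(value, str) else json.dumps(value) for value in record.values()]
    )
    return buffer.getvalue().strip()


row = naive_csv_row(fixture)
naive_fields = row.split(",")            # what a split-on-comma parser would see
proper_fields = next(csv.reader([row]))  # what a quoting-aware CSV parser sees

print(row)
print(len(naive_fields), "fields after splitting on commas")
print(len(proper_fields), "fields with a CSV parser")
```

Even when the row is parsed correctly, the nested score and scorer list arrive as opaque strings, so the hierarchy still has to be recovered by other means.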
Question 6
1 pt
Max is having a conversation about data integration. He says “Using blocking for record linkage between two datasets (dataset A and dataset B) is a bad idea. It is too time consuming to assign the records to blocks. It is much better instead to directly compare the records in A against the records in B without using any blocking step.” Argue why Max’s statement is incorrect. (1 mark)
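The trade-off Max describes can be made concrete by counting record comparisons. The following is a minimal sketch with made-up records and a hypothetical blocking key (the first letter of the name); real blocking keys and datasets would differ.

```python
from collections import defaultdict


def all_pairs(records_a, records_b):
    """Every record in A compared against every record in B (no blocking)."""
    return [(a, b) for a in records_a for b in records_b]


def blocked_pairs(records_a, records_b, key):
    """Only compare records that fall into the same block under the given key."""
    blocks = defaultdict(lambda: ([], []))
    for record in records_a:           # one pass over A to assign blocks
        blocks[key(record)][0].append(record)
    for record in records_b:           # one pass over B to assign blocks
        blocks[key(record)][1].append(record)
    return [(a, b) for left, right in blocks.values() for a in left for b in right]


# Hypothetical datasets, for illustration only.
dataset_a = ["anna", "bob", "carl", "dina"]
dataset_b = ["ana", "bobby", "carla", "dean"]


def first_letter(name):
    return name[0]


print(len(all_pairs(dataset_a, dataset_b)), "comparisons without blocking")
print(len(blocked_pairs(dataset_a, dataset_b, first_letter)), "comparisons with blocking")
```

Assigning blocks takes one pass over each dataset, whereas the unblocked comparison grows with the product of the two dataset sizes.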
Question 7
4 pts
Business X and Business Y have decided to conduct a joint marketing campaign. For this marketing campaign, they need to determine how many customers they have in common (how many people are in the customer list of both businesses). They implement the following 2 party privacy preserving protocol, making use of the SHA-256 one way hashing function.

#In the following, the '+' symbol indicates string concatenation (joining two strings)

#Business X does the following
SetX = empty
For each customer at Business X
    SetX = SetX ∪ SHA-256(“First Name” + “Last Name”)
Send SetX to Business Y

#Business Y does the following
SetY = empty
For each customer at Business Y
    SetY = SetY ∪ SHA-256(“First Name” + “Last Name”)
result = count(SetX ∩ SetY)
Share result with Business X
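The protocol steps above can be sketched in Python with the standard `hashlib` module. The customer lists are hypothetical, the helper name `hashed_names` is an assumption, and UTF-8 encoding of the concatenated name is a choice the pseudocode does not specify.

```python
import hashlib


def hashed_names(customers):
    """Return the set of SHA-256 hex digests of first name + last name."""
    return {
        hashlib.sha256((first + last).encode("utf-8")).hexdigest()
        for first, last in customers
    }


# Hypothetical customer lists, for illustration only.
business_x = [("Ada", "Lovelace"), ("Alan", "Turing"), ("Grace", "Hopper")]
business_y = [("Alan", "Turing"), ("Grace", "Hopper"), ("Edsger", "Dijkstra")]

set_x = hashed_names(business_x)  # computed by Business X, then sent to Business Y
set_y = hashed_names(business_y)  # computed by Business Y
result = len(set_x & set_y)       # count of common hashes, shared with Business X

print(result)
```

Both parties hash the same input with the same unkeyed function, so equal names always produce equal digests; that is what makes the intersection count work.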
2022-07-12