Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

2021 Semester 2 Exam

Question 1

2 pts

A medical practice uses a unique consultation code to represent each patient consultation. The code is made up of the following:

- The type of each consultation, represented by a sequence of two or three capital characters from the set A, B, C, D or E

- Two asterisk symbols

- The patient ID, represented by a six digit number

- Another two asterisk symbols

- A four digit number indicating how many times the patient has visited the practice.

For example, the following are all valid patient codes:


ACE**234123**0001

DE**123456**1024

ABC**000111**9999

Write a regular expression that will capture all valid consultation codes. Ensure that the six digit patient ID can be extracted as the first capture group in the      expression.



Question 2

3 pts

The following regular expression is designed to be a sentence tokenizer. [\s]*([^.]*)\.

a) Explain how the expression works to tokenize sentences. (1 mark)

b) Suggest two strings for which the expression may not fulfill its intended purpose and explain why. (2 marks)


Question 3

2 pts

Calculate the Sorensen-Dice similarity between the following words using character tri-grams including padding:

drive

drove

Enter the similarity as a numeric value in the box below.




Question 4

2 pts

Match each of the following histograms to one boxplot letter:



Question 5

2 pts

A data scientist is given a JSON file containing the results of football fixtures,      similar to the one you encountered in Assignment 1.  He wishes to extract data   from the file but does not have a library available to read JSON files. As a result, he uses an online 'JSON to CSV converter' tool to produce a CSV file, but his     program is not able to parse the resultant file as he expects.

a) Explain the most likely reason why this might be the case (1 mark)                 b) Suggest another format he could use to represent the data and explain why it would be more suitable than CSV (1 mark)


Question 6

1 pts

Max is having a conversation about data integration. He says Using blocking for record linkage between two datasets (dataset A and dataset B) is a bad idea. It is too time consuming to assign the records to blocks. It is much better instead to    directly compare the records in A against the records in B without using any blocking step” . Argue why Maxs statement is incorrect. (1 mark)




Question 7

4 pts

Business X and Business Y have decided to conduct a joint marketing campaign.

For this marketing campaign, they need to determine how many customers they have in common (how many people are in the customer list of both businesses). They implement the following 2 party privacy preserving protocol, making use of the SHA-256 one way hashing function.

#In the following, the ’+’ symbol indicates string concatenation (joining two strings)

#Business X does the following

SetX=empty

For each customer at Business X

SetX=SetXUSHA-256(“First Name”+“Last Name”)

Send SetX to Business Y

#Business Y does the following

SetY=empty

For each customer at Business Y

SetY=SetYUSHA-256(“First Name”+“Last Name”)

result=count(SetXSetY)

Share result with Business X