
You must use your individually generated pair of data sets, data_wrangling_rl1_2023.csv and data_wrangling_rl2_2023.csv, and the corresponding ground truth data set, data_wrangling_rlgt_2023.csv, for all tasks of this assignment.

Assignment Tasks

The tasks for this assignment are similar to what you had to do in lab. You are required to run your record linkage program (including any modifications you have made to this program) on your individual two data sets and the ground truth data set (generated as described above), and address the following questions:

Task 1: Blocking (6 marks):

(a) How does blocking affect your results? Specifically, describe your choice of blocking method and choice of blocking keys. Discuss which attributes and/or attribute combination(s) in the given data sets were useful as blocking keys and which were not, and why.

(b) If there is a trade-off between performance (reduction ratio, pairs completeness, and pairs quality) and the quality of the final record linkage results, where do you think the optimal balance is, and why?

(c) Do you think this trade-off would change on different data sets with different levels (both low and high) and characteristics of data quality? If so, how and why?

Write a minimum of 400 and a maximum of 500 words in total in the corresponding answer field. Clearly indicate your answers for (a) to (c).
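The three blocking measures named in (b) can be computed from the set of candidate record pairs and the set of true matches. The following is only a minimal sketch using the commonly used definitions; the function name `blocking_measures`, its parameters, and the toy record identifiers are assumptions made for illustration and are not taken from the lab program.

```python
def blocking_measures(num_recs_a, num_recs_b, candidate_pairs, true_matches):
    """Return (reduction ratio, pairs completeness, pairs quality).

    candidate_pairs and true_matches are sets of (id_a, id_b) tuples.
    All names here are hypothetical and chosen for illustration only.
    """
    total_pairs = num_recs_a * num_recs_b
    true_in_candidates = len(candidate_pairs & true_matches)

    # Reduction ratio: share of the full comparison space removed by blocking.
    rr = 1.0 - len(candidate_pairs) / total_pairs if total_pairs else 0.0
    # Pairs completeness: share of true matches that survive blocking.
    pc = true_in_candidates / len(true_matches) if true_matches else 0.0
    # Pairs quality: share of candidate pairs that are true matches.
    pq = true_in_candidates / len(candidate_pairs) if candidate_pairs else 0.0
    return rr, pc, pq


# Toy example: two data sets with 4 records each (16 possible pairs).
candidates = {("a1", "b1"), ("a2", "b2"), ("a3", "b4"), ("a4", "b3")}
truth = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}

rr, pc, pq = blocking_measures(4, 4, candidates, truth)
print(f"RR={rr:.2f} PC={pc:.2f} PQ={pq:.2f}")  # RR=0.75 PC=0.67 PQ=0.50
```

In this toy case blocking removes 75% of the comparisons but loses one of the three true matches, which is the kind of trade-off question (b) asks about.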

Task 2: Comparison and Classification (6 marks):

(a) How do different comparison techniques affect linkage results? Discuss and justify how you selected appropriate comparison functions for different attributes, and why these selected functions are suitable while others are not.

(b) How do different classification techniques using different parameter settings affect linkage quality? Discuss and justify how you selected an appropriate classification technique and corresponding parameter settings to obtain high linkage quality, and why other classification techniques are unsuitable or less suitable.

(c) Using suitable linkage quality measures (as discussed in the lectures in week 8), describe how the final record linkage quality changes with the choice of different parameters and techniques. Is the record linkage quality particularly sensitive to certain parameters, or to the choice of comparison or classification techniques? If so, why is this the case?

(d) Are there any evaluation measures that are not useful? Describe why these measures are not useful in evaluating the performance of a record linkage project.

(e) Provide the numerical linkage evaluation results for the other (non-optimal, see below) parameter settings that you have used. You only have to provide the output file for your best obtained linkage results (see the next task).

Write a minimum of 400 and a maximum of 500 words in total in the corresponding answer field. Clearly indicate your answers for (a) to (e).
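When reporting numerical linkage results for different parameter settings, the classified matches can be compared against the ground truth. The sketch below assumes that precision, recall, and F-measure are among the measures covered in the lectures; the function name `linkage_quality` and the toy record identifiers are hypothetical and are not part of the lab program.

```python
def linkage_quality(classified_matches, true_matches):
    """Return (precision, recall, F-measure) for sets of (id_a, id_b) pairs.

    Hypothetical helper for illustration; not part of the lab code.
    """
    tp = len(classified_matches & true_matches)   # correctly classified matches
    fp = len(classified_matches - true_matches)   # non-matches classified as matches
    fn = len(true_matches - classified_matches)   # true matches that were missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return precision, recall, f_measure


# Toy example: 3 pairs classified as matches, 4 true matches in the ground truth.
classified = {("a1", "b1"), ("a2", "b2"), ("a3", "b4")}
truth = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}

precision, recall, f_measure = linkage_quality(classified, truth)
print(f"P={precision:.2f} R={recall:.2f} F={f_measure:.2f}")  # P=0.67 R=0.50 F=0.57
```

Running the same calculation for each parameter setting gives a consistent basis for the comparison asked for in (c) and (e).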

Task 3: Optimal Settings (4 marks):

(a) What is the best linkage quality result you are able to achieve, both in the blocking and the classification steps? Why do you think this combination of parameters and techniques worked well for your data set pair?

(b) Are the results equally good for all evaluation measures discussed in the lectures in week 8, or only for some? If the results are good only for some measures, why do you think the results are not good for other measures?

Write a minimum of 150 and a maximum of 250 words in the corresponding answer field. Clearly indicate your answers for (a) and (b).

In addition to answering this task, you must also submit the output file which contains the linked and classified matching record pairs (as a CSV file) for the best linkage result you were able to obtain.

You must name the file you upload data_wrangling_rl_best_results_2023.csv. You must use the Python program saveLinkResult.py, which we use in the labs, to write the linkage output into a file. Your submitted output file must exactly follow the CSV file format generated by this program! We will use a program to check linkage quality using this output file to validate what you write in your answers. If our program does not work with your submitted file because it does not follow the required file structure, you will lose marks.

Task 4: Data Quality (4 marks):

(a) How dirty are these new data sets you generated for this assignment compared to all the data sets you have worked with in labs 3 to 7? Describe your impression after having conducted the linkage on the different data sets used in the labs.

(b) How can you determine this? Describe the methodology you used to assess the quality of the data sets we provided for this assignment and compare it against the quality of the datasets from labs 3 to 7 (such as any calculations you used to determine the data quality).

Write a minimum of 150 and a maximum of 250 words in the corresponding answer field. Clearly indicate your answers for (a) and (b).
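One possible calculation for (b) is the proportion of missing values per attribute, which can be compared across data sets. This is a minimal sketch of just one such indicator, not a prescribed methodology; the function name `missing_rates`, the tokens treated as missing, and the toy column names are assumptions for illustration and may not match your actual files.

```python
import csv
import io


def missing_rates(csv_file, missing_tokens=("", "n/a")):
    """Return {attribute: fraction of records with a missing value}.

    csv_file is any open text file object with a header row.
    The tokens treated as missing are an assumption; adjust as needed.
    """
    reader = csv.DictReader(csv_file)
    field_names = reader.fieldnames or []
    counts = {name: 0 for name in field_names}
    num_rows = 0
    for row in reader:
        num_rows += 1
        for name in field_names:
            # Short rows yield None for absent fields, so fall back to "".
            value = (row[name] or "").strip().lower()
            if value in missing_tokens:
                counts[name] += 1
    return {name: (counts[name] / num_rows if num_rows else 0.0)
            for name in field_names}


# Toy data with hypothetical column names, standing in for a real CSV file.
toy = io.StringIO(
    "rec_id,first_name,postcode\n"
    "1,anna,2600\n"
    "2,,2601\n"
    "3,ben,\n"
    "4,,2602\n"
)
rates = missing_rates(toy)
for name, rate in rates.items():
    print(f"{name}: {rate:.2f}")
```

For a real data set, pass an opened file instead of the toy string, for example `open("data_wrangling_rl1_2023.csv", newline="")`.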

Marking:

For each of the tasks described above, you will receive up to the marks shown for appropriately answering the corresponding questions, and for describing and justifying what you have done.

For Task 3, we will also compare your answers to the numerical results we obtain from your submitted file of linked records. You will lose marks if the numerical results we obtain differ from what you describe in your textual answers.

For numerical answers, round the final numerical results to two decimal places (e.g. 0.01 or 42.42).