Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

INFS3200 Advanced Database Systems

Prac 3: Data Linkage (5%)

Semester 1, 2022

Introduction

Learning objectives:

Learn how to use JDBC/cx_Oracle to interact with Oracle DBMS, including the operations

of create table, update table, query table, etc.

Learn how to measure the performance of data linkage.

Understand various string similarity measures, e.g., edit distance and Jaccard coefficient, for

field-level data linkage.

Marking Scheme:

This Prac carries 5 marks for 5% of this course.

1. 2 marks: Complete two missions in Task 1, each mission is accounted for 1 mark.

2. 1 mark: Complete Task 2, the correct answer for each bullet point is worth 0.5 marks.

3. 1 mark: Complete Task 3.

4. 1 mark: Complete Task 4.

Please include your screenshots of the data import and data linkage results (precision, recall,f- measure and time) in a word/pdf document, as well as your answers to the task questions. It is not mandatary to include your student ID in every screenshots as we can check the originality through your code submission.

The submission should be a zip/rar file, which includes both the answer document and your source code files. Please format your document and code nicely to help tutor’s marking process. A poorly formatted document may receive a reduced mark. Submit your work to the Blackboard website of the course by 16:00 (i.e., 4pm), Friday 20 May 2022.

Late Penalties (from ECP):

Where an assessment item is submitted after the deadline, without an approved extension, a late penalty will apply. The late penalty shall be 10% of the maximum possible mark for the assessment item will be deducted per calendar day (or part thereof), up to a maximum of seven (7) days. After seven days, no marks will be awardedfor the item. A day is considered to be a 24 hour blockfrom the assessment item due time. Negative marks will not be awarded.

Part 1: Preparation of the Restaurant Data Set

In this part, our objective is to prepare data for subsequent tasks. There are two options for data storage: (1) import data into Oracle database, or (2) store data as CSV files. Both solutions are implemented and available in either Java or Python. Although both options are available, we recommend you to try the database option, as you can learn more about how to interact with database using Java/Python code.

Option 1: Import data to Oracle database through JDBC/cx_Oracle

This option requires the following software:

(1) Oracle platform used in Prac1 & Prac2,

(2) Java JDK/Python library (Java 8/Python 3.8 recommended), and

(3) Java/Python IDE.

Here we give an example of Java with Eclipse, but others, like Intellij IDEA for Java or PyCharm for Python, will work the same. You need to go through the following 4 steps.

Step 1: Log in and Create Users

In this part, we first use SQL Plus” to create a database user, then we need to connect user to “SQL Developer” and interact with an Oracle database. In SQL Plus Command Line window, login to Oracle with a default username SYS AS SYSDBA” and password Password1!”, as shown below.

Follow the commands below to create a user:

/*Enable user creation*/

ALTER SESSION SET "_ORACLE_SCRIPT"=TRUE;

/* Create a user named “S1234567” (student id) with password “w” */

CREATE USER S1234567 IDENTIFIED BY w ACCOUNT UNLOCK DEFAULT TABLESPACE "USERS" TEMPORARY TABLESPACE "TEMP" PROFILE        "DEFAULT";

/* Grant DBA privilege to “S1234567” */

GRANT DBA TO S1234567;

Same as that of the previous Prac1 & Prac2, please change “S1234567” to your student ID.

Step 2: Use Oracle SQL Developer

Open SQL Developer and connect the user that we have just created.

Click the green + button to connect to the database. Fill the connection information in the prompted dialog window shown as above. The “connection name” is specified by the user, and the “username” is the S1234567” we have just created. “Password” is the password for that user, i.e., “w” in this case. You also need to change SID to orcl” . Then press Connect” button to connect to the user S1234567”.

Step 3: Import the Data Linkage Project via Java IDE

In this step, we import the Java code template to your Java IDE. Here we use Eclipse as our example. Other IDEs work in a similar way. From the Windows 10 Start” menu, you can search the Eclipse” .

When you open the eclipse” software, a dialog window will be prompted asking you to select a workspace, as shown below. Choose a folder where you want your Eclipse project to be stored.

Extract the DataLinkage Java project  downloaded  from Blackboard  course website to the workspace. At the eclipse Project Explorer” panel, mouse right-click and then choose the “import” function. From the prompted dialog window, click General Existing Projects into Workspace” and then press the Next” button. Browse the file folder to find the DataLinkage project in your workspace and press Finish”, as shown below. Now the project has been successfully imported into your workspace.

Step 4: Understand JDBC/cx_Oracle Connection to Oracle DBMS

The JDBC API is a Java API that can access any kind of tabular data, especially data stored in a Relational Database (the counterpart in Python is cx_Oracle, open a command prompt and enter “python -m pip install cx_Oracle --upgrade” to install cx_Oracle to Python). JDBC helps you to write Java applications that manage these three programming activities:

Connect to a data source, like a database,

Send queries and update statements to the database,

Retrieve and process the results received from the database in answer to your query.

In order to connect a Java application with the Oracle database, typically you need to follow five steps to perform database connectivity. In the following example, we use a database supported by Oracle  12c. Hence, we need to know some related information for the Oracle database management:

Driver class: The driver class for the Oracle database is oracle.jdbc.driver.OracleDriver” . URL: The URL for Oracle 12c database is jdbc:oracle:thin:@localhost:1521:orcl” where

jdbc” is the API, “oracle” is the database, “thin” is the driver, “localhost” is the server name on which Oracle is running (we may also use IP address), “1521” is the port number, and “orcl” is the Oracle service name.

Username: In this Prac, please use the S1234567” user that you have just created. Password: Password is assigned at the time of creating a user, i.e., “w” in this case.

The following simple code fragment gives an example of a five-step process to connect to an Oracle database and perform certain queries.

Example 1:

import java.sql.*;

class OracleCon {

public static void main(String args[]) {

try {

// step1 load the driver class

Class.forName("oracle.jdbc.driver.OracleDriver");

// step2 create the connection object

Connection con = DriverManager.getConnection("jdbc:oracle:thin:@localhost:1521:orcl", "S1234567", "w");

// step3 create the statement object

Statement stmt = con.createStatement();

// step4 execute query

ResultSet rs = stmt.executeQuery("select * from table");

while (rs.next())

System.out.println(rs.getInt(1) + "  " + rs.getString(2) + "  " + rs.getString(3));

// step5 close the connection object

con.close();

} catch (Exception e) {

System.out.println(e);

}

}

}

In this code fragment, we create a connection con to the database and store the query result of “SELECT * FROM table” to the result set rs and display them. Browse the DataLinkage project in your IDE and find the package Oracle” . We provide various Java classes for interacting with the Oracle database, including:

DBConnection: Basic connection settings and functionalities.

CreateTable: Create a table in the database

InsertTable: Insert tuples into a table

DeleteTable: Delete tuples from a table

ReadTable: Select tuples from a table

DropTable: Drop a table from the database

In order to play with these functionalities, you need to first change the username to the user you just created in DBConnection.java class.

CREATE TABLE

The dataset we are going to use in this Prac is a list of restaurants at various locations. The table contains four attributes: ID, Name, Address, and City. We use the following SQL statement to create this table in the database:

CREATE TABLE RESTAURANT (ID NUMBER NOT NULL,

NAME VARCHAR2(100 BYTE) NOT NULL,

ADDRESS VARCHAR2(255 BYTE) NOT NULL,

VARCHAR2(50 BYTE) NOT NULL,

RESTAURANT_PK PRIMARY KEY(ID) ENABLE);

Run the CreateTable.java class to execute this SQL query using JDBC. Open CreateTable.java, right-click the class, and choose Run As Java Application” . When the program terminates, check the table you created in SQL Developer.

INSERT INTO TABLE

The class InsertTable.java reads all the restaurant records from the excel file we have provided in  this  Prac,  i.e.  “data\restaurant.csv”,  and  insert  these  records  one  by  one  into  the RESTAURANT table using JDBC. Run InsertTable.java and then check the result in SQL Developer.

SELECT FROM TABLE

The class ReadTable.java reads all the restaurant records from table RESTAURANT in the database by executing the SQL query, i.e., “SELECT * FROM RESTAURANT”, using JDBC. It then prints out these records (ID, Name, Address, City) on the Eclipse Console, as shown below. There should be 858 records in total, which can be seen at the end of the printed screen.


Option 2: Read data from the CSV File

In this option, IDE (integrated development environment) is used, which can be either a Java IDE or Python IDE, determined by the language that you prefer to use.

1. Java IDE:

In Java, we provide you with the data loader functionality in CSVLoader.java under Data package. The function restaurantLoader() in CSVLoader.java reads the restaurant information from the  CSV  file.  We  call  this  function  in three  files, namely NestedLoopByName.java,


NestedLoopByNameED.java and NestedLoopByNameJaccard.java. Therefore, you should enter these three files respectively and enable the CSV reader by uncommenting the following line:

Restaurant[] restaurants = CSVLoader.restaurantLoader("data\\restaurant.csv"); In the meantime, comment the option 1 part in each file, the result is shown as follows:


Be sure to perform the same process to all three files.

2. Python IDE:

In Python, we provide you with the data loader functionality in csv_loader.py. Since we also provide two data loading option in Python, we choose the data loading from CSV as default in nested_loop_by_name.py, nested_loop_by_name_ed.py and nested_loop_by_name_jaccard.py. You can switch between two options in the following code snippet.

Task 1: Read the code in NestedLoopByName.java/nested_loop_by_name.py and focus on the data loading part. Understand how data are loaded into restaurants array and complete the following data statistics tasks in the given class DataStatistics.java/data_statistics.py:

Count the number of restaurant records whose city is new york” and new york city”,

respectively.

Count total number of distinct values in city attribute (Hint: use HashSet in Java or set

in Python).

There are two ways of completing this task:

(1) Load data from Oracle database or CSV file to Restaurant[] restaurants” object using the method implemented in NestedLoopByName.java/ nested_loop_by_name.py, then obtain corresponding results by processing the “restaurants”, or

(2) Write SQL queries that retrieve the task results directly from database, send such query through JDBC/cx_Oracle (like the example shown in Example 1) and print the result to screen.

(Note: in Java, if you want to complete this task through SQL query, you may also change the ExecuteQuery(query) in DBConnection.java, as it is initially designed for parsing restaurant records to a Restaurant[] array. Writing a new parser for your new SQL is a better idea than changing ExecuteQuery(query) directly as it is called by other classes as well.)

Please screenshot both your code in DataStatistics.java/data_statistics.py and the running results. They should be included within one image. The format of the results is similar as follows (The actual value is not the same as the example):

Part 2: Measure the Performance of Data Linkage

There are some duplications in the original restaurant dataset, and we will use Data Linkage techniques to detect these duplications, i.e., pairs of records that refer to the same real-world entity. For example, the following two records actually represent the same restaurant:

5, “hotel bel-air”, “701 stone canyon rd.”, “bel air”

6, “bel-air hotel”, “701 stone canyon rd.”, “bel air”

1. Nested-Loop Join for Data Linkage

The nested-loop join, a.k.a nested iteration, uses one join input as the outer input table and the other one as the inner input table. The outer loop consumes the outer input table row by row. The inner loop, executed for each outer row, searches for matching rows in the inner input table. The pseudo-code below shows its workflow.

For each tuple r in R do

For each tuple s in S do

If r and s satisfy thejoin condition Then output the tuple <r,s>

In the simplest case, the search scans an entire table or index, which is called a naive nested loop join. In this case, the algorithm runs in O(|R|*|S|) time, where |R| and |S| are the number of tuples contained in tables R and S respectively and can easily be generalized to join any number of relations. Furthermore, the search can exploit an index as well, which is called an index nested loop join. A nested loop join is particularly effective if the outer input is small and the inner input is pre-indexed and large. In many small transactions, such as those affecting only a small set of rows, index nested loop joins are superior to both merge joins and hash joins. In large queries, however, nested loop joins are often not the optimal choice.

In this Prac, we adopt a nested-loop method to self-join the RESTAURANT table for data linkage. We first consider perfect matching on the Name” attribute. In other words, we link two restaurant records (i.e., they refer to the same entity) only if their names are identical, e.g.,

1, arnie morton's of chicago”, “435 s. la cienega blv.”, “los angeles

2, “arnie morton's of chicago”, “435 s. la cienega blvd.”, “los angeles”

Program namely, NestedLoopByName.java/nested_loop_by_name.py is the implementation of this algorithm. It first reads all restaurant records from the Oracle database using JDBC, and then self-joins the table by a nested-loop join. If two records have the same Name” value, the algorithm outputs this linking pair, i.e., id1_id2 where id1 and id2 are the ID”s of these records respectively.

2. Precision, Recall, and F-measure

We can regard the problem of data linkage as a classification task. Specifically, for each pair of records r and s, we predict a binary class label: “0” or “1”, where “1” means we believe these two records refer to the same entity and hence can be linked. Naturally, a data linkage algorithm is perfect if and only if:

(1) Precision: all the linked pairs it predicts are correct, and

(2) Recall: all possible linked pairs are discovered.

We provide a file data\\restaurant_pair.csv” which stores the gold-standard linking results, i.e., all the restaurant record pairs that refer to the same real-world entity. Suppose that D represents the set of linked pairs obtained by the data linkage algorithm, and D* is the set of gold-standard linking results. The algorithm is regarded as perfect if the two sets are identical, i.e., D=D*.

Precision and Recall are  well-known  performance  measures  that  can  capture  the  above intuitions. For classification tasks, the terms true positives, true negatives, false positives, and false negatives are considered to compare the results of the classifier under test with trusted external judgments (i.e., gold-standard). The termspositive and negative refer to the classifier's prediction (sometimes known as the expectation), and the terms true and false refer to whether

that prediction corresponds to the external judgment (sometimes known as the observation), as shown below. Given a linked pair id1_id2 in this Prac, it is like:

True positive, if it belongs to both D and D*

False positive, if it only belongs to D

False negative, if it only belongs to D*

True negative, if it belongs to neither D nor D*

Based on these terms, precision and recall are defined as follows, where tp,fp, andfn represent true positive, false positive, and false negative, respectively.