Database Systems Homework 4

(Due by 10pm on May 11, 2021 to MyCourses Homework 4 Submission Folder)

 

You must include the following with your answer to this assignment and with your electronic signature (“Signed by Your_name” is acceptable).

 

“I have done this assignment completely on my own. I have not copied it, nor have I given my solution to anyone else. I understand that if I am involved in plagiarism or cheating I will have to sign an official form that I have cheated and that this form will be stored in my official university record. I also understand that I will receive a grade of 0 for the involved assignment and my grade will be reduced by one level (e.g., from A to A- or from B+ to B) for my first offense, and that I will receive a grade of “F” for the course for any additional offense of any kind.”

 

 

1. Suppose relation R has 10,000,000 tuples. The tuples are stored in consecutive storage spaces on disk. We would like to create a B+ tree index based on attribute A of R. Suppose the type of A is char(6), each address/pointer occupies 4 bytes, the size of each page on disk is 4KB (4,096 bytes), and each node in the B+ tree has an initial fill-factor of 75% (i.e., only 75% of each node/page in the B+ tree can be used initially, making the usable size of each page to be 4096 * 0.75 = 3072 bytes). Note that the 75% fill-factor is only used for nodes in the B+ tree, and pages used to store the tuples of R are all filled up to 100% capacity.

 

(a) (8 points) How many levels this B+ tree will have? Compute the number of nodes at each level. Show the basic computation steps.

 

(b) (6 points) Suppose your answer to question (a) is k (i.e., the B+ tree has k levels), what would be the maximum possible number of tuples this k-level B+ tree can accommodate?

 

(c) (8 points) Consider a selection condition “A > c”, where c is a constant. Suppose the number of distinct values greater than c under attribute A is 50. Suppose 1000 tuples in R satisfy this condition and each tuple occupies 200 bytes. What would be the maximum possible number of pages that need to be brought into the memory in order to find these 1000 qualified tuples if the B+ tree is a primary index? Please justify your answer. For this question, we assume that the B+ tree and the table R are entirely stored on disk initially. We also assume that indirection pages are used to handle repeating values and for simplicity, we assume 50 indirection pages are used for these 1000 qualified tuples, one for each distinct value of A that is greater than c.

 

(d) (5 points) The same question as in (c) except that this time the B+ tree is a secondary index.

 

2. (15 points) Consider the following SQL query: select A, B from R where B = 'b' and C = 'c', where R is a very large relation, A, B and C are attributes of R and b and c are two values under B and C, respectively. Suppose there is a primary index on B and a secondary index on C. Consider the following assumptions: (1) 1000 tuples satisfy condition B = ‘b’ and 20 tuples satisfy condition C = ‘c’; (2) the cost of a random IO is 10 times of that of a sequential I/O; (3) each data page can store 20 tuples of R; and (4) in the situation of secondary index, no two qualified tuples are in the same data page. Discuss how to determine which of the two indexes should be used to evaluate this query based on the above assumptions.

 

3. Consider the join R ∞R.A = S.B S, where R and S are two relations. Suppose R has 1,000 pages, S has 200 pages, and the memory buffer available for this join has 50 pages (excluding any buffer pages needed to handle the join results).

 

(a) (8 points) To minimize the number of I/O pages with rocking scan for the nested loop method, how should the memory buffer be allocated to R and S? Compute the number of I/O pages achieved by this method.

 

(b) (6 points) To minimize the number of I/O operations with rocking scan for the nested loop method, how should the memory buffer be allocated to R and S? Compute the number of I/O operations achieved by this method.

 

4. Consider the join R ¥R.A = S.B S, where R and S are two relations with 10000 pages and 6000 pages, respectively. Suppose the indexes on R.A and S.B are both primary indexes. Consider the following variations of the sort-merge join method:

 

(a) (8 points) What would be the number of I/O pages when no index is used to perform this join and why? For this question, we assume that at least one of the two joining attributes is a primary key.

 

(b) (8 points) Suppose this join will only yield 500 results. What would be the number of I/O pages needed in the worst case scenario if the B+tree indexes on R.A and S.B are used to perform the merge join? The worst case scenario happens when the 500 results are generated from 500 tuples in R and 500 tuples in S, and the matching tuples are all stored in different pages. For this question, we further assume that the bottom levels of the B+tree indexes on R.A and S.B have 1000 and 600 nodes (pages), respectively.

 

5. (28 points) Suppose there are three relations Departments(name, managerID, location), Projects(proj#, projname, budget, status) and Participate(deptname, proj#, startdate), where the primary key of each relation is underlined. Apply algebraic based optimization method to find an execution plan for the following query (you may assume that most projects have budget higher than $1.000.000 and very few projects are located in Binghamton):

 

select d.managerID

from Projects pr, Participate pa, Departments d

where pr.budget >= 1000000 and d.name = pa.deptname

      and pr.proj# = pa.proj# and d.location = 'Binghamton'

 

Show the initial query tree and the query tree after each optimization rule is applied. Show also the relational algebra expression corresponding to the fully optimized query tree.