cs5200

Problem Set 4


Submit a PDF file with your typeset answers and an SQL file with your load commands, queries and index creation commands. You may also include a PS4.data file if you use bulk loading for your data.

Database Use findopendata.com or your favorite website to find three large, joinable relational tables. Load into PostgreSQL. Your tables should be large (at least one with over 100,000 tuples). To understand what query plan is being used, PostgreSQL includes the EXPLAIN command. It prints the plan for a query, including all of the physical operators and access methods being used. You will need to create some indices on your databases.


Query Processing and Optimization

1. On your database write two queries in SQL of the following form using two different values of the literal x. Pick two values of x, call them x1 and x2, for which the optimizer picks different plans.

SELECT R.a

FROM R, S

WHERE R.b = S.c AND S.d > x;

(a) Draw the two query plans (with method used for each operator) for x1 and x2.

(b) Experiment with different values between x1 and x2 and find the transition point (the value of x) at which the optimizer changes the plan. Let’s call the crossover point x0. Explain in a sentence or two why the query plan changes. Why does it decide to switch plans at that specific value? Note, your tables need to be large enough for the optimizer to change plans and you may need to create/drop indices. State all indices you used. Please be as quantitative as possible in your explanation.

(c) Compare the actual running times corresponding to the two alternative query plans at the crossover point. How much do they differ? Inside psql, you can measure the query time by using the \timing command to turn timing on for all queries. To get accurate timing, you may also wish to redirect output to /dev/null, using the command \o /dev/null; you can stop redirection just by issuing the \o command without a file name. You may also wish to run each query several times, throwing out the first time (which may be longer as data is loaded into the buffer pool) and averaging the remaining runs.

(d) Based on your answers to the previous questions, is x0 actually the best place to switch plans, or is it an overestimate/underestimate of the best crossover point? If x0 is not the actual crossover point, can you estimate the actual best crossover point without running queries against the database? State assumptions underlying your answer, if any.


2. Write a query that includes a three-way join (join of three tables).

(a) On a database with no indices examine the query plan that PostgreSQL uses to answer the query. (Note you may need to drop constraints in order to drop an index.) Draw the query plan and explain why PostgreSQL picks this plan. In particular, it is important to explain the join order PostgreSQL has picked and why it is better than other join orders. Give the average runtime of this query over a few runs (again discarding the first).

(b) Now add one index that you think will be most advantageous for this query and that leads to a different query plan. Draw the query plan and explain why PostgreSQL picks this plan. Give the average runtime of this query over a few runs (again discarding the first) and explain why it is faster than the no index plan.