Project #1 - Query Optimizer Evaluation

Overview

The first programming project will provide hands-on experience with real query optimizers. You will use Apache Calcite to optimize SQL queries that are then executed on Calcite (via the Enumerable adapter) and DuckDB. There are three components to this project:

  1. Understanding the boundary between query optimization and query execution
  2. Heuristic query optimization: rules and how to apply them
  3. Cost-based query optimization: statistics and how to use them

By the end of this project, you should:

We expect you to have basic proficiency with navigating API documentation (e.g., Calcite Javadoc). We provide Java starter code that uses Gradle with the Gradle Wrapper to simplify setting up your development environment. Your final deliverable is a Bash script optimize.sh that (1) reads a dataset and then (2) optimizes SQL statements for execution on our grading infrastructure. You are also required to submit a short Markdown document reflection.md that documents what you tried and what you found useful.

This is a conceptually heavy project that requires interaction with large real-world codebases. We encourage the use of AI coding assistance, but you are ultimately responsible for everything that you submit. We strongly advise you to start early!

This is a single-person project that will be completed individually (i.e., no groups).


Background

Make sure you understand the contents of the first two lectures before you start this project. Although not required, you may also benefit from reading the Calcite chapter in the EQOP Book (Section 3.3).

In this section, we will cover:

We recommend skimming this section to get the high-level idea. Read it in more detail after you start implementing your project.

Workloads

The starter code provides a setup script that downloads a tgz file containing a dataset. The autograder may test your code on different datasets than the one provided, but the dataset's structure will be the same. The dataset consists of two folders data/ and queries/.

Under the data/ folder, you will find:

Under the queries/ folder, you will find SQL queries that you must optimize. Each SQL query will be represented by a file foo.sql, and your corresponding output (details to follow) should be saved with the filename as a prefix (e.g., foo_plan.json, foo_optimized.sql).

You will be evaluated on a different dataset, so refrain from overfitting to the data or queries provided. You can use any technology you want for ingesting these Parquet files and computing statistics on them (i.e., you are allowed to use tools like pandas or DuckDB). Our grading infrastructure will run the workload on DuckDB.
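
For example, here is a minimal sketch (Java, using the DuckDB JDBC driver) of ingesting a Parquet file and computing a simple statistic; the path data/foo.parquet and the table name are placeholders for whatever your dataset actually contains:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class IngestSketch {
    public static void main(String[] args) throws Exception {
        // "jdbc:duckdb:" opens an in-memory DuckDB instance.
        try (Connection conn = DriverManager.getConnection("jdbc:duckdb:");
             Statement stmt = conn.createStatement()) {
            // DuckDB can query Parquet files directly via read_parquet.
            stmt.execute("CREATE TABLE foo AS SELECT * FROM read_parquet('data/foo.parquet')");
            try (ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM foo")) {
                rs.next();
                System.out.println("foo row count: " + rs.getLong(1));
            }
        }
    }
}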

DuckDB

DuckDB is an open-source analytical database management system. It is a single binary that you can download.

We will walk through basic DuckDB usage. First, we run DuckDB and have it persist data to a file named "test.db":

./duckdb test.db

Next, we generate some test data. We create table foo:

CREATE TABLE foo (foo1 int, foo2 int);

Fill foo with rows:

INSERT INTO foo SELECT i, i+15799 FROM generate_series(1, 100000) AS t(i);

Then we create table bar:

CREATE TABLE bar (bar1 int, bar2 int);

And likewise fill bar with rows:

INSERT INTO bar SELECT i, i+15799 FROM generate_series(1, 100000) AS t(i);

We then run ANALYZE to recompute table statistics:

ANALYZE;

At this point, we are done generating test data. We can view a list of commands that DuckDB supports by executing .help. For example, we can view all the tables created with .tables. We can also start executing SQL queries. We begin by using EXPLAIN to see how DuckDB will execute the following SQL query:

EXPLAIN SELECT foo1, bar2 FROM foo, bar WHERE foo1 = bar1 and foo1 < 50 and bar2 > 25;

On our system, DuckDB's EXPLAIN output shows that it picked a hash join and pushed the projections and filters down into the sequential scans. We will now time the SQL query by executing .timer on and then executing it without EXPLAIN:

SELECT foo1, bar2 FROM foo, bar WHERE foo1 = bar1 and foo1 < 50 and bar2 > 25;

On our machine, the real time reported was around 0.002s. Next, because we want to experiment with different optimizers in this project, we will disable DuckDB's optimizer entirely:

PRAGMA disable_optimizer;

Try running EXPLAIN and the actual SQL query again. You should notice that (1) a worse plan is generated and (2) the plan takes much longer to run. On our system, the run time increased from 0.002s to 8.959s. You can cancel execution with Ctrl-C.

In this project, one of our goals is to rewrite the SQL so that DuckDB will still choose a (somewhat) efficient plan even if its optimizer is disabled (in whole or in part). A real-world scenario for why this is useful is that it can allow you to improve a query plan without modifying the existing optimizer, which is often much more difficult.

Apache Calcite

Apache Calcite is an open-source framework that provides query parsing, validation, and optimization capabilities. Calcite is known for its modular and extensible query optimizer. However, it has limited support for executing queries directly. Thus, other projects often integrate Calcite as a frontend for its SQL parsing and query planning capabilities.

We will cover Calcite in class in Lecture #20.

Overview: SQL into relational algebra

We start by describing how Calcite transforms SQL.

Calcite's SqlParser converts a SQL string into a SqlNode. A SqlNode represents a SQL parse tree in Calcite. This parse tree can be "deparsed" (i.e., converted back to SQL) via its toSqlString method, which supports a large variety of SQL dialects.
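
A minimal sketch of this round trip, assuming Calcite's default parser configuration and (purely for illustration) the PostgreSQL dialect when deparsing:

import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.dialect.PostgresqlSqlDialect;
import org.apache.calcite.sql.parser.SqlParser;

public class ParseSketch {
    public static void main(String[] args) throws Exception {
        // SQL string -> SqlNode (parse tree).
        SqlNode parsed = SqlParser.create("SELECT foo1 FROM foo WHERE foo1 < 50").parseQuery();
        // SqlNode -> SQL string ("deparsing") in a chosen dialect.
        System.out.println(parsed.toSqlString(PostgresqlSqlDialect.DEFAULT).getSql());
    }
}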

The parse tree then needs to be validated (Calcite terminology; we called this a binder in class). This requires Calcite to know about your schema. Assuming that you loaded your data into a database system that supports JDBC, JdbcSchema provides a simple way to get started. However, you may need to rethink your approach later to provide statistics (e.g., define your own CalciteSchema). Either way, SqlValidator validates the parse tree and generates semantic information. The validate method returns a SqlNode.
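
One way to wire this up is sketched below; the DuckDB file name test.db, the schema name "duckdb", and the choice of pointing JdbcSchema at DuckDB are all assumptions you may revisit later (e.g., once you need custom statistics):

import java.sql.Connection;
import java.sql.DriverManager;
import javax.sql.DataSource;
import org.apache.calcite.adapter.jdbc.JdbcSchema;
import org.apache.calcite.jdbc.CalciteConnection;
import org.apache.calcite.schema.SchemaPlus;

public class SchemaSketch {
    static SchemaPlus rootSchema() throws Exception {
        // A Calcite JDBC connection; its root schema is where sub-schemas are registered.
        Connection connection = DriverManager.getConnection("jdbc:calcite:");
        CalciteConnection calciteConnection = connection.unwrap(CalciteConnection.class);
        SchemaPlus rootSchema = calciteConnection.getRootSchema();
        // Expose a JDBC-accessible database (here, a DuckDB file) as a Calcite schema.
        DataSource dataSource =
            JdbcSchema.dataSource("jdbc:duckdb:test.db", "org.duckdb.DuckDBDriver", null, null);
        rootSchema.add("duckdb", JdbcSchema.create(rootSchema, "duckdb", dataSource, null, null));
        return rootSchema;
    }
}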

Calcite's optimizer primarily works with RelNode objects. A RelNode represents a relational algebra expression in Calcite. Calcite provides a sql2rel package that translates SQL parse trees to relational expressions. There is also a rel2sql package for going back from a RelNode to a SqlNode.

Summarizing, we have that

SQL (string) <--> SqlNode (AST, possibly validated) <--> RelNode (relational expression)
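
Calcite's Frameworks/Planner utilities are one convenient way to drive this pipeline end to end. A rough sketch, assuming you already have a SchemaPlus with your tables registered (e.g., from the snippet above):

import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.RelRoot;
import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.Frameworks;
import org.apache.calcite.tools.Planner;

public class Sql2RelSketch {
    static RelNode toLogicalPlan(SchemaPlus dataSchema, String sql) throws Exception {
        FrameworkConfig config = Frameworks.newConfigBuilder()
            .defaultSchema(dataSchema)
            .build();
        Planner planner = Frameworks.getPlanner(config);
        SqlNode parsed = planner.parse(sql);          // SQL string -> parse tree
        SqlNode validated = planner.validate(parsed); // validation (the "binder" step)
        RelRoot root = planner.rel(validated);        // parse tree -> relational algebra
        return root.project();                        // unoptimized logical RelNode
    }
}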

Optimizing Relational Algebra

Suppose we have obtained a RelNode from sql2rel's SqlToRelConverter. RelOptUtil.dumpPlan is useful for viewing its structure; we suggest invoking it with SqlExplainFormat.TEXT and SqlExplainLevel.ALL_ATTRIBUTES during development. If you were to dump the RelNode now, you would see that it currently represents an unoptimized logical plan.

To optimize relational algebra expressions, Calcite has both a heuristic query optimizer (HepPlanner) and a cost-based query optimizer (VolcanoPlanner). These optimizers implement Calcite's RelOptPlanner interface. When creating a planner, you will need to register trait definitions (RelTraitDef) so that it knows about them during the optimization process.

Calcite comes with a large collection of planner rules for performing relational expression rewrites. For example, the FilterJoinRule pushes filters into a join node and/or its children nodes. Exploring these rules, and possibly implementing new ones, is a significant component of this project. You can start by examining the CoreRules.

Equipped with these rules, optimization can be performed by simply calling RelOptPlanner.findBestExp(). If the planner cannot find a plan (e.g., because you did not give it the necessary rules), it generally throws an informative exception that will help you to figure out what is missing.
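
As a minimal HepPlanner sketch (the two CoreRules here are chosen purely for illustration; picking a good rule set is the actual work):

import org.apache.calcite.plan.hep.HepPlanner;
import org.apache.calcite.plan.hep.HepProgram;
import org.apache.calcite.plan.hep.HepProgramBuilder;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.rules.CoreRules;

public class HeuristicSketch {
    static RelNode heuristicOptimize(RelNode logicalPlan) {
        HepProgram program = new HepProgramBuilder()
            .addRuleInstance(CoreRules.FILTER_INTO_JOIN) // push filters below joins
            .addRuleInstance(CoreRules.PROJECT_REMOVE)   // drop trivial projections
            .build();
        HepPlanner hepPlanner = new HepPlanner(program);
        hepPlanner.setRoot(logicalPlan);
        return hepPlanner.findBestExp();
    }
}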

At this point, it is important to understand Calcite's Convention. A Convention is a trait that represents the desired calling convention; loosely speaking, you can think of it as a compilation target for Calcite. For example, JdbcConvention specifies that operations should be pushed to JDBC where possible, enabling the rewrite of LogicalTableScan into JdbcTableScan. Therefore, the optimization result will depend on your planner's configured trait set.

In this project, we require you to optimize expressions into Calcite's EnumerableConvention. The Enumerable interface allows Calcite to iterate over data structures in Java (e.g., loop over an in-memory List<Object[]>). As we will see later, this allows us to execute SQL queries from within Calcite itself, albeit with the admittedly hacky approach of reading all of the data into memory first.
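
A rough sketch of cost-based optimization into Enumerable convention, assuming the RelNode's cluster was created with a VolcanoPlanner (as in the Frameworks setup above) and that the Enumerable conversion rules still need to be registered:

import org.apache.calcite.adapter.enumerable.EnumerableConvention;
import org.apache.calcite.adapter.enumerable.EnumerableRules;
import org.apache.calcite.plan.RelOptPlanner;
import org.apache.calcite.plan.RelTraitSet;
import org.apache.calcite.rel.RelNode;

public class CostBasedSketch {
    static RelNode toEnumerable(RelNode logicalPlan) {
        RelOptPlanner planner = logicalPlan.getCluster().getPlanner();
        // Register rules that convert Logical* nodes into Enumerable* nodes (add more as needed).
        planner.addRule(EnumerableRules.ENUMERABLE_TABLE_SCAN_RULE);
        planner.addRule(EnumerableRules.ENUMERABLE_PROJECT_RULE);
        planner.addRule(EnumerableRules.ENUMERABLE_FILTER_RULE);
        planner.addRule(EnumerableRules.ENUMERABLE_JOIN_RULE);
        planner.addRule(EnumerableRules.ENUMERABLE_AGGREGATE_RULE);
        planner.addRule(EnumerableRules.ENUMERABLE_SORT_RULE);
        // Ask for a plan whose root satisfies the Enumerable convention.
        RelTraitSet desired = logicalPlan.getTraitSet()
            .replace(EnumerableConvention.INSTANCE)
            .simplify();
        RelNode withTraits = planner.changeTraits(logicalPlan, desired);
        planner.setRoot(withTraits);
        return planner.findBestExp();
    }
}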

Improving Optimization with Statistics

At the time of writing, Calcite defaults to assuming that tables contain 100 rows. Calcite exposes various mechanisms for augmenting the metadata that rules have access to, such as RelMdRowCount for obtaining the number of rows. The extensibility makes it easy to add new kinds of metadata, but it gets complicated fast. Thus, while you should be aware of Calcite's metadata handlers and statistics providers, we do not require you to use them (though you are welcome to).

Instead, one simple way of fixing the cardinality estimate is to extend AbstractTable. By doing so, you can override getStatistic() to return your own Statistic object, at which point you can override getRowCount() directly. If you go this route, you will want to implement ScannableTable so that Calcite can convert LogicalTableScan to EnumerableTableScan.
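
A sketch of such a table; the hardcoded two INTEGER columns and the in-memory row list are placeholders for whatever you derive from the dataset:

import com.google.common.collect.ImmutableList;
import java.util.List;
import org.apache.calcite.DataContext;
import org.apache.calcite.linq4j.Enumerable;
import org.apache.calcite.linq4j.Linq4j;
import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rel.type.RelDataTypeFactory;
import org.apache.calcite.schema.ScannableTable;
import org.apache.calcite.schema.Statistic;
import org.apache.calcite.schema.Statistics;
import org.apache.calcite.schema.impl.AbstractTable;
import org.apache.calcite.sql.type.SqlTypeName;

public class InMemoryTable extends AbstractTable implements ScannableTable {
    private final List<Object[]> rows;

    public InMemoryTable(List<Object[]> rows) {
        this.rows = rows;
    }

    @Override
    public RelDataType getRowType(RelDataTypeFactory typeFactory) {
        // Placeholder schema: two INTEGER columns.
        return typeFactory.builder()
            .add("foo1", SqlTypeName.INTEGER)
            .add("foo2", SqlTypeName.INTEGER)
            .build();
    }

    @Override
    public Statistic getStatistic() {
        // Report the table's true cardinality instead of Calcite's default estimate.
        return Statistics.of(rows.size(), ImmutableList.of());
    }

    @Override
    public Enumerable<Object[]> scan(DataContext root) {
        return Linq4j.asEnumerable(rows);
    }
}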

In this project, we expect you to fix table row counts as a bare minimum. For statistics, the reference solution only fixes row counts. We expect you to try a combination of rules and additional statistics to beat the reference solution.

Executing Optimized Plans

On Calcite

Given that a RelNode is in Enumerable convention, we can use a RelRunner to execute it. Note that the RelRunner should be instantiated by unwrapping the Connection that has all the relevant schemas defined (i.e., connection.unwrap(RelRunner.class)). This allows us to obtain the ResultSet that answers the original SQL query.
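
A minimal sketch of this flow (the exact method name has varied across Calcite versions; recent releases use prepareStatement):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.tools.RelRunner;

public class ExecuteSketch {
    // connection: the Calcite connection whose root schema contains your tables.
    // enumerablePlan: a RelNode already optimized into Enumerable convention.
    static void run(Connection connection, RelNode enumerablePlan) throws Exception {
        RelRunner runner = connection.unwrap(RelRunner.class);
        try (PreparedStatement statement = runner.prepareStatement(enumerablePlan);
             ResultSet resultSet = statement.executeQuery()) {
            while (resultSet.next()) {
                // Serialize each row here (e.g., with the provided CSV starter code).
            }
        }
    }
}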

On Another System: Substrait

Although Calcite's Enumerable execution is one way of running the optimized plan, in practice, it is prohibitively expensive and infeasible to pull all of the data into memory. Instead, we would like Calcite to produce an optimized query plan that we can then run on another query engine, such as DuckDB or Apache DataFusion. The Substrait project works towards that goal by aiming to provide cross-language serialization for relational algebra. Practically speaking, Substrait plans provide a unified way of describing computation, which allows you to decouple your query optimizer from your query engine.

The Substrait Java project uses Calcite to convert SQL to Substrait plans in various formats (e.g., binary, JSON). These Substrait plans can then be fed to DuckDB or DataFusion and executed. However, at the time of writing, Substrait support is still in its early stages and has not reached full TPC-H support across these different systems. Given the early experimental status of these projects, we do not require you to support translating into Substrait. But you should definitely be aware it exists!

On DuckDB: SQL

Instead, to simplify this project, we will only require your query optimizer to output SQL. After disabling DuckDB's optimizer, we will then feed your optimized SQL to DuckDB directly.

You can convert your optimized RelNode back into SQL with rel2sql's RelToSqlConverter, followed by toSqlString. You should output SQL in a dialect that DuckDB understands.
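
A sketch, assuming a PostgreSQL-flavored dialect on the grounds that DuckDB accepts largely PostgreSQL-compatible SQL; verify the output against your queries:

import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.rel2sql.RelToSqlConverter;
import org.apache.calcite.sql.SqlDialect;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.dialect.PostgresqlSqlDialect;

public class Rel2SqlSketch {
    static String toDuckDbSql(RelNode optimizedPlan) {
        SqlDialect dialect = PostgresqlSqlDialect.DEFAULT;
        RelToSqlConverter converter = new RelToSqlConverter(dialect);
        // RelNode -> SqlNode -> SQL string in the chosen dialect.
        SqlNode sqlNode = converter.visitRoot(optimizedPlan).asStatement();
        return sqlNode.toSqlString(dialect).getSql();
    }
}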

Summary

You made it to the end of the background section! Here's a summary:


Instructions

We provide starter code that sets up a build system and suggests a project structure.

Note: we provide starter code to try to reduce tedious aspects of the project. However, we expect you to be able to read, understand, debug, and modify all of the starter code.

Development Environment

This project only supports Ubuntu 22.04 (LTS) and OpenJDK 17. Use other environments at your own peril: since this is an advanced graduate course, we assume that you know what you are doing. Your code must work in a stock Ubuntu 22.04 (LTS) with OpenJDK 17 environment.

This project involves understanding code that you did not write. We found AI code assistance to be helpful in attempting this project, so you may want to give it a try. However, you are still ultimately responsible for understanding everything that you submit.

Reflection

In the root of your submission, please include a file reflection.md that describes:

You do not need to write much; bullet points are fine.

Roadmap

We suggest proceeding in this order.

  1. Create reflection.md and take notes as you go along.
  2. Load and analyze the dataset.
  3. Create a schema for Calcite; don't worry about statistics for now.
  4. Get Calcite to parse SQL queries and print them to stdout.
  5. Enable the heuristic optimizer with a few rules, test that it works.
  6. Experiment with the rules.
  7. Implement support for going back to SQL.
  8. Support RelRunner execution.
  9. Enable the cost-based query optimizer.
  10. Print the cardinality of a table. If you get 100, you may be using the hardcoded default.
  11. Implement statistics (at least support actual table cardinalities).
  12. Experiment with the rules more, implement better statistics, etc.

Submission

To grade your submission, we will invoke:

./optimize.sh input_workload.tgz output_dir

For each query queries/foo.sql in the input workload, we expect your optimize.sh to write the following files in output_dir/:

You are required to produce the TXT files with:

RelOptUtil.dumpPlan("", relNode, SqlExplainFormat.TEXT, SqlExplainLevel.ALL_ATTRIBUTES)

You are required to use our provided starter code for serializing your ResultSet to CSV (feel free to copy-paste it into your code).

We provide a Makefile for submitting to Gradescope. Run make submit to generate submission.zip.

Important: Use the Gradescope course code announced on Piazza.

There is a Gradescope submission site available to non-CMU students (Entry Code: R7GNG2).


Grading Rubric

Each SQL query in your project submission will be graded in two phases: correctness and performance.

Correctness

Your implementation must pass this portion of the evaluation. If you fail any of these checks, then the submission is scored as zero. No partial credit will be given.

Performance

If your submission satisfies the correctness checks, then your score will be further adjusted based on the difference in performance (latency) between your implementation and the TA's reference implementation.

Your final grade is based on how much faster/slower you are than our reference implementation (median of 10 runs, timeout at 5x slower than the reference implementation):

Difference    Your Grade
>125%         110%
115-124%      100%
105-114%      90%
95-104%       80%
85-94%        70%
75-84%        60%
<75%          0%

There is no submission limit. We may further adjust your final score based on your submission (e.g., more points for trying something impressive that didn't work out, fewer points for trying to hardcode answers).


Late Policy

25% will be deducted from the final score of the project for every 24-hour period that the assignment is late.

Only in extreme circumstances (e.g., medical emergencies) will no-penalty extensions be granted. The student is required to provide written documentation from the University health center. Please contact the instructor if you have any questions.

Collaboration Policy

WARNING: All of the code for this project must be your own. You may not copy source code from other students or other sources that you find on the web. Plagiarism will not be tolerated. See CMU's Policy on Academic Integrity for additional information.