Project #1 - Query Optimizer Evaluation
Overview
The first programming project will provide hands-on experience with real query optimizers. You will use Apache Calcite to optimize SQL queries that are then executed on Calcite (via the Enumerable adapter) and DuckDB. There are three components to this project:
- Understanding the boundary between query optimization and query execution
- Heuristic query optimization: rules and how to apply them
- Cost-based query optimization: statistics and how to use them
By the end of this project, you should:
- Know how to use Apache Calcite to optimize SQL queries using either a heuristic or cost-based approach.
- Understand how to integrate Apache Calcite as a query optimizer for a different system.
- Understand the impact of statistics on query optimization.
- Understand and apply common query optimization rewrite rules to optimize logical expressions.
- Appreciate the performance difference obtained through good query optimization.
We expect you to have basic proficiency with navigating API documentation (e.g., the Calcite Javadoc).
We provide Java starter code that uses Gradle with the Gradle Wrapper to simplify setting up your development environment. Your final deliverable is a Bash script optimize.sh that (1) reads a dataset and then (2) optimizes SQL statements for execution on our grading infrastructure. You are also required to submit a short Markdown document reflection.md that documents what you tried and what you found useful.
This is a conceptually heavy project that requires interaction with large real-world codebases. We encourage the use of AI coding assistance, but you are ultimately responsible for everything that you submit. We strongly advise you to start early!
This is a single-person project that will be completed individually (i.e., no groups).
- Release Date: Jan 22, 2025
- Due Date: Feb 28, 2025
Background
Make sure you understand the contents of the first two lectures before you start this project. Although not required, you may also benefit from reading the Calcite chapter in the EQOP Book (Section 3.3).
In this section, we will cover:
- The workloads that we are running.
- The system that we are running the workloads on (DuckDB).
- The framework that you will use to optimize the workloads' queries (Calcite).
We recommend skimming this section to get the high-level idea. Read it in more detail after you start implementing your project.
Workloads
The starter code provides a setup script that downloads a tgz file containing a dataset. The autograder may test your code on different datasets than the one provided, but the dataset's structure will be the same. The dataset consists of two folders, data/ and queries/.
Under the data/ folder, you will find:
- Table data formatted as Apache Parquet files,
- A schema.sql script that describes and creates all the necessary tables, and
- A load.sql script that loads the Parquet files into said tables.
Under the queries/ folder, you will find SQL queries that you must optimize. Each SQL query is represented by a file foo.sql, and your corresponding output files (details to follow) should be saved with that filename as a prefix (e.g., foo_optimized.txt, foo_optimized.sql).
You will be evaluated on a different dataset, so refrain from overfitting to the data or queries provided. You can use any technology you want for ingesting these Parquet files and computing statistics on them (i.e., you are allowed to use tools like pandas or DuckDB). Our grading infrastructure will run the workload on DuckDB.
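For example, here is a rough sketch of collecting per-table row counts from the Parquet files using DuckDB's JDBC driver from Java. This is only one option among many; the driver dependency and the file name data/foo.parquet are assumptions for illustration:
// Sketch: count rows in a Parquet file with an in-memory DuckDB via JDBC.
// Assumes the duckdb_jdbc artifact is on the classpath; data/foo.parquet is a hypothetical path.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public final class RowCounter {
    public static void main(String[] args) throws Exception {
        Class.forName("org.duckdb.DuckDBDriver");  // may be unnecessary with recent drivers
        try (Connection conn = DriverManager.getConnection("jdbc:duckdb:");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT count(*) FROM read_parquet('data/foo.parquet')")) {
            rs.next();
            System.out.println("row count = " + rs.getLong(1));
        }
    }
}
You could just as easily do this with a short pandas or DuckDB CLI script; what matters is that the statistics you compute match the dataset your optimizer is handed.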
DuckDB
DuckDB is an open-source analytical database management system. It is a single binary that you can download.
We will walk through basic DuckDB usage. First, we run DuckDB and have it persist data to a file named "test.db":
./duckdb test.db
Next, we generate some test data. We create table foo:
CREATE TABLE foo (foo1 int, foo2 int);
Fill foo with rows:
INSERT INTO foo SELECT i, i+15799 from generate_series(1, 100000) AS t(i);
Then we create table bar:
CREATE TABLE bar (bar1 int, bar2 int);
And likewise fill bar with rows:
INSERT INTO bar SELECT i, i+15799 FROM generate_series(1, 100000) AS t(i);
We then run ANALYZE to recompute table statistics:
ANALYZE;
At this point, we are done generating test data.
We can view a list of commands that DuckDB supports by executing .help. For example, we can view all the tables created with .tables.
We can also start executing SQL queries.
We begin by using EXPLAIN to see how DuckDB
will execute the following SQL query:
EXPLAIN SELECT foo1, bar2 FROM foo, bar WHERE foo1 = bar1 and foo1 < 50 and bar2 > 25;
On our system, DuckDB's EXPLAIN output shows that it picked a hash join and pushed the projections
and filters down into the sequential scans.
We will now time the SQL query by executing .timer on and then executing the query without EXPLAIN:
SELECT foo1, bar2 FROM foo, bar WHERE foo1 = bar1 and foo1 < 50 and bar2 > 25;
On our machine, the real time reported was around 0.002s. Next, because we want to experiment with different optimizers in this project, we will disable DuckDB's optimizer entirely:
PRAGMA disable_optimizer;
Try running EXPLAIN and the actual SQL query again.
You should notice that (1) a worse plan is generated and (2) the plan takes much longer to run.
On our system, the run time increased from 0.002s to 8.959s.
You can cancel execution with Ctrl-C.
In this project, one of our goals is to rewrite the SQL so that DuckDB will still choose a (somewhat) efficient plan even if its optimizer is disabled (in whole or in part). A real-world scenario for why this is useful is that it can allow you to improve a query plan without modifying the existing optimizer, which is often much more difficult.
Apache Calcite
Apache Calcite is an open-source framework that provides query parsing, validation, and optimization capabilities. Calcite is known for its modular and extensible query optimizer. However, it has limited support for executing queries directly. Thus, Calcite is often integrated by other projects as a frontend for its SQL parsing and query planning capabilities.
We will cover Calcite in class in Lecture #20.
Overview: SQL into relational algebra
We start by describing how Calcite transforms SQL.
Calcite's SqlParser converts a SQL string into a SqlNode. A SqlNode represents a SQL parse tree in Calcite. This parse tree can be "deparsed" (i.e., converted back to SQL) via its toSqlString method, which supports a large variety of SQL dialects.
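For instance, a minimal parse/deparse round trip looks roughly like this (the dialect choice is only illustrative):
// Sketch: parse a SQL string into a SqlNode, then render it back to SQL in a chosen dialect.
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.dialect.PostgresqlSqlDialect;
import org.apache.calcite.sql.parser.SqlParser;

public final class ParseDemo {
    public static void main(String[] args) throws Exception {
        SqlNode node = SqlParser.create("SELECT foo1 FROM foo WHERE foo1 < 50").parseQuery();
        System.out.println(node.toSqlString(PostgresqlSqlDialect.DEFAULT).getSql());
    }
}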
The parse tree then needs to be validated (Calcite terminology; we called this a binder in class). This requires Calcite to know about your schema. Assuming that you loaded your data into a database system that supports JDBC, JdbcSchema provides a simple way to get started. However, you may need to rethink your approach later to provide statistics (e.g., define your own CalciteSchema). Either way, SqlValidator validates the parse tree and generates semantic information. The validate method returns a SqlNode.
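A rough sketch of the JdbcSchema route is shown below. The JDBC URL and driver class are assumptions for illustration, not the required setup; verify that Calcite's JDBC adapter cooperates with whatever driver you choose:
// Sketch: register a JDBC-backed schema under a fresh Calcite root schema.
import javax.sql.DataSource;
import org.apache.calcite.adapter.jdbc.JdbcSchema;
import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.tools.Frameworks;

public final class SchemaSetup {
    public static SchemaPlus buildRootSchema() {
        SchemaPlus rootSchema = Frameworks.createRootSchema(true);
        DataSource dataSource = JdbcSchema.dataSource(
            "jdbc:duckdb:test.db",      // assumed JDBC URL; substitute your own data source
            "org.duckdb.DuckDBDriver",  // assumed driver class
            null, null);                // username, password
        rootSchema.add("db", JdbcSchema.create(rootSchema, "db", dataSource, null, null));
        return rootSchema;
    }
}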
Calcite's optimizer primarily works with RelNode objects. A RelNode represents a relational algebra expression in Calcite. Calcite provides a sql2rel package that translates SQL parse trees to relational expressions. There is also a rel2sql package for going back from a RelNode to a SqlNode.
Summarizing, we have that
SQL (string) <--> SqlNode (AST, possibly validated) <--> RelNode (relational expression)
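To make this summary concrete, here is a minimal sketch using Calcite's Frameworks/Planner wrapper, which strings the parser, validator, and sql2rel converter together for you. The starter code does not have to be structured this way; this only illustrates the pipeline, reusing a schema like the one sketched above:
// Sketch: SQL string -> SqlNode (parsed) -> SqlNode (validated) -> RelNode (logical plan).
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.Frameworks;
import org.apache.calcite.tools.Planner;

public final class SqlToRel {
    public static RelNode toRel(SchemaPlus defaultSchema, String sql) throws Exception {
        FrameworkConfig config = Frameworks.newConfigBuilder()
            .defaultSchema(defaultSchema)
            .build();
        Planner planner = Frameworks.getPlanner(config);
        SqlNode parsed = planner.parse(sql);          // SQL string -> parse tree
        SqlNode validated = planner.validate(parsed); // the binder/validator step
        return planner.rel(validated).project();      // relational algebra expression
    }
}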
Optimizing Relational Algebra
Suppose we have obtained a RelNode from sql2rel's SqlToRelConverter. RelOptUtil.dumpPlan is useful for viewing its structure; we suggest invoking it with SqlExplainFormat.TEXT and SqlExplainLevel.ALL_ATTRIBUTES during development. If you were to dump the RelNode now, you would see that it currently represents an unoptimized logical plan.
To optimize relational algebra expressions, Calcite has both a heuristic query optimizer (HepPlanner) and a cost-based query optimizer (VolcanoPlanner). These optimizers implement Calcite's RelOptPlanner interface. When creating a planner, you will need to register trait definitions (RelTraitDef) so that it knows about them during the optimization process.
Calcite comes with a large collection of planner rules for performing relational expression rewrites. For example, the FilterJoinRule pushes filters into a join node and/or its children nodes. Exploring these rules, and possibly implementing new ones, is a significant component of this project. You can start by examining the CoreRules.
Equipped with these rules, optimization can be performed by simply calling RelOptPlanner.findBestExp(). If the planner cannot find a plan (e.g., because you did not give it the necessary rules), it generally throws an informative exception that will help you to figure out what is missing.
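For example, a minimal heuristic pass might look like the sketch below; the two rules here are only illustrative, and choosing a good rule set is the interesting part:
// Sketch: apply a couple of rewrite rules with the heuristic (HepPlanner) optimizer.
import org.apache.calcite.plan.hep.HepPlanner;
import org.apache.calcite.plan.hep.HepProgram;
import org.apache.calcite.plan.hep.HepProgramBuilder;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.rules.CoreRules;

public final class HeuristicOptimizer {
    public static RelNode optimize(RelNode logicalPlan) {
        HepProgram program = new HepProgramBuilder()
            .addRuleInstance(CoreRules.FILTER_INTO_JOIN)  // push filters below joins
            .addRuleInstance(CoreRules.PROJECT_MERGE)     // collapse adjacent projections
            .build();
        HepPlanner planner = new HepPlanner(program);
        planner.setRoot(logicalPlan);
        return planner.findBestExp();
    }
}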
At this point, it is important to understand Calcite's Convention. A Convention is a trait that represents the desired calling convention; loosely speaking, you can think of it as a compilation target for Calcite. For example, JdbcConvention specifies that operations should be pushed to JDBC where possible, enabling the rewrite of LogicalTableScan into JdbcTableScan. Therefore, the optimization result will depend on your planner's configured trait set.
In this project, we require you to optimize expressions into Calcite's EnumerableConvention. The Enumerable interface allows Calcite to iterate over data structures in Java (e.g., loop over an in-memory List<Object[]>). With an admittedly hacky approach of reading all the data into memory first, we will see later that this allows us to execute SQL queries from within Calcite itself.
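A rough sketch of the cost-based path is shown below. It assumes the RelNode's cluster was created with a VolcanoPlanner and that the relevant trait definitions are already registered; only a few Enumerable conversion rules are listed, and your queries will likely need more:
// Sketch: ask the cost-based planner for the cheapest equivalent plan in Enumerable convention.
import org.apache.calcite.adapter.enumerable.EnumerableConvention;
import org.apache.calcite.adapter.enumerable.EnumerableRules;
import org.apache.calcite.plan.RelOptPlanner;
import org.apache.calcite.plan.RelTraitSet;
import org.apache.calcite.rel.RelNode;

public final class CostBasedOptimizer {
    public static RelNode optimize(RelNode logicalPlan) {
        RelOptPlanner planner = logicalPlan.getCluster().getPlanner();
        planner.addRule(EnumerableRules.ENUMERABLE_TABLE_SCAN_RULE);
        planner.addRule(EnumerableRules.ENUMERABLE_PROJECT_RULE);
        planner.addRule(EnumerableRules.ENUMERABLE_FILTER_RULE);
        planner.addRule(EnumerableRules.ENUMERABLE_JOIN_RULE);
        // ...plus whatever other logical and Enumerable rules your queries require.
        RelTraitSet desired = logicalPlan.getTraitSet().replace(EnumerableConvention.INSTANCE);
        RelNode withTraits = planner.changeTraits(logicalPlan, desired);
        planner.setRoot(withTraits);
        return planner.findBestExp();
    }
}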
Improving Optimization with Statistics
At the time of writing, Calcite defaults to assuming that tables contain 100 rows. Calcite exposes various mechanisms for augmenting the metadata that rules have access to, such as RelMdRowCount for obtaining the number of rows. The extensibility makes it easy to add new kinds of metadata, but it gets complicated fast. Thus, while you should be aware of Calcite's metadata handlers and statistics providers, we do not require you to use them (though you are welcome to).
Instead, one simple way of fixing the cardinality estimate is to extend AbstractTable. By doing so, you can override getStatistic() to return your own Statistic object, at which point you can override getRowCount() directly. If you go this route, you will want to implement ScannableTable so that Calcite can convert LogicalTableScan to EnumerableTableScan.
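A bare-bones sketch of that approach follows. The column layout and in-memory rows are hypothetical placeholders; in your project the row type and data would come from the Parquet files:
// Sketch: a table that reports its true row count and can be scanned by the Enumerable convention.
import java.util.Collections;
import java.util.List;
import org.apache.calcite.DataContext;
import org.apache.calcite.linq4j.Enumerable;
import org.apache.calcite.linq4j.Linq4j;
import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rel.type.RelDataTypeFactory;
import org.apache.calcite.schema.ScannableTable;
import org.apache.calcite.schema.Statistic;
import org.apache.calcite.schema.Statistics;
import org.apache.calcite.schema.impl.AbstractTable;
import org.apache.calcite.sql.type.SqlTypeName;

public class InMemoryTable extends AbstractTable implements ScannableTable {
    private final List<Object[]> rows;  // all tuples pulled into memory
    private final double rowCount;      // real cardinality computed offline

    public InMemoryTable(List<Object[]> rows, double rowCount) {
        this.rows = rows;
        this.rowCount = rowCount;
    }

    @Override public RelDataType getRowType(RelDataTypeFactory typeFactory) {
        // Placeholder schema: two INTEGER columns, mirroring the foo(foo1, foo2) example.
        return typeFactory.builder()
            .add("FOO1", SqlTypeName.INTEGER)
            .add("FOO2", SqlTypeName.INTEGER)
            .build();
    }

    @Override public Statistic getStatistic() {
        // Report the true row count instead of Calcite's default guess.
        return Statistics.of(rowCount, Collections.emptyList());
    }

    @Override public Enumerable<Object[]> scan(DataContext root) {
        return Linq4j.asEnumerable(rows);
    }
}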
In this project, we expect you to fix table row counts as a bare minimum. For statistics, the reference solution only fixes row counts. We expect you to try a combination of rules and additional statistics to beat the reference solution.
Executing Optimized Plans
On Calcite
Given that a RelNode is in Enumerable convention, we can use a RelRunner to execute it. Note that the RelRunner should be instantiated by unwrapping the Connection that has all the relevant schemas defined (i.e., connection.unwrap(RelRunner.class)). This allows us to obtain the ResultSet that answers the original SQL query.
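A minimal sketch follows; depending on your Calcite version, the method may be prepare rather than prepareStatement:
// Sketch: execute an Enumerable-convention plan on the Calcite connection that owns the schema.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.tools.RelRunner;

public final class EnumerableExecutor {
    public static ResultSet execute(Connection calciteConnection, RelNode enumerablePlan)
            throws Exception {
        RelRunner runner = calciteConnection.unwrap(RelRunner.class);
        PreparedStatement statement = runner.prepareStatement(enumerablePlan);
        return statement.executeQuery();
    }
}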
On Another System: Substrait
Although Calcite's Enumerable execution is one way of running the optimized plan, in practice, it is prohibitively expensive and infeasible to pull all of the data into memory. Instead, we would like Calcite to produce an optimized query plan that we can then run on another query engine, such as DuckDB or Apache DataFusion. The Substrait project works towards that goal by aiming to provide cross-language serialization for relational algebra. Practically speaking, Substrait plans provide a unified way of describing computation, which allows you to decouple your query optimizer from your query engine.
The Substrait Java project uses Calcite to convert SQL to Substrait plans in various formats (e.g., binary, JSON). These Substrait plans can then be fed to DuckDB or DataFusion and executed. However, at the time of writing, Substrait support is still in its early stages and has not reached full TPC-H support across these different systems. Given the early experimental status of these projects, we do not require you to support translating into Substrait. But you should definitely be aware it exists!
On DuckDB: SQL
Instead, to simplify this project, we will only require your query optimizer to output SQL. After disabling DuckDB's optimizer, we will then feed your optimized SQL to DuckDB directly.
You can convert your optimized RelNode back into SQL with rel2sql's RelToSqlConverter, followed by toSqlString. You should output SQL in a dialect that DuckDB understands.
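A sketch of that conversion is below. PostgresqlSqlDialect is an assumption (DuckDB's SQL is Postgres-like, but you should verify the output actually runs), and the exact RelToSqlConverter entry point has shifted slightly across Calcite versions:
// Sketch: RelNode -> SqlNode -> SQL text in a dialect DuckDB should accept.
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.rel2sql.RelToSqlConverter;
import org.apache.calcite.sql.SqlDialect;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.dialect.PostgresqlSqlDialect;

public final class SqlDeparser {
    public static String toSql(RelNode optimizedPlan) {
        SqlDialect dialect = PostgresqlSqlDialect.DEFAULT;  // assumed close enough to DuckDB
        RelToSqlConverter converter = new RelToSqlConverter(dialect);
        SqlNode sqlNode = converter.visitRoot(optimizedPlan).asStatement();
        return sqlNode.toSqlString(dialect).getSql();
    }
}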
Summary
You made it to the end of the background section! Here's a summary:
- We will test a variety of workloads, distributed as .tgz files.
- We will run workloads on DuckDB with its optimizer disabled.
- We want you to use Calcite to optimize SQL queries into EnumerableConvention plans.
- We want you to execute those optimized EnumerableConvention plans on Calcite.
- We want you to deparse those optimized EnumerableConvention plans back into SQL so that we can run them on DuckDB.
- We expect you to beat the reference solution using better rules and/or better statistics.
Instructions
We provide starter code that sets up a build system and suggests a project structure.
Note: we provide starter code to try to reduce tedious aspects of the project. However, we expect you to be able to read, understand, debug, and modify all of the starter code.
Development Environment
This project only supports Ubuntu 22.04 (LTS) and OpenJDK 17. Use other environments at your own peril: since this is an advanced graduate course, we assume that you know what you are doing. Your code must work in a stock Ubuntu 22.04 (LTS) with OpenJDK 17 environment.
This project involves understanding code that you did not write. We found AI code assistance to be helpful in attempting this project, so you may want to give it a try. However, you are still ultimately responsible for understanding everything that you submit.
Reflection
In the root of your submission, please include a file reflection.md that describes:
- What you tried -- we will take effort into account when assigning final grades
- What worked
- (Optional) Feedback on this project (complexity, difficulty, duration, suggestions for future iterations)
You do not need to write much; bullet points are fine.
Roadmap
We suggest proceeding in this order.
- Create reflection.md and take notes as you go along.
- Load and analyze the dataset.
- Create a schema for Calcite; don't worry about statistics for now.
- Get Calcite to parse SQL queries and print them to stdout.
- Enable the heuristic optimizer with a few rules, test that it works.
- Experiment with the rules.
- Implement support for going back to SQL.
- Support RelRunner execution.
- Enable the cost-based query optimizer.
- Print the cardinality of a table. If you get 100, you may be using the hardcoded default.
- Implement statistics (at least support actual table cardinalities).
- Experiment with the rules more, implement better statistics, etc.
Submission
To grade your submission, we will invoke:
./optimize.sh input_workload.tgz output_dir
For each query queries/foo.sql in the input workload, we expect your optimize.sh to write the following files in output_dir/:
- foo.sql: the original SQL query
- foo.txt: the initial RelNode plan of the original SQL query (Logical, before any optimizations)
- foo_optimized.txt: the final optimized RelNode plan (Enumerable, after all your optimizations)
- foo_results.csv: the results of executing your optimized plan in Calcite
- foo_optimized.sql: your optimized plan deparsed into a SQL query
You are required to produce the TXT files with:
RelOptUtil.dumpPlan("", relNode, SqlExplainFormat.TEXT, SqlExplainLevel.ALL_ATTRIBUTES)
You are required to use our provided starter code for serializing your ResultSet to CSV (feel free to copy-paste it into your code).
We provide a Makefile for submitting to Gradescope. Run make submit to generate submission.zip.
Important: Use the Gradescope course code announced on Piazza.
There is a Gradescope submission site available to non-CMU students (Entry Code: R7GNG2).
Grading Rubric
Each SQL query in your project submission will be graded in two phases:
Correctness
- Does the rewritten SQL query produce the same answer as the original?
Your implementation must pass this portion of the evaluation. If you fail any of these checks, then the submission is scored as zero. No partial credit will be given.
Performance
If your submission satisfies the correctness checks, then your points will be further adjusted based on the difference in performance (latency) between your implementation and the TA's implementation.
Your final grade is based on how much faster/slower you are than our reference implementation (median of 10 runs, timeout at 5x slower than the reference implementation):
Difference | Your Grade |
---|---|
>125% | 110% |
115-124% | 100% |
105-114% | 90% |
95-104% | 80% |
85-94% | 70% |
75-84% | 60% |
<75% | 0% |
There is no submission limit. We may further adjust your final score based on your submission (e.g., more points for trying something impressive that didn't work out, fewer points for trying to hardcode answers).
Late Policy
25% will be deducted from the final score of the project for every 24-hour period that the assignment is late.
Only in extreme circumstances (e.g., medical emergencies) will no-penalty extensions be granted. The student is required to provide written documentation from the University health center. Please contact the instructor if you have any questions.
Collaboration Policy
- Every student has to work individually on this assignment.
- Students are allowed to discuss high-level details about the project with others.
- Students are not allowed to copy the contents of a white-board after a group meeting with other students.
- Students are not allowed to copy the solutions from another colleague.
WARNING: All of the code for this project must be your own. You may not copy source code from other students or other sources that you find on the web. Plagiarism will not be tolerated. See CMU's Policy on Academic Integrity for additional information.