Overview
The first programming project will provide hands-on experience with optimizing and tuning a database management system (DBMS). Your assignment is to build an automatic index tuning tool for PostgreSQL. The goal of the project is to familiarize yourself with the basic concepts of index selection for databases:
- Capturing workload traces from a DBMS.
- Analyzing queries and extracting salient aspects of their contents (e.g., columns referenced in the WHERE clause).
- Enumerating candidate indexes and estimating their cost/benefit.
- Applying changes and measuring the workload improvement.
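To make these steps concrete, here is a minimal sketch of the first two: pulling filter columns out of a workload trace and turning the most common ones into single-column index candidates. The regex-based parsing and the table/column names are purely illustrative; a real tool should use a proper SQL parser.

```python
import re
from collections import Counter

def where_columns(query: str) -> list[str]:
    """Naively pull column names out of a query's WHERE clause.
    A real tool should use a proper SQL parser instead of a regex."""
    match = re.search(r"\bWHERE\b(.*?)(?:\bGROUP\b|\bORDER\b|\bLIMIT\b|;|$)",
                      query, re.IGNORECASE | re.DOTALL)
    if not match:
        return []
    # Grab identifiers that appear on the left side of a comparison.
    return re.findall(r"([A-Za-z_][A-Za-z0-9_.]*)\s*(?:=|<|>|LIKE)",
                      match.group(1))

def candidate_indexes(trace: list[str], table: str) -> list[str]:
    """Rank single-column candidate indexes by how often each column
    appears as a filter across the workload trace."""
    counts = Counter(col for q in trace for col in where_columns(q))
    return [f"CREATE INDEX idx_{table}_{col} ON {table} ({col});"
            for col, _ in counts.most_common()]

trace = [
    "SELECT * FROM item WHERE i_id = 5;",
    "SELECT * FROM item WHERE i_id = 9 AND title LIKE 'a%';",
]
print(candidate_indexes(trace, "item"))
# → ['CREATE INDEX idx_item_i_id ON item (i_id);',
#    'CREATE INDEX idx_item_title ON item (title);']
```

A real tool would follow this with cost estimation (e.g., via the DBMS's query planner) before applying any of the candidates.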
This is a single-person project that will be completed individually (i.e., no groups).
- Release Date: Jan 22, 2020
- Due Date: Mar 16, 2020 @ 11:59pm
Environment Setup
We first provide you with instructions on how to set up your local development environment. To ensure that you have complete control over the DBMS and environment, we advise you to not use CMU's shared Andrew machines for this project.
Development Environment
You should start with a fresh installation of PostgreSQL (v14). It is available on every (reasonable) OS. For these instructions, we will assume that you are running on Ubuntu. You will want to use the official PostgreSQL package repository to ensure that you get the latest version of the DBMS.
sudo sh -c 'echo "deb [arch=amd64] http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
sudo apt-get update
sudo apt-get -y install postgresql-14
Next, you need to update the default knob configuration using a rule-based tuner like PGConfig or PGTune. Append the recommended knob settings to the file /etc/postgresql/14/main/postgresql.conf and restart the DBMS:
sudo service postgresql restart
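The exact settings depend on your hardware. As an illustration only, a rule-based tuner targeting a 16GB machine might suggest values along these lines (treat these numbers as placeholders, not recommendations):

```
shared_buffers = 4GB
effective_cache_size = 12GB
maintenance_work_mem = 1GB
work_mem = 64MB
random_page_cost = 1.1
effective_io_concurrency = 200
```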
Benchmark Environment
Because you will need to evaluate your tool on the same hardware as other students in the course, we are providing every student with credits for Amazon AWS. You will be able to deploy an instance, install dependencies, and run your experiments.
You will want to use the m5d.xlarge instance type (4 vCPUs, 16GB of RAM). We picked this instance size because its limited amount of RAM means that the DBMS will not be able to keep the entire database (tables, indexes) in memory for each workload. Therefore, you should see a noticeable difference in the DBMS's performance with the right configuration.
You will need to make sure that you monitor your credits so that you do not run out of money. Use spot instances to reduce your costs. Set up billing alerts to make sure that you are not charged for resources that you forgot to disable.
WARNING: We will not be able to reimburse you if you use more money than the provided credits.
Implementation Details
You will build an automated index selection tool for PostgreSQL. We are providing a fork of BenchBase with different workload scenarios. You are not allowed to modify the application code (i.e., BenchBase).
Setup BenchBase
Follow the instructions from the BenchBase README on how to install the software on your local machine.
Important: You need to use Andy's fork of BenchBase that contains the custom benchmarks for this project:
Workload Capture
For each workload that we will test your tool on, we will provide you with a query log generated by Postgres as the initial input to your tool. Each trace will be a CSV file that contains an entry for every query that the DBMS executes.
To evaluate your tool locally, you will need to capture a workload for yourself. The instructions below will generate a CSV file for the workload. You will need to do this for each workload to get a trace for each of them.
1. Modify the following knobs in your DBMS's postgresql.conf file:

log_destination = 'csvlog'
logging_collector = on
log_directory = 'log'
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_statement = 'all'
This will enable PostgreSQL to record many different types of events to the log file (e.g., client connections). We will strip out this unwanted information during the analysis phase of the project.

2. Restart PostgreSQL and then verify that the DBMS creates the query log file in its data directory (/var/lib/postgresql/14/main):

sudo service postgresql restart
sudo ls -lah /var/lib/postgresql/14/main/log/ | grep csv
Then execute a query in PostgreSQL and confirm that the system captures the SQL. For example, execute this query on the command line using psql:

pavlo=> SELECT 1 + 1;
 ?column?
----------
        2
(1 row)
You can then verify the log file has an entry that corresponds to the query that you invoked:

tail -n 1 /var/lib/postgresql/14/main/log/postgresql-2022-01-30_104723.csv
2022-01-30 10:57:25.630 EST,"pavlo","pavlo",368017,"[local]",61f6b45d.59d91,2,"idle",2022-01-30 10:53:01 EST,3/16,0,LOG,00000,"statement: SELECT 1 + 1;",,,,,,,,,"psql","client backend"
3. Once you have verified that PostgreSQL is correctly capturing queries in its log file, you can then execute a benchmark using BenchBase. Make sure you load the workload's database before you enable query logging. Otherwise the log file will contain the INSERT statements that BenchBase executes to load the database, which will taint your analysis of the workload. We also recommend that you limit the number of terminals in BenchBase to one to reduce the number of queries that execute.
4. Lastly, disable log collection by setting the knob logging_collector = off and restarting PostgreSQL. You can then copy the CSV trace file from PostgreSQL's data directory to your home directory for your analysis tool.
Make sure that you disable PostgreSQL's query log collection when running benchmarks to measure performance.
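Once you have a trace, a small script can strip out the non-query events mentioned above. The sketch below keeps only "statement:" entries; the message-column index is based on the sample csvlog line shown earlier, so verify it against the PostgreSQL csvlog documentation for your version.

```python
import csv
import io

# Index of the "message" column in PostgreSQL's csvlog format
# (log_time, user_name, database_name, ..., sql_state_code, message, ...).
MESSAGE = 13

def extract_statements(csv_text: str) -> list[str]:
    """Return the SQL text of every 'statement:' entry in a csvlog trace,
    skipping connection events and other log noise."""
    statements = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) > MESSAGE and row[MESSAGE].startswith("statement: "):
            statements.append(row[MESSAGE][len("statement: "):])
    return statements

# The sample log entry from the instructions above.
sample = ('2022-01-30 10:57:25.630 EST,"pavlo","pavlo",368017,"[local]",'
          '61f6b45d.59d91,2,"idle",2022-01-30 10:53:01 EST,3/16,0,LOG,00000,'
          '"statement: SELECT 1 + 1;",,,,,,,,,"psql","client backend"\n')
print(extract_statements(sample))  # → ['SELECT 1 + 1;']
```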
Additional Resources:
- How to Get the Best Out of PostgreSQL Logs
- How to Log Queries in PostgreSQL
- How to Start Logging with PostgreSQL
Index Tuning Tool
Your index tuning tool will automatically generate index recommendations for PostgreSQL for a given workload. We are providing you with a testing harness to build your tuning tool. The harness takes in as its input the path to a PostgreSQL workload trace CSV file and connection information to the target database. The harness runs the same workload repeatedly in a loop to allow your selection algorithm to refine its choices.
Although the harness is written in Python, you are not required to implement your selection algorithm in Python. That is, you can have the test harness invoke external programs written in other languages for your computations. You can make any changes to the database that you want as long as they are made through SQL commands, but you are not allowed to modify the BenchBase code. The test harness will report whenever a query in the workload fails to execute successfully.
We recommend that you start with index tuning. There are no restrictions on the number or type of indexes you are allowed to build as long as PostgreSQL supports them (e.g., BTREE, HASH, BRIN). We also encourage you to consider custom knob options per index via the WITH clause (i.e., PostgreSQL Index Storage Parameters). Your tool is also allowed to drop any existing indexes as long as they are not marked as PRIMARY KEY or UNIQUE.
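For example, such commands might look like the following (the table and column names here are illustrative, not from the provided workloads):

```sql
-- B-tree index with a custom fillfactor via the WITH clause
CREATE INDEX idx_item_title ON item USING btree (title) WITH (fillfactor = 90);

-- Hash index for equality-only lookups
CREATE INDEX idx_useracct_email ON useracct USING hash (email);

-- BRIN index for large, naturally ordered data (e.g., timestamps)
CREATE INDEX idx_review_created ON review USING brin (created_at);
```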
Beyond index tuning, you can also modify the table schema or change configuration knobs.
At the beginning of each iteration, your tool will generate a list of DDL changes with the indexes to add to or drop from the database. Each entry in these DDL changes will contain SQL commands that the harness executes in sequential order. The harness will ignore empty entries or entries that start with the # character.
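As a sketch of that filtering rule (this is illustrative helper code, not the actual harness implementation):

```python
def parse_actions(text: str) -> list[str]:
    """Keep only non-empty entries that do not start with '#',
    mirroring the harness's documented filtering behavior."""
    actions = []
    for line in text.splitlines():
        entry = line.strip()
        if entry and not entry.startswith("#"):
            actions.append(entry)
    return actions

# Hypothetical example entries (the index names are made up).
sample = """\
# add an index, then drop an unused one
CREATE INDEX idx_item_title ON item (title);

DROP INDEX idx_unused;
"""
print(parse_actions(sample))
# → ['CREATE INDEX idx_item_title ON item (title);', 'DROP INDEX idx_unused;']
```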
The harness will then perform the following steps to update the DBMS:
- Reset the database to its original state (this removes any previous changes to the database schema and configuration).
- Apply the physical design changes. If any change fails, the iteration is aborted and an exception is thrown.
- Run VACUUM FULL if requested. This operation can take minutes, so you only want to do this if you modify the schema. If you are just changing indexes, then you can skip this step.
- Reset the DBMS's internal metrics (via pg_stat_reset).
- Execute the workload and return the results.
After executing the workload, the harness will then return the goodput of the workload (i.e., the number of successfully completed transactions), along with additional histograms of transaction execution statuses and latencies.
Your tool can also connect to the database and retrieve any additional DBMS metrics that you want (e.g., pg_stat_statements).
Provided Workloads
All the workloads in BenchBase already have the ideal indexes selected for them. Thus, as a sanity check for your tool, you can analyze their workloads and see that you pick the same secondary indexes that already exist in their original DDL files. Note that the first workload listed below is an example of this setup.
Beyond those existing benchmarks, we are also providing you with the following workloads that are designed to challenge your index tuning tool. We will also be using additional workloads to test your implementation.
Workload #1: Starter
This is a variation of the ePinions benchmark but with all the indexes removed and additional text fields added to increase the size of each tuple (e.g., useracct.email, item.description).
Config File: config/postgres/15799_starter_config.xml
Workload #2: Index Jungle
The database for this workload contains many superfluous indexes. The goal is to figure out the minimal set of indexes. The challenge is that some indexes will end up being used by some queries, but the cost of maintaining the additional indexes is high. The workload consists of 50% reads and 50% updates.
Workload #3: Timeseries Range Query
This read-only workload contains a single SELECT query that retrieves a range of data from a timeseries dataset. The existing indexes were created by an ORM.
Config File: config/postgres/15799_timeseries_config.xml
Workload #4: Needle in Haystack
The database contains a table with text scraped from web pages. The application's workload consists of two phases. In the first phase, it executes queries that search for entries in the table that contain certain URLs. It then switches to a second phase, where it bulk loads more records into this table as part of a batch crawl. The workload then repeats these two phases for a fixed number of intervals.
Grading Rubric
Each project submission will be graded in two phases.
Correctness
To be determined.
Performance
To be determined.
Late Policy
25% will be deducted from the final score of the project for every 24-hour period that the assignment is late.
No-penalty extensions will be granted only in extreme circumstances (e.g., medical emergencies). The student is required to provide written documentation from the University health center. Please contact the instructor if you have any questions.
Submission
The grading script relies on the doit framework. After cloning your GitHub repo, we will invoke your tuning tool with the command doit project1. The output of your tuning tool should be two files: actions.sql and config.json. The former specifies the SQL commands to be applied to the DBMS, and the latter indicates any additional work for the test harness to perform. Currently, the only supported configuration option is whether to execute VACUUM FULL; after applying your actions, specified via {"VACUUM": true}.
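For illustration, a submission's output files might look like the following (the index names here are hypothetical). An actions.sql:

```sql
# drop an index the workload never uses, then add one it needs
DROP INDEX idx_unused;
CREATE INDEX idx_useracct_email ON useracct (email);
```

and the matching config.json:

```json
{"VACUUM": false}
```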
A sample dodo.py for use with the doit framework is provided below:
def task_project1():
return {
# A list of actions. This can be bash or Python callables.
"actions": [
'echo "Faking action generation."',
'echo "SELECT 1;" > actions.sql',
'echo "SELECT 2;" >> actions.sql',
'echo \'{"VACUUM": true}\' > config.json',
],
# Always rerun this task.
"uptodate": [False],
}
Collaboration Policy
- Every student has to work individually on this assignment.
- Students are allowed to discuss high-level details about the project with others.
- Students are not allowed to copy the contents of a white-board after a group meeting with other students.
- Students are not allowed to copy the solutions from another colleague.
WARNING: All of the code for this project must be your own. You may not copy source code from other students or other sources that you find on the web. Plagiarism will not be tolerated. See CMU's Policy on Academic Integrity for additional information.