Overview
The first programming project will provide hands-on experience with optimizing and tuning a database management system (DBMS). Your assignment is to build an automatic index tuning tool for PostgreSQL. The goal of the project is to familiarize yourself with the basic concepts of index selection for databases:
- Capturing workload traces from a DBMS.
- Analyzing queries and extracting salient aspects of their contents (e.g., columns referenced in the WHERE clause).
- Enumerating candidate indexes and estimating their cost/benefit.
- Applying changes and measuring the workload improvement.
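To make these steps concrete, here is a minimal sketch of the first two: pulling filter columns out of a workload trace and turning the most common ones into single-column index candidates. The regex-based parsing and the table/column names are purely illustrative; a real tool should use a proper SQL parser.

```python
import re
from collections import Counter

def where_columns(query: str) -> list[str]:
    """Naively pull column names out of a query's WHERE clause.
    A real tool should use a proper SQL parser instead of a regex."""
    match = re.search(r"\bWHERE\b(.*?)(?:\bGROUP\b|\bORDER\b|\bLIMIT\b|;|$)",
                      query, re.IGNORECASE | re.DOTALL)
    if not match:
        return []
    # Grab identifiers that appear on the left side of a comparison.
    return re.findall(r"([A-Za-z_][A-Za-z0-9_.]*)\s*(?:=|<|>|LIKE)",
                      match.group(1))

def candidate_indexes(trace: list[str], table: str) -> list[str]:
    """Rank single-column candidate indexes by how often each column
    appears as a filter across the workload trace."""
    counts = Counter(col for q in trace for col in where_columns(q))
    return [f"CREATE INDEX idx_{table}_{col} ON {table} ({col});"
            for col, _ in counts.most_common()]

trace = [
    "SELECT * FROM item WHERE i_id = 5;",
    "SELECT * FROM item WHERE i_id = 9 AND title LIKE 'a%';",
]
print(candidate_indexes(trace, "item"))
# → ['CREATE INDEX idx_item_i_id ON item (i_id);',
#    'CREATE INDEX idx_item_title ON item (title);']
```

A real tool would follow this with cost estimation (e.g., via the DBMS's query planner) before applying any of the candidates.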
This is a single-person project that will be completed individually (i.e., no groups).
- Release Date: Jan 22, 2020
- Due Date: Mar 16, 2020 @ 11:59pm
Environment Setup
We first provide you with instructions on how to set up your local development environment. To ensure that you have complete control over the DBMS and environment, we advise you to not use CMU's shared Andrew machines for this project.
Development Environment
You should start with a fresh installation of PostgreSQL (v14). It is available on every (reasonable) OS. For these instructions, we will assume that you are running on Ubuntu. You will want to use the official PostgreSQL package repository to ensure that you get the latest version of the DBMS.
sudo sh -c 'echo "deb [arch=amd64] http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
sudo apt-get update
sudo apt-get -y install postgresql-14
Next, you need to update the default knob configuration using a rule-based tuner like PGConfig or PGTune. Append the recommended knob settings to the file /etc/postgresql/14/main/postgresql.conf and restart the DBMS:
sudo service postgresql restart
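The exact settings depend on your hardware. As an illustration only, a rule-based tuner targeting a 16GB machine might suggest values along these lines (treat these numbers as placeholders, not recommendations):

```
shared_buffers = 4GB
effective_cache_size = 12GB
maintenance_work_mem = 1GB
work_mem = 64MB
random_page_cost = 1.1
effective_io_concurrency = 200
```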
Benchmark Environment
Because you will need to evaluate your tool on the same hardware as other students in the course, we are providing every student with credits for Amazon AWS. You will be able to deploy an instance, install dependencies, and run your experiments.
You will want to use the m5d.xlarge instance type (4 vCPUs, 16GB of RAM). We picked this instance size because its limited amount of RAM means that the DBMS will not be able to keep the entire database (tables, indexes) in memory for each workload. Therefore, you should see a noticeable difference in the DBMS's performance with the right configuration.
You will need to make sure that you monitor your credits so that you do not run out of money. Use spot instances to reduce your costs. Set up billing alerts to make sure that you are not charged for resources that you forgot to disable.
WARNING: We will not be able to reimburse you if you use more money than the provided credits.
Implementation Details
You will build an automated index selection tool for PostgreSQL. We are providing a fork of BenchBase with different workload scenarios. You are not allowed to modify the application code (i.e., BenchBase).
Setup BenchBase
Follow the instructions from the BenchBase README on how to install the software on your local machine.
Important: You need to use Andy's fork of BenchBase that contains the custom benchmarks for this project:
Workload Capture
For each workload that we will test your tool on, we will provide you with a query log generated by Postgres as the initial input to your tool. Each trace will be a CSV file that contains an entry for every query that the DBMS executes.
To evaluate your tool locally, you will need to capture a workload for yourself. The instructions below will generate a CSV file for the workload. You will need to do this for each workload to get a trace for each of them.
1. Modify the following knobs in your DBMS's postgresql.conf file:

log_destination = 'csvlog'
logging_collector = on
log_directory = 'log'
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_statement = 'all'
This will enable PostgreSQL to record many different types of events to the log file (e.g., client connections). We will strip out this unwanted information during the analysis phase of the project.

2. Restart PostgreSQL and then verify that the DBMS creates the query log file in its data directory (/var/lib/postgresql/14/main):

sudo service postgresql restart
sudo ls -lah /var/lib/postgresql/14/main/log/ | grep csv
Then execute a query in PostgreSQL and confirm that the system captures the SQL. For example, execute this query on the command line using psql:

pavlo=> SELECT 1 + 1;
 ?column?
----------
        2
(1 row)
You can then verify the log file has an entry that corresponds to the query that you invoked:

tail -n 1 /var/lib/postgresql/14/main/log/postgresql-2022-01-30_104723.csv
2022-01-30 10:57:25.630 EST,"pavlo","pavlo",368017,"[local]",61f6b45d.59d91,2,"idle",2022-01-30 10:53:01 EST,3/16,0,LOG,00000,"statement: SELECT 1 + 1;",,,,,,,,,"psql","client backend"
3. Once you have verified that PostgreSQL is correctly capturing queries in its log file, you can then execute a benchmark using BenchBase. Make sure you load the workload's database before you enable query logging. Otherwise the log file will contain the INSERT statements that BenchBase executes to load the database, which will taint your analysis of the workload. We also recommend that you limit the number of terminals in BenchBase to one to reduce the number of queries that execute.
4. Lastly, disable log collection by setting the knob logging_collector = off and restarting PostgreSQL. You can then copy the CSV trace file from PostgreSQL's data directory to your home directory for your analysis tool.
Make sure that you disable PostgreSQL's query log collection when running benchmarks to measure performance.
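Once you have a trace, a small script can strip out the non-query events mentioned above. The sketch below keeps only "statement:" entries; the message-column index is based on the sample csvlog line shown earlier, so verify it against the PostgreSQL csvlog documentation for your version.

```python
import csv
import io

# Index of the "message" column in PostgreSQL's csvlog format
# (log_time, user_name, database_name, ..., sql_state_code, message, ...).
MESSAGE = 13

def extract_statements(csv_text: str) -> list[str]:
    """Return the SQL text of every 'statement:' entry in a csvlog trace,
    skipping connection events and other log noise."""
    statements = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) > MESSAGE and row[MESSAGE].startswith("statement: "):
            statements.append(row[MESSAGE][len("statement: "):])
    return statements

# The sample log entry from the instructions above.
sample = ('2022-01-30 10:57:25.630 EST,"pavlo","pavlo",368017,"[local]",'
          '61f6b45d.59d91,2,"idle",2022-01-30 10:53:01 EST,3/16,0,LOG,00000,'
          '"statement: SELECT 1 + 1;",,,,,,,,,"psql","client backend"\n')
print(extract_statements(sample))  # → ['SELECT 1 + 1;']
```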
Additional Resources:
- How to Get the Best Out of PostgreSQL Logs
- How to Log Queries in PostgreSQL
- How to Start Logging with PostgreSQL
Index Tuning Tool
Your index tuning tool will automatically generate index recommendations for PostgreSQL for a given workload. We are providing you with a testing harness to build your tuning tool. The harness takes in as its input the path to a PostgreSQL workload trace CSV file and connection information to the target database. The harness runs the same workload repeatedly in a loop to allow your selection algorithm to refine its choices.
Although the harness is written in Python, you are not required to implement your selection algorithm in Python. That is, you can have the test harness invoke external programs written in other languages for your computations. You can make any changes to the database that you want as long as they are made through SQL commands, but you are not allowed to modify the BenchBase code. The test harness will report whenever a query in the workload fails to execute successfully.
We recommend that you start with index tuning. There are no restrictions on the number or type of indexes you are allowed to build as long as PostgreSQL supports them (e.g., BTREE, HASH, BRIN). We also encourage you to consider custom knob options per index via the WITH clause (i.e., PostgreSQL Index Storage Parameters). Your tool is also allowed to drop any existing indexes as long as they are not marked as PRIMARY KEY or UNIQUE.
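For example, such commands might look like the following (the table and column names here are illustrative, not from the provided workloads):

```sql
-- B-tree index with a custom fillfactor via the WITH clause
CREATE INDEX idx_item_title ON item USING btree (title) WITH (fillfactor = 90);

-- Hash index for equality-only lookups
CREATE INDEX idx_useracct_email ON useracct USING hash (email);

-- BRIN index for large, naturally ordered data (e.g., timestamps)
CREATE INDEX idx_review_created ON review USING brin (created_at);
```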
Beyond index tuning, you can also modify the table schema or change configuration knobs.
At the beginning of each iteration, your tool will generate a list of DDL changes with the indexes to add to or drop from the database. Each entry in these DDL changes will contain SQL commands that the harness executes in sequential order. The harness will ignore empty entries or entries that start with the # character.
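As a sketch of that filtering rule (this is illustrative helper code, not the actual harness implementation):

```python
def parse_actions(text: str) -> list[str]:
    """Keep only non-empty entries that do not start with '#',
    mirroring the harness's documented filtering behavior."""
    actions = []
    for line in text.splitlines():
        entry = line.strip()
        if entry and not entry.startswith("#"):
            actions.append(entry)
    return actions

# Hypothetical example entries (the index names are made up).
sample = """\
# add an index, then drop an unused one
CREATE INDEX idx_item_title ON item (title);

DROP INDEX idx_unused;
"""
print(parse_actions(sample))
# → ['CREATE INDEX idx_item_title ON item (title);', 'DROP INDEX idx_unused;']
```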
The harness will then perform the following steps to update the DBMS:
- Reset the database to its original state (this removes any previous changes to the database schema and configuration).
- Apply the physical design changes. If any change fails, the iteration is aborted and an exception is thrown.
- Run VACUUM FULL if requested. This operation can take minutes, so you only want to do this if you modify the schema. If you are just changing indexes, then you can skip this step.
- Reset the DBMS's internal metrics (via pg_stat_reset).
- Execute the workload and return the results.
After executing the workload, the harness will then return the goodput of the workload (i.e., the number of successfully completed transactions), along with additional histograms of transaction execution statuses and latencies.
Your tool can also connect to the database and retrieve any additional DBMS metrics that you want (e.g., pg_stat_statements).
Provided Workloads
All the workloads in BenchBase already have the ideal indexes selected for them. Thus, as a sanity check for your tool, you can analyze their workloads and see that you pick the same secondary indexes that already exist in their original DDL files. Note that the first workload listed below is an example of this setup.
Beyond those existing benchmarks, we are also providing you with the following workloads that are designed to challenge your index tuning tool. We will also be using additional workloads to test your implementation.
Workload #1: Starter
This is a variation of the ePinions benchmark but with all the indexes removed and additional text fields added to increase the size of each tuple (e.g., useracct.email, item.description).
Config File: config/postgres/15799_starter_config.xml
Workload #2: Index Jungle
The database for this workload contains many superfluous indexes. The goal is to figure out the minimal set of indexes. The challenge is that some indexes will end up being used by some queries, but the cost of maintaining the additional indexes is high. The workload consists of 50% reads and 50% updates.
Workload #3: Timeseries Range Query
This read-only workload contains a single SELECT query that retrieves a range of data from a timeseries dataset. The existing indexes were created by an ORM.
Config File: config/postgres/15799_timeseries_config.xml
Workload #4: Needle in Haystack
The database contains a table with text scraped from web pages. The application's workload consists of two phases. In the first phase, it executes queries that search for entries in the table that contain certain URLs. It then switches to a second phase, where it bulk loads more records into this table as part of a batch crawl. The workload then repeats these two phases for a fixed number of intervals.
Grading Rubric
Each project submission will be graded in two phases.
Correctness
To be determined.
Performance
To be determined.
Late Policy
25% will be deducted from the final score of the project for every 24-hour period that the assignment is late.
No-penalty extensions will be granted only in extreme circumstances (e.g., medical emergencies). The student is required to provide written documentation from the University health center. Please contact the instructor if you have any questions.
Submission
The grading script relies on the doit framework. After cloning your GitHub repo, we will invoke your tuning tool with the command doit project1. The output of your tuning tool should be two files: actions.sql and config.json. The former specifies the SQL commands to be applied to the DBMS, and the latter indicates any additional work for the test harness to perform. Currently, the only supported configuration option is whether to execute VACUUM FULL; after applying your actions, specified via {"VACUUM": true}.
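For illustration, a submission's output files might look like the following (the index names here are hypothetical). An actions.sql:

```sql
# drop an index the workload never uses, then add one it needs
DROP INDEX idx_unused;
CREATE INDEX idx_useracct_email ON useracct (email);
```

and the matching config.json:

```json
{"VACUUM": false}
```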
A sample dodo.py for use with the doit framework is provided below:
def task_project1():
return {
# A list of actions. This can be bash or Python callables.
"actions": [
'echo "Faking action generation."',
'echo "SELECT 1;" > actions.sql',
'echo "SELECT 2;" >> actions.sql',
'echo \'{"VACUUM": true}\' > config.json',
],
# Always rerun this task.
"uptodate": [False],
}
Collaboration Policy
- Every student has to work individually on this assignment.
- Students are allowed to discuss high-level details about the project with others.
- Students are not allowed to copy the contents of a white-board after a group meeting with other students.
- Students are not allowed to copy the solutions from another colleague.
WARNING: All of the code for this project must be your own. You may not copy source code from other students or other sources that you find on the web. Plagiarism will not be tolerated. See CMU's Policy on Academic Integrity for additional information.