Tuesday, June 30, 2015

Some Database Indexing Strategies

If you program anything at all, you probably use some sort of database for storing your data. When it comes to performance, index usage is key; not many people will argue that point. There are, however, many different ways to create indexes. In this post, I investigate 3 different indexing strategies to establish some guidelines. I'll be using the excellent PostgreSQL. Most of what I'll discuss is relevant to any system that uses indexes (even the trendy NoSQL systems) but you should test to ensure the behavior is identical before committing to a particular strategy.

Starting with some basic information, I'll be investigating 3 indexing strategies against the following table:

CREATE TABLE test_table (
  s1_c1 BIGINT NOT NULL,
  s1_c2 BIGINT NOT NULL,
  s1_c3 BIGINT NOT NULL,
  s1_c4 BIGINT NOT NULL,
  s1_c5 BIGINT NOT NULL,
  s2_c1 BIGINT NOT NULL,
  s2_c2 BIGINT NOT NULL,
  s2_c3 BIGINT NOT NULL,
  s2_c4 BIGINT NOT NULL,
  s2_c5 BIGINT NOT NULL,
  s3_c1 BIGINT NOT NULL,
  s3_c2 BIGINT NOT NULL,
  s3_c3 BIGINT NOT NULL,
  s3_c4 BIGINT NOT NULL,
  s3_c5 BIGINT NOT NULL
);

The names are significant: each column that starts with s1_ is used to test strategy 1, s2_ for strategy 2, etc. Once the table was created, I populated it with approximately eleventy jillion rows of random data. However, the data is created in such a way that s1_c1 == s2_c1 == s3_c1 and so on for each of the _c* columns. This ensures equal data distributions and fairness across the tests.
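
If you want to build a similar table to experiment with, something along these lines will do the job; the row count and value range here are just placeholders, so pick numbers that make sense for your hardware:

-- Populate test_table so the s1_*, s2_*, and s3_* column groups contain
-- identical values row by row. Each random value is computed once per row
-- in the subquery and reused for all three groups. The 10 million rows and
-- 0-999 value range are illustrative placeholders, not the real test data.
INSERT INTO test_table
SELECT v1, v2, v3, v4, v5,  -- s1_c1 .. s1_c5
       v1, v2, v3, v4, v5,  -- s2_c1 .. s2_c5
       v1, v2, v3, v4, v5   -- s3_c1 .. s3_c5
  FROM (
    SELECT (random() * 999)::BIGINT AS v1,
           (random() * 999)::BIGINT AS v2,
           (random() * 999)::BIGINT AS v3,
           (random() * 999)::BIGINT AS v4,
           (random() * 999)::BIGINT AS v5
      FROM generate_series(1, 10000000)
  ) AS randoms;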

I'll be using a query of the following form for tests:

SELECT s?_c5
  FROM test_table
  WHERE s?_c1 = ?
    AND s?_c2 = ?
    AND s?_c3 = ?
    AND s?_c4 = ?;

This roughly simulates a common situation in a database, where you're selecting data based on a complex set of criteria. A far more common situation is where you're selecting all the columns of a table based on a single primary key column, but that case really only has a single indexing strategy available, and is therefore not interesting to this post.

If you're wondering why I didn't tell you exactly how many rows I populated the table with, it's because the impact of that size will vary depending on your hardware: one million rows can be a significant amount of data on one computer and a trivial amount on another. Suffice it to say that the number of rows is large enough that, on my testing computer, the test query takes an average of 225 seconds to run when no indexes are involved. Any manager I've ever worked for would complain about a query that slow, so we'll need to add some indexes.
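
Before we do, a quick note on measurement: if you want to reproduce these numbers yourself, psql's \timing command or EXPLAIN ANALYZE will both report how long a query takes. Here's a sketch using the strategy 1 form of the query; the literal values are arbitrary:

-- In psql, \timing prints the elapsed time of every statement.
\timing on

-- EXPLAIN ANALYZE actually runs the query and reports the chosen plan
-- along with the real execution time.
EXPLAIN ANALYZE
SELECT s1_c5
  FROM test_table
  WHERE s1_c1 = 936
    AND s1_c2 = 196
    AND s1_c3 = 49
    AND s1_c4 = 94;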

Strategy 1: Individual Indexes

The first strategy to investigate is placing an individual index on each of the columns specified in the WHERE clause, like this:

CREATE INDEX s1_i1 ON test_table(s1_c1);
CREATE INDEX s1_i2 ON test_table(s1_c2);
CREATE INDEX s1_i3 ON test_table(s1_c3);
CREATE INDEX s1_i4 ON test_table(s1_c4);

This helps our test query quite a bit, bringing the average time down to 21 seconds, an improvement of over 10x.

Strategy 2: Composite Index

PostgreSQL (like most database systems) allows you to create a single index composed of multiple columns, called a composite index. Creating one to help our test query looks like this:

CREATE INDEX s2_i1 ON test_table(s2_c1, s2_c2, s2_c3, s2_c4);

This improves performance even further, bringing the average time down to about 50ms (yes, that's milliseconds, or 0.05s).

To understand why the composite index is so much faster, it helps to look at Postgres' EXPLAIN output. For a query using indexing strategy 1, we get this explain:

Bitmap Heap Scan on test_table (cost=3304.35..3755.22 rows=1 width=8)
  Recheck Cond: ((s1_c1 = 936) AND (s1_c2 = 196))
  Filter: ((s1_c3 = 49) AND (s1_c4 = 94))
  -> BitmapAnd (cost=3304.35..3304.35 rows=113 width=0)
    -> Bitmap Index Scan on s1_i1 (cost=0.00..1082.30 rows=58365 width=0)
      Index Cond: (s1_c1 = 936)
    -> Bitmap Index Scan on s1_i2 (cost=0.00..2221.80 rows=120164 width=0)
      Index Cond: (s1_c2 = 196)

What this means is that PostgreSQL scans two of the four available indexes, generating a bitmap of matching rows from each; it then ANDs the two bitmaps together to get the set of rows that satisfy both conditions, and finally visits those rows to apply the remaining filters and produce the final result. This may seem overly complex, but whether it's the most efficient plan depends on how the data is distributed in each column, which Postgres estimates from the statistics it gathers when the table is ANALYZEd. Other database systems have similar machinery for finding the fastest plan, but when multiple single-column indexes are involved, they'll all need multiple steps to narrow the result down to its final form.

Now, with a query using the composite index, PostgreSQL is able to find matching rows in a single step:

Index Scan using s2_i1 on test_table (cost=0.56..8.59 rows=1 width=8)
  Index Cond: ((s2_c1 = 469) AND (s2_c2 = 326) AND (s2_c3 = 132) AND (s2_c4 = 123))

While a single operation isn't always faster than multiple operations, in this particular case it's much faster. This degree of speedup is fairly typical and is what makes composite indexes so powerful.

Composite indexes have some unfortunate limitations, however. The index data is stored in the column order you specify in the CREATE INDEX statement, and it can only be used effectively from left to right. For example, if we want to query the table WHERE s2_c2 = ?, the database can't make effective use of the index, and we're back to waiting 200+ seconds for a complete table scan. With strategy 1, the database could use index s1_i2 and we'd still get decent performance.
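
For example, with only the composite index in place, the first query below leaves the planner nothing useful to work with, while the strategy 1 version of the same query can still lean on s1_i2 (the literal value is arbitrary):

-- s2_c2 is not the leading column of s2_i1, so this query can't use the
-- composite index efficiently and falls back to scanning the whole table.
SELECT s2_c5
  FROM test_table
  WHERE s2_c2 = 196;

-- The strategy 1 equivalent can still use the single-column index s1_i2.
SELECT s1_c5
  FROM test_table
  WHERE s1_c2 = 196;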

We could create additional indexes so that some columns end up indexed more than once. In many cases that's a good strategy, but remember the two downsides to indexes: they take up disk space and they slow down inserts and updates. How much space? Well, each of the indexes from strategy 1 uses about 1.4G, for a total of about 5.6G, while the composite index for strategy 2 uses a little over 3G. How much indexes slow down inserts and updates is a complicated question, and a good subject for a future post. For now, suffice it to say that you should carefully test the impact before adding lots of indexes to your production servers.
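
Incidentally, if you want to see these numbers for your own tables, PostgreSQL will report index sizes directly; something like this works:

-- Show the on-disk size of each index on test_table.
SELECT indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid::regclass)) AS index_size
  FROM pg_stat_user_indexes
  WHERE relname = 'test_table';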

Strategy 3: Covering Index

When even a composite index isn't fast enough, I can go one step further and make what's called a covering index:

CREATE INDEX s3_i1 ON test_table(s3_c1, s3_c2, s3_c3, s3_c4, s3_c5);

This speeds up our query to about 0.5ms (yes, that's ms ... meaning 0.0005 seconds). The EXPLAIN output shows that Postgres is now doing an "Index Only Scan", which is very similar to the index scan in strategy 2, with one important difference: in a normal index scan, Postgres uses the index to find the rows that match the WHERE condition, but once it's found them, it still has to visit the table itself to fetch the requested data (column s?_c5 in our examples).

With a covering index, that column is also included in the index data itself, so Postgres can skip the step of fetching the data from the table, which (in this case) improves performance by about 100x.
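
For reference, the plan looks roughly like this (the cost and row figures below are illustrative, not copied from my test run):

Index Only Scan using s3_i1 on test_table (cost=0.56..8.59 rows=1 width=8)
  Index Cond: ((s3_c1 = 469) AND (s3_c2 = 326) AND (s3_c3 = 132) AND (s3_c4 = 123))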

Covering indexes have all the same disadvantages as non-covering composite indexes, except that they're even less flexible. If we change our test query to return a column that isn't in the index -- say, s1_c5 -- the index is no longer covering, and we either have to live with the performance hit or make a new index that covers s1_c5 too. And since a covering index contains additional columns, it's larger on disk and more likely to impact insert or update performance.
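
As a quick sketch, here's what that less fortunate version of the query might look like (literal values arbitrary again):

-- s1_c5 is not stored in s3_i1, so Postgres has to visit the table rows
-- anyway and the index-only scan advantage disappears.
SELECT s3_c5, s1_c5
  FROM test_table
  WHERE s3_c1 = 469
    AND s3_c2 = 326
    AND s3_c3 = 132
    AND s3_c4 = 123;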

Conclusion?

Indexing is too complicated to address in a single post, so I've narrowed the focus here to the benefits of three common strategies. If you don't have tens of millions of rows in your database, you might find that almost any indexing strategy yields good results, but if you're hard-pressed to get the response time you want, the information in this post should help you plan your approach. Of course, the differences in data from one application to another make blanket statements difficult, so you should test different approaches and see what works best for you.

In general, however, covering indexes are usually a good approach when you have a small number of queries that need to be very fast and can be serviced by a covering index, and the table isn't seeing a lot of insert and update activity -- or the speed of those modifications matters less than the speed of the SELECT queries.

Individual indexes are best when you have a wide variety of queries to service and it's difficult to create a few covering indexes that handle them all. Composite (but non-covering) indexes fall somewhere in between.

For complicated situations, a combination of these techniques will probably be required.

In any case where INSERT/UPDATE/DELETE performance is critical, you should carefully test to ensure that each index you create is worthwhile, since each index slows down modification queries a bit. The actual scale of that impact will be the subject of another post.