Posted on May 20, 18:55:21 CET 2016
This is the third (and last) summary of a research paper by Daniel Lemire and me about the use of 32bit integer compression for keys in a B+-tree index. We evaluate a wide range of compression schemes, some of which leverage the SIMD instructions available in modern processors. To our knowledge, such a comparison has never been attempted before.
upscaledb is a parameterized B+tree, and the memory layout and algorithms of the B+tree nodes differ depending on the parameters the user chose when the database was created. If integer compression is enabled, then a specific "Compressed Integer Blocks" implementation is selected, which is further parameterized for each of the supported codecs.
However, adding integer compression (or ANY key compression) does not just affect the B+tree nodes. Other more generic algorithms (those that are independent from the parameterization) also have to be changed, e.g. how you insert or delete keys from the B+tree. In this article I want to describe those changes and how upscaledb's B+tree differs from the textbook algorithm.
When we started adding key compression to upscaledb, we encountered a few major problems that we had to solve.
A textbook B+tree (or Btree, or any other derivate) defines its capacity as the number of keys that a node can store. Usually, a constant "k" is defined, and a node must store between k and 2k-1 items (otherwise it's underfilled and will be merged, or it overflows and will be split). (See Wikipedia for more information.)
But as soon as you add compression, there is no more "k" that you can use to describe the capacity. If you have a set of compressed keys, and you add a new one, you do not know in advance whether the key will fit into the available space or not. This depends on the codec, and on the "compressability" of the keys. When inserting a new key in a node with compressed keys, we found it therefore best to TRY the insertion. The codec would then decide whether the currently available space would be sufficient to store the new key, or not. If not, then the caller will attempt to reorganize the node in order to free up space. Only if this is impossible then the node is split.
This affects not just the B+tree node, but also the generic "insert" function (which is independent from the actual node format). It will simply attempt to insert the key, and split the node if insertion fails (and not if an arbitrary 2k-1 limit is reached).
If you recall the codecs from part 1, we implemented one codec which is not "delete stable": BP128 (SimdComp). It compresses integers by bit-packing their deltas. Imagine the following sequence:
1, 2, 3, 4, 5, 6, 7, 8 (this is typical for an auto-incremented primary key). After encoding, all deltas can be stored in a single byte with all bits set (
11111111). Now delete
4 and the delta of 3 and 5 will jump to 2. Each delta now requires 2 bits, and our storage grows to two bytes (the last two bits are unused):
01010110 01010100. In effect, our storage doubled although we deleted a key!
As a consequence, a B+tree node can run out of space if a key is deleted, and will require a node split.
This is such a weird scenario and so contrary to the textbook implementation that our first reaction was to avoid all codecs which behaved this way. Indeed, other researchers (like IBM for DB2) also had "delete stability" as a codec requirement. But after a while we discovered that it's possible to handle such cases - and changing the generic delete algorithm was fairly simple because it was very similar to the insert algorithm (see below).
Even before adding compression, upscaledb did not always follow the standard B+tree algorithm. Its insertion split the nodes on the way "down", when descending the tree, and not on the way "up", when returning from the recursion. This approach was first described by Guibas and Sedgewick. It's a much simpler approach. And deleting keys works similar: nodes are merged when "going down", not when going back up. Keys are never shifted. Underfilled nodes are accepted, and nodes are only merged when they are nearly empty, and if merging the node only has a local effect. If a merge would have to be propagated to more levels than just the parent, then the node will be left empty. This sounds wasteful, but is neglectable in practice.
The insert and delete operations are now very similar. In both cases the tree is descended, and internal nodes are merged or split if necessary. Only when the leaf is reached the operations diverge, and the new key is either inserted, or the old key is deleted.
If a split is necessary, a new node is allocated and the parent is updated. The split is never propagated because the parent node definitely has space for one more key (we made sure of that when descending the tree). The split only has a local effect.
It was definitely worth investing additional time to support BP128 (SimdComp) as a codec. BP128 offers the best compression ratio, especially for dense key ranges like auto-incremented primary keys or dense time stamps. Especially for analytical queries with upscaledb's query interface UQI.