Understanding the Nuance of Compaction in Apache Cassandra

Mon, Jan 1, 0001
6-minute read

Compaction in Apache Cassandra isn’t usually the first (or second) topic that gets discussed when it’s time to start optimizing your system. Most of the time we focus on data modeling and query patterns. An incorrect data model can turn a single query into hundreds of queries, resulting in increased latency, decreased throughput, and missed SLAs. If you’re using spinning disks the problem is magnified by time consuming disk seeks.

That said, compaction is also an incredibly important process. Understanding how a compaction strategy complements your data model can have a significant impact on your application’s performance. For instance, in Alex Dejanovski’s post on TimeWindowCompactionStrategy, he shows how a simple change to the compaction strategy can significantly decrease disk usage. As he demonstrated, a cluster mainly concerned with high rates of TTL’ed time series data can achieve major space savings and significantly improved performance. Knowing how each compaction strategy works in detail will help you make the right choice for your data model and access patterns. Likewise, knowing the nuance of compaction in general can help you understand why the system isn’t behaving as you’d expect when there’s a problem. In this post we’ll discuss some of the nuance of compaction, which will help you better know your database.

First, let’s do a quick recap of how data is written. When a write comes in, it’s written to the commit log, and to the active Memtable for the table. Memtables are later flushed to disk, and that file is called an SSTable. SSTables are immutable - the data contained is never updated or deleted in place. Instead of updating the data in place, we write our changes to a new Memtable, and then a new SSTable. The compaction process merges the SSTables together. If there was an update or a delete, the newest value for the field is kept by compaction and is written to the new SSTable, and the older versions are discarded.

It’s helpful to understand how compactions are run, and how they are kicked off in the first place. Each minor compaction is started by the org.apache.cassandra.db.compaction.CompactionManager#submitBackground() method on the CompactionManager singleton instance. This can be the result of a few different events. Most commonly, we’ll see a compaction start as the result of a Memtable being written to disk.

Oddly enough the manager doesn’t pick the SStables before the thread starts working. The SSTables are chosen by the Compaction Strategy after the Thread has already started. This is where the nuance begins. Due to the fact that SSTables aren’t chosen till the compaction runs reveals something interesting: compaction is not managed as a queue. This seems a little non-intuitive. When you nodetool compactionstats, you’ll see something like:

pending tasks: 25

It’s very common to think of pending tasks is some backlog of compactions that are waiting to happen. Technically, the number you see for pending tasks is an estimation of the work that needs to be done. The total is a summation of all the estimates for every table, and it’s specific to each compaction strategy. In Cassandra 3.2, the estimate is printed out per table rather than simply in aggregate. This behavior is far more useful the the number in total, from a diagnostic standpoint. See CASSANDRA-10718.

Running low on disk space has always been problematic with Cassandra. Databases obviously need disk space, and Cassandra is no different. With compaction, we need enough space while we’re performing a compaction for the original files as well as the new ones. This can add up to a significant amount on very dense nodes. This check happens before a compaction starts and is relatively simple. It looks at the the total space used by all sstables, adds it up, and checks it against available disk space. If there’s enough space at that moment, the compaction can begin. For better or worse, it doesn’t take into account space that will be used by other compactions that are already running. This can sometimes cause 2 compactions to compete over disk space.

As a simple example, if we only had a 10GB drive, with two 4GB files, the check would fail and we would not be able to compact the two together. When a compaction fails in this manner, it throws an exception and silently fails. When compactions start failing in this manner, we don’t see pending tasks decrease. As SSTables are written to disk, we see compaction tasks continue to fail and pending tasks increase.

In a more complicated example, lets suppose we have a little over 100GB of total space, and we have seven SStables that are 10GB each. If we try to perform a compaction across all of them, we won’t have enough space to finish the compaction. The CompactionTask would require 70GB of free space when we only have 30GB left, so we reasonably do not start a new task with all the SSTables. As a workaround we can logically remove SStables from the task until we know we have enough space to complete the compaction. Doing that would allow us to perform smaller compactions, freeing up the space until we’ve reclaimed enough disk space to handle the larger ones. In the Cassandra 3.0 branch that logic is found in org.apache.cassandra.db.compaction.CompactionTask#checkAvailableDiskSpace(). Unfortunately, that has only just started working correctly in versions 2.2.9, 3.0.11, 3.10, and 4.0.

In versions prior to these, if you’re running low on disk space, compactions will start and immediately fail, silently. This is due to a miscalculation when checking how much space will be used by the compaction task after SSTables have been removed. This bug can be confusing, since it will appear that compactions have simply stopped running altogether. The nuance is that they’re running, but immediately aborting, not telling the operator. To learn more, see CASSANDRA-12979. Additionally, we contributed CASSANDRA-13015 to expose JMX metrics around failing compactions as well as compactions which have had to drop SStables due to limited disk space. This will help operators know exactly how much impact their disks have had on Cassandra’s ability to perform compaction.

The only way to work around the issue in unfixed versions is to force user defined compactions on tables which you know will free up space. This is a time consuming and tedious process, and we strongly recommend updating to a version of Cassandra with the CASSANDRA-12979 fix if you’re going to run dense nodes with close to full disks.

Hopefully at this point you can see how some design choices of compaction can have an impact on a production server. In later posts, we’ll explore some other interesting design choices of compaction, such as how repair affects things, vnodes, disk failure and recovery.

If you found this post helpful, please consider sharing to your network. I'm also available to help you be successful with your distributed systems! Please reach out if you're interested in working with me, and I'll be happy to schedule a free one-hour consultation.

cassandra