You're Already Eventually Consistent
People new to Apache Cassandra are often concerned about the phrase “eventual consistency.” It’s one of those things that seems so foreign, especially if you’re coming from a relational database. When I’m with my RDBMS I get wrapped in the sweet cocoon of ACID transactions!
Is the entire system really safe though? Are we perfectly ACID throughout our entire application? Probably not. Let’s see how it breaks down and where the tradeoffs are.
The first way we start to scale is via replication, so let’s just consider how that works. When we write data, what ideally happens? In a dream world, every replica would immediately have a copy and be completely in sync, right? Unfortunately the world isn’t perfect, and that plan is a massive pile of fail.
In the real world, servers fail, we hit garbage collection pauses, we’ve got noisy neighbors (in the cloud), power failures, drive failures, and so on. If we had to wait for a broken replica to handle the write we could be sitting around for a while. Any server in our data center can now be a potential bottleneck, resulting in wildly unpredictable performance. Yikes. In other words, trying to be perfectly consistent across the entire system opens us up to all sorts of scenarios where we will have downtime. The more servers we have, the greater the risk of failure and downtime. We haven’t even considered the possibility that our network might not be reliable. Don’t think this is a real problem? It is.
In any sane system, we’re going to want to use asynchronous replication. Asynchronous replication allows us to accept a write on our primary server and immediately return success to the client while the data replicates in the background.
Replication Lag
Once we’ve accepted that we’re going to do our data replication asynchronously, it’s time to ask ourselves what the impact of this will be. The first thing you’ll notice is that you can’t write data to the primary and immediately read it back from a replica; you could get stale data. This is what’s known as replication lag.
I’m most familiar with replication lag in MySQL, which I’ve used in production since 2002. However, it’s a reality in any database where you are restricted by the speed of light, AKA every database on the planet. How far behind your replicas are is a function of the speed of writes, the read load on the replicas, and the physical distance between the machines. You’ll have to deal with replication lag whether you’re in a single DC or in multiple data centers, but it’s greatly exacerbated with multiple data centers.
For an over-the-top example of the impact of light speed, let’s suppose you had a data center on Earth and one on Mars, and we use a laser beam to transmit data. Replication should be pretty quick, right? Wrong. The average time for light to travel from Earth to Mars is about 12 minutes. Since replication is typically TCP-based (for a good reason), you’d be dealing with a replication lag of some multiple of 12 minutes depending on how many round trips are required. Not very useful.
Here on Earth we’re still limited by the speed of light - the minimum possible ping from New York to Paris is about 40ms. Read more on the theoretical vs. real-world speed of light for a ping to get a deeper understanding of how our real-world multi-data-center deployments are affected by this limitation.
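If you want to sanity check those numbers, the back-of-envelope math is simple. The distances below are rough averages I’m assuming for the example:

```python
# Rough light-travel times. Both distances are approximate.
speed_of_light_km_s = 299792.0
earth_to_mars_km = 225e6   # average Earth-Mars distance; varies a lot with the orbits
ny_to_paris_km = 5837.0    # great-circle distance

print(earth_to_mars_km / speed_of_light_km_s / 60)      # ~12.5 minutes, one way
print(2 * ny_to_paris_km / speed_of_light_km_s * 1000)  # ~39 ms round trip, in a vacuum
```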
Replication lag is not limited to the RDBMS. It’s present in MongoDB, Redis, and every other distributed database on the planet. It will always be the case. Unless we master wormholes maybe. I’m not counting on wormholes anytime soon though.
I’m not using replication!!
If you’re working in an environment where you have your entire database on 1 machine, you’re in a situation I like to call “medium data”. Are you free from the complexity of distributed databases?
The answer, unfortunately, is “it depends”. Are you using separate search software, like Solr or Elasticsearch? Guess what? You’ve got a distributed system. Are you using a cache? Distributed system. It’s pretty easy to have an application where the same data is written in multiple places and it’s possible to get inconsistent views. For instance, maybe you’ve deleted a user out of your database but the search index hasn’t received the update yet, so when you query search you get back a user that no longer exists. Or you need to bust caches whenever a record is updated. There are race conditions and edge cases that can leave your data appearing out of sync for anywhere from seconds to minutes. It’s important to understand what they are and be prepared for the inevitable.
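To make the race concrete, here’s a minimal sketch of that delete-the-user scenario. The db and search handles are hypothetical stand-ins for whatever database and search clients you actually use:

```python
# Hypothetical handles: 'db' is your relational database connection,
# 'search' is your Solr/Elasticsearch client.
def delete_user(db, search, user_id):
    # Write #1: the database forgets the user immediately.
    db.execute("DELETE FROM users WHERE id = %s", (user_id,))

    # If we crash, time out, or hit a long GC pause right here, write #2 never
    # happens and search keeps returning a user that no longer exists. Even on
    # the happy path there's a window where the two systems disagree.
    search.delete(index="users", id=user_id)
```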
If you’re not yet working with a distributed system, what happens when you outgrow a single server where you get these convenient guarantees? Answer: You will have to go through your entire application looking for race conditions and edge cases. It’s not a fun migration process.
Assuming you are in the above camp of using replication or multiple systems, you’re already working with eventually consistent systems. You already code against them. If this is you, let’s please stop pretending we get the convenience of ACID in our distributed OLTP systems.
OK, I believe you, how does this work with Cassandra?
In the Cassandra world we have something called tunable consistency. This means we have the flexibility of picking between extremely high availability and strong consistency, at the cost of additional overhead.
To ensure high availability, we’re going to have multiple copies of our data. This is called our replication factor. Most of the time we choose 3 copies, so we say we’re RF=3. There isn’t a primary or master; you can read and write from any of the replicas. When we perform a write to Cassandra, we get to pick our consistency level. If you’re in a write-heavy environment, for example an analytics app, you might only want to wait for 1 of the 3 nodes to acknowledge the write was successful. That means you could have 2 of the 3 nodes responsible for the data down and your app will still be responsive. So we write at ONE:
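Here’s roughly what that looks like with the DataStax Python driver; the keyspace and table names are made up for the example:

```python
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])        # one contact point; the driver discovers the rest
session = cluster.connect("analytics")  # hypothetical keyspace

# ONE: return success as soon as a single replica acknowledges the write.
# The other two replicas receive the write in the background.
insert = SimpleStatement(
    "INSERT INTO pageviews (page, visitor) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.ONE)
session.execute(insert, ("/pricing", "visitor-42"))
```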
When we write at ONE, it’s important to know that the data will eventually be written to the remaining replicas. In fact, if the system is functioning normally, the other replicas receive the write immediately. We just choose not to wait, because it could take a little longer, and if there are server or network issues we don’t want to be affected. When everything is behaving normally again, the changes will replicate over and sync up with the rest of the cluster. We code defensively, planning for hardware or network failure.
Perhaps you’re building an application where you’d like to be more confident you’re getting the most recently written data. One option is to query at QUORUM, which means you need a majority of the replicas to agree. You can read and write at QUORUM to get strongly consistent behavior, at the potential cost of availability. With RF=3 you can survive the failure of 1 replica before you get a failure response. If a replica does go down, a process called Hinted Handoff ensures that when connectivity is restored, any missing data is sent over and the server becomes consistent again.
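A sketch of the same idea at QUORUM, reusing the session from the snippet above (the accounts table is again hypothetical):

```python
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

# QUORUM with RF=3 means 2 of the 3 replicas must respond, for both the
# write and the read, so the two requests always overlap on at least one replica.
write = SimpleStatement(
    "UPDATE accounts SET balance = %s WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM)
session.execute(write, (100, "acct-1"))

read = SimpleStatement(
    "SELECT balance FROM accounts WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM)
row = session.execute(read, ("acct-1",)).one()
```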
The last thing we’ll look at is Cassandra’s least HA option: lightweight transactions, abbreviated LWT. LWTs use a consensus algorithm called Paxos to provide linearizable consistency for the times when we absolutely must be strongly consistent. An example might be not selling the same concert seat to 2 people. It’s flexible enough to also be used as a replacement for some of Zookeeper’s functionality. This comes with tradeoffs, of course: potentially decreased availability and slower performance, but sometimes that’s acceptable.
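With the same driver, an LWT is just a conditional statement; the seats table here is hypothetical:

```python
from cassandra.query import SimpleStatement

# IF NOT EXISTS turns the INSERT into a lightweight transaction: Paxos makes
# sure only one of two concurrent buyers actually gets the seat.
claim = SimpleStatement(
    "INSERT INTO seats (concert, seat, owner) VALUES (%s, %s, %s) IF NOT EXISTS")
result = session.execute(claim, ("the-cure-2015", "A12", "visitor-42"))

if not result.was_applied:
    print("Sorry, that seat is already taken.")
```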
When we say Cassandra has tunable consistency, you, the developer, get to turn the knobs to get the best tradeoff of consistency and availability for your application. There are many cases where a developer asks for very strong consistency when in reality it doesn’t matter. Consider that in an active system, by the time we’ve streamed bytes from the database to the user, the user’s view is already out of date. There’s very little point in getting wrapped up in requirements that can never be correctly satisfied.
I hope this has cleared up some misconceptions about eventual consistency. You can find me on Twitter as @rustyrazorblade if you’ve got any questions!
If you found this post helpful, please consider sharing to your network. I'm also available to help you be successful with your distributed systems! Please reach out if you're interested in working with me, and I'll be happy to schedule a free one-hour consultation.