For a quick refresher, these are the leading factors that impact node density:
Streaming Throughput
Compaction Throughput and Strategies
Various Aspects of Repair
Query Throughput
Garbage Collection and Memory Management
Efficient Disk Access (this post)
Compression Performance and Ratio
Linearly Scaling Subsystems with CPU Core Count and Memory
Why Disk Access Matters for Node Density
In my years operating and optimizing Cassandra clusters, I’ve found that efficient disk access becomes significantly more important as node density increases. When your nodes hold just 1-2TB of data, you might barely notice inefficient disk access patterns. But push that to 10-20TB per node, and those same inefficiencies turn into critical bottlenecks that can cripple your entire system.
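To make the access-pattern point concrete, here is a hypothetical micro-benchmark (not from this post) that reads the same file once sequentially and once in shuffled order. On a cold cache and a large enough file, the random pass is typically much slower; on a small file like this one, the page cache may hide most of the gap, so treat it as an illustration rather than a measurement tool.

```python
import os
import random
import tempfile
import time

# Assumed parameters for illustration only: 4 KiB reads over an 8 MiB file.
# Real SSTables are orders of magnitude larger, which is when random access hurts.
BLOCK = 4096
BLOCKS = 2048

def read_blocks(path, offsets):
    """Read BLOCK bytes at each offset; return total bytes read."""
    total = 0
    with open(path, "rb") as f:
        for off in offsets:
            f.seek(off)
            total += len(f.read(BLOCK))
    return total

# Create a throwaway file filled with random bytes.
path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(os.urandom(BLOCK * BLOCKS))

sequential = [i * BLOCK for i in range(BLOCKS)]
shuffled = sequential[:]
random.shuffle(shuffled)

t0 = time.perf_counter()
seq_bytes = read_blocks(path, sequential)
t1 = time.perf_counter()
rand_bytes = read_blocks(path, shuffled)
t2 = time.perf_counter()

# Both passes read the identical data; only the access order differs.
assert seq_bytes == rand_bytes == BLOCK * BLOCKS
print(f"sequential: {t1 - t0:.4f}s, random: {t2 - t1:.4f}s")
```

The same total work is done in both passes; only the ordering changes, which is exactly the kind of difference that stays invisible at low density and dominates at high density.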
One of the challenges of running large-scale distributed systems is pinpointing problems. It’s all too common to blame a random component (usually the database) whenever there’s a hiccup, even when there’s no evidence to support the claim. We’ve already discussed the importance of monitoring, graphing and alerting on metrics, and using distributed tracing systems like Zipkin to correctly identify the source of a problem in a complex system.