Hardware Profiling with bcc-tools
Welcome back to my series on performance profiling! In this second installment, we’re shifting our focus towards a hands-on exploration of our hardware using the power of bcc-tools. Building on the foundational knowledge from our first post about flame graphs, we are now ready to delve deeper into the intricacies of system performance. These tools are a great way to find out whether we’re hitting bottlenecks in our hardware.
For a little background, bcc-tools is a collection of utilities that expose underlying kernel statistics through eBPF. eBPF is an amazing technology that lets us tap into what’s going on in the Linux kernel with negligible performance overhead. Fun stuff!
Our journey today revolves around several utilities in bcc-tools: cachestat, biolatency, xfsslower, and xfsdist. Each of these tools serves as a different window into some aspect of the I/O system of our machines, offering insights that go far beyond the capabilities of traditional tools like iostat or dstat. While those conventional tools provide a broad overview, they often miss critical nuances that can be the key to understanding our systems, especially since the numbers they report are often averages. I’ve chosen these tools because they have been incredibly valuable in my career. On multiple occasions they provided the exact insight we needed during performance analysis, both during my time as a consultant and while optimizing Cassandra clusters at Netflix. There are many other utilities in this package, covering many aspects of our hardware, that I recommend you get familiar with after you’ve finished reading this post.
A note on the names of the tools
You might notice that the tools in the examples below are suffixed with -bpfcc, while the ones I listed above are not. This is due to the way the package is built for Ubuntu by the distro. You can install the utilities other ways - for more on this, please check out the INSTALL document or this post on Linode showing a variety of ways to install the tools. The tools are the same regardless of how you install them, so I’ll refer to them as they’re listed in the original repo and project documentation.
Installation
I’ve included the kernel headers here for completeness. You may not need them; it seems to work differently on different distributions. Older versions of Red Hat fail miserably when trying to run a lot of these commands, so you’ll want to use a newer kernel. The examples I’m using here come from my home server running Ubuntu 18.04.4 LTS. Here’s how I’ve installed the package using Ubuntu’s repository:
sudo apt-get install bpfcc-tools linux-headers-$(uname -r)
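Once the package is installed, a quick sanity check is to run any one of the tools for a single one-second interval and make sure it can attach its probes (again using Ubuntu’s -bpfcc naming):
sudo cachestat-bpfcc 1 1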
To make the tools report something useful, I need to make sure they’re doing some work. I’m using this fio profile to ensure the disks are serving requests. Fio is an excellent I/O benchmarking tool I’ve used in a lot of different situations to test disk performance. Here’s the fio configuration:
[global]
bs=4K
iodepth=256
direct=0
ioengine=libaio
group_reporting
time_based
runtime=5m
numjobs=4
name=raw-randread
rw=randread
directory=./data
size=1g
rate=100m
[job1]
filename=test
I’ve saved it in a text file and started fio like this:
fio config.txt
During the tests I’ll occasionally run the following command to drop the page cache, to force more I/O activity:
echo 3 | sudo tee /proc/sys/vm/drop_caches
Let’s take a look at each of the tools.
biolatency
biolatency is a very useful tool to have in your tool belt. It tracks the response time for requests at the device level, which is really helpful when trying to determine if I/O is the cause of your database’s latency. This has been an essential tool when trying to determine if a SAN or EBS is responsible for slow response times out of a database. I’ve worked with several folks who have been quick to blame EBS or their SAN, only to find out the disk’s performance has been perfectly fine the entire time. I generally tell people to avoid EBS, but that doesn’t mean it’s always the root cause of a performance problem. It’s a good idea to always verify you actually have the problem you think you do before you start trying to fix it!
biolatency accepts a few parameters. By default it reports time in microseconds; you can switch to milliseconds with the -m flag, but you’ll lose resolution on the response times, which may not be what you want. Like other *stat-style tools, it can accept an interval and a count. I recommend watching for at least 30 seconds to a minute to rule out outliers.
root@vandamme:~# biolatency-bpfcc 1 10
Tracing block device I/O... Hit Ctrl-C to end.
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 2 |*** |
32 -> 63 : 0 | |
64 -> 127 : 1 |* |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 4 |****** |
1024 -> 2047 : 23 |****************************************|
2048 -> 4095 : 7 |************ |
This is about what I’d expect from my Samsung SSD 870 - most of the requests are handled between 1 and 4ms.
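If you’d rather read the histogram in milliseconds, combine the -m flag with a longer interval; here a single 30-second sample, per the recommendation above (the -bpfcc suffix is just Ubuntu’s packaging):
sudo biolatency-bpfcc -m 30 1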
cachestat
Having a solid understanding of your page cache hit rate can be extremely informative in several situations. Fetching data from memory is going to be faster than fetching it from disk, even with very fast NVMe drives. This leads us to the assumption that we want to fit as much data in the page cache as possible, and it can become easy to jump to the conclusion that anything that takes memory away from the page cache is bad. Unfortunately, this might not actually be true. This is a common misconception when doing JVM tuning. I’ve heard people say on several occasions that we can’t give more memory to the JVM because it’ll make the page cache less effective, but they never bother to test whether that’s actually true, or what the impact of a lower page cache hit rate would be.
Depending on your drives, you may find the increased latency from occasionally going to disk decreases JVM pause time and/or frequency, improving your tail latency, especially with faster disks. The response time on NVMe drives is mind-bogglingly fast even at high throughput. I’ve run tests showing > 2GB/s at 100 microsecond p99 latency, with max latency never spiking over 2ms. Losing a tiny amount of page cache hit rate might be worth the tradeoff in that case. When working with tables that store lots of tiny partitions, you’ll probably want to give the extra memory to the JVM for compression metadata.
Like biolatency, it accepts an interval and a count.
root@vandamme:~# cachestat-bpfcc 1 30
HITS MISSES DIRTIES READ_HIT% WRITE_HIT% BUFFERS_MB CACHED_MB
102989 0 0 100.0% 0.0% 10 1589
101581 0 0 100.0% 0.0% 10 1589
89928 1031 39 98.8% 1.1% 1 562
19243 18420 0 51.1% 48.9% 1 635
20840 18417 0 53.1% 46.9% 1 707
22867 18430 2 55.4% 44.6% 1 778
24927 18315 0 57.6% 42.4% 1 850
27529 18282 0 60.1% 39.9% 1 921
30444 18258 11 62.5% 37.5% 1 993
33881 18073 0 65.2% 34.8% 1 1063
38042 17923 0 68.0% 32.0% 1 1133
I dropped the page cache to ensure we’d have page cache misses. You can see how the hit rate drops suddenly then starts to pick back up as the random reads access more and more of the file.
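To reproduce this yourself, run cachestat in one terminal and drop the page cache from another partway through - this just combines the commands shown earlier in the post:
# terminal 1: sample page cache stats every second for 30 seconds
sudo cachestat-bpfcc 1 30
# terminal 2, a few seconds in: drop the page cache to force misses
echo 3 | sudo tee /proc/sys/vm/drop_caches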
xfsslower
This utility is helpful for understanding how your filesystem is behaving. Most of the time, your filesystem operations should be pretty performant since they’re typically writing to page cache, meaning they’re not waiting for fsync. If you’re using a network drive like a SAN, you’ll probably see an S under the T column, which marks a sync forcing dirty pages to disk - often your commit log when you’re looking at Cassandra. I’m using xfsslower here, but there are equivalent utilities for each filesystem: ext4slower, zfsslower, etc.
root@vandamme:~# xfsslower-bpfcc 1
Tracing XFS operations slower than 1 ms
TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME
23:10:13 fio 30934 R 4096 929932 1.47 test
23:10:15 fio 30933 R 4096 193008 3.83 test
23:10:15 fio 30933 R 4096 193016 1.47 test
23:10:15 fio 30933 R 4096 193184 1.65 test
23:10:15 fio 30933 R 4096 193188 1.56 test
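The positional argument is the latency threshold in milliseconds, so raising it is an easy way to cut down on noise once you know roughly what “slow” means for your workload; for example, only reporting operations over 10ms:
sudo xfsslower-bpfcc 10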
xfsdist
This is a bit like biolatency, but it records all filesystem requests, including the ones served from the page cache. There’s a clear relationship between all the utilities we’ve seen here, as they all relate to some aspect of I/O, just at different layers. As requests to the system go up, you’ll probably see the filesystem slow down if your cache becomes less effective, sending more requests to your block device. This can cause big problems, especially with disks that use an IOPS quota, as performance will go straight to zero once you’ve exceeded your limit.
root@vandamme:~# xfsdist-bpfcc 1 10 -m
Tracing XFS operation latency... Hit Ctrl-C to end.
23:51:18:
operation = 'read'
msecs : count distribution
0 -> 1 : 110925 |****************************************|
23:51:19:
operation = 'read'
msecs : count distribution
0 -> 1 : 30715 |****************************************|
2 -> 3 : 73 | |
4 -> 7 : 19 | |
8 -> 15 : 1 | |
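xfsdist also takes a -p flag if you want to restrict tracing to a single PID (check xfsdist -h on your version); here’s a sketch narrowing the histogram to the fio workload, where pgrep -n simply grabs the most recently started fio process:
sudo xfsdist-bpfcc -m -p $(pgrep -n fio) 1 10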
At this point you should be ready to jump into the world of bcc-tools! There are quite a few utilities we haven’t even looked at yet, exposing a wide range of kernel internals, including networking, threading, CPU scheduling, and a variety of other views into the I/O system. I highly recommend installing the tools and getting used to them, as they’ve been essential in solving several tricky problems over the years.
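A few others I find myself reaching for regularly - shown here with their upstream names, so add the -bpfcc suffix if you installed Ubuntu’s package:
sudo runqlat 1 10    # histogram of CPU run queue (scheduler) latency
sudo biosnoop        # per-request block I/O trace with latency
sudo tcplife         # lifespan and throughput of TCP sessions
sudo execsnoop       # trace new process execution across the system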
If you found this post helpful, please consider sharing it with your network. I'm also available to help you be successful with your distributed systems! Please reach out if you're interested in working with me, and I'll be happy to schedule a free one-hour consultation.