Surprising Results Benchmarking Direct IO vs. Buffered IO using fio
I'm writing a program that needs to write a lot of data sequentially to disk. There are several parameters I can vary that affect the throughput of my writes - the size of blocks I write to disk, how many outbound buffers I use, etc.
Today, I was trying to understand how different parameters like these affect throughput using fio, a tool that lets you simulate various kinds of workloads that read from and write to disk. This is my first time using fio, so I'm learning some interesting things.
The default for fio is to do buffered IO. This means that when it writes data, that data is first copied into the kernel page cache, and later, that data is actually written to disk by the kernel.
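To make that concrete, here's a minimal C sketch of a buffered write (hypothetical file path, most error handling omitted). The key point is that write() returns once the data has been copied into the page cache, not once it has reached the disk.

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    /* A plain open(): writes go through the kernel page cache. */
    int fd = open("/data/example_file", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return 1;

    char buf[32 * 1024];
    memset(buf, 'x', sizeof(buf));

    /* Returns as soon as the data has been copied into the page cache;
       the kernel writes it back to disk at some later point. */
    if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
        return 1;

    /* Note: close() does not force the data to disk either. */
    close(fd);
    return 0;
}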
One of the available options in fio is the direct flag. This flag causes fio to instead use direct IO when writing data to disk. Direct IO writes data directly to disk, without first copying it to the kernel page cache. In cases where you know you won't be needing the benefits of the kernel page cache, you can use direct IO to (in theory) improve throughput when writing a lot of data to disk.
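For comparison, here's a minimal sketch of the direct IO path on Linux, again with a hypothetical file path. One detail the sketch assumes: with O_DIRECT, the buffer address, transfer size, and file offset generally have to be aligned to the device's logical block size (4096 bytes is used here as a common value, but check your device), which is why posix_memalign() is needed.

#define _GNU_SOURCE /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    /* O_DIRECT: writes bypass the page cache and go straight to the device. */
    int fd = open("/data/example_file",
                  O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (fd < 0)
        return 1;

    /* Direct IO requires aligned buffers and transfer sizes. */
    size_t len = 32 * 1024;
    void *buf;
    if (posix_memalign(&buf, 4096, len) != 0)
        return 1;
    memset(buf, 'x', len);

    if (write(fd, buf, len) != (ssize_t)len)
        return 1;

    free(buf);
    close(fd);
    return 0;
}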
In the project I'm working on, I have been using direct IO because I assumed it was a good fit. So when exploring performance with fio, I had been using the direct flag, and testing things like varying block sizes, varying numbers of output buffers, and so on, to understand what the upper bound of performance would be given these different parameters.
Here's some sample output:
# fio --randrepeat=1 --ioengine=io_uring --gtod_reduce=1 --name=test --filename=/data/test_file --bs=32k --iodepth=64 --size=4G --readwrite=write --numjobs=1 --group_reporting --direct=1
... output omitted ...
Run status group 0 (all jobs):
WRITE: bw=205MiB/s (214MB/s), 205MiB/s-205MiB/s (214MB/s-214MB/s), io=4096MiB (4295MB), run=20025-20025msec
... output omitted ...
You can see the average write throughput is 214MB/s.
But then I decided to turn off the direct flag, and I was surprised by what I saw!
# fio --randrepeat=1 --ioengine=io_uring --gtod_reduce=1 --name=test --filename=/data/test_file --bs=32k --iodepth=64 --size=4G --readwrite=write --numjobs=1 --group_reporting
... output omitted ...
Run status group 0 (all jobs):
WRITE: bw=3714MiB/s (3894MB/s), 3714MiB/s-3714MiB/s (3894MB/s-3894MB/s), io=4096MiB (4295MB), run=1103-1103msec
... output omitted ...
Throughput went from 214MB/s to 3894MB/s! What the heck!
My first theory was that the buffered IO version wasn't actually waiting for data to sync to disk. My machine has 29.4GB of RAM, so the whole 4GB file we're writing would fit entirely in RAM.
It turns out there's an fio flag called fsync_on_close that causes an fsync to occur when the file is closed. This would force data to be flushed from the page cache to the actual disk.
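In code terms, fsync_on_close is roughly equivalent to calling fsync() on the file descriptor right before closing it. A sketch of that pattern (close_with_fsync is just an illustrative helper, not part of any API):

#include <unistd.h>

/* Roughly what fio's fsync_on_close does: block until the file's dirty
   pages (and metadata) have been written to the device, then close it. */
int close_with_fsync(int fd) {
    int ret = fsync(fd);
    close(fd);
    return ret;
}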
Sure enough, that caused the throughput to drop back down quite a bit!
# fio --randrepeat=1 --ioengine=io_uring --gtod_reduce=1 --name=test --filename=/data/test_file --bs=32k --iodepth=64 --size=4G --readwrite=write --numjobs=1 --group_reporting --fsync_on_close=1
... output omitted ...
Run status group 0 (all jobs):
WRITE: bw=438MiB/s (459MB/s), 438MiB/s-438MiB/s (459MB/s-459MB/s), io=4096MiB (4295MB), run=9351-9351msec
... output omitted ...
You can see the new throughput is 459MB/s. So it seems like my theory was mostly correct.
However, direct IO does not necessarily guarantee that all data is flushed to disk when it completes. So to have a fair comparison, I reran the direct IO test, this time also including the fsync_on_close flag:
# fio --randrepeat=1 --ioengine=io_uring --gtod_reduce=1 --name=test --filename=/data/test_file --bs=32k --iodepth=64 --size=4G --readwrite=write --numjobs=1 --group_reporting --fsync_on_close=1 --direct=1
... output omitted ...
Run status group 0 (all jobs):
WRITE: bw=218MiB/s (229MB/s), 218MiB/s-218MiB/s (229MB/s-229MB/s), io=4096MiB (4295MB), run=18752-18752msec
... output omitted ...
You can see that the write throughput is 229MB/s this time - in the same ballpark as running it without fsync_on_close.
What this tells me (or at least confirms for me) is that even though there's no guarantee that all data is flushed to disk when a direct IO write completes, in practice most of it has probably been flushed already. So fsyncing when the file is closed doesn't really affect performance.
If you're paying attention, there's still one unexplained mystery. In the buffered IO case, we're still getting a throughput of 459MB/s, which is higher than what I was getting with direct IO. Why is that?
Well, one might assume that when data is flushed from the kernel page cache to disk, the kernel is doing that in the most efficient way possible. But our fio direct IO invocation is not completely tuned.
It turns out that increasing the block size from 32K to 64K does the trick!
# fio --randrepeat=1 --ioengine=io_uring --gtod_reduce=1 --name=test --filename=/data/test_file --bs=64k --iodepth=64 --size=4G --readwrite=write --numjobs=1 --group_reporting --fsync_on_close=1 --direct=1
... output omitted ...
WRITE: bw=464MiB/s (487MB/s), 464MiB/s-464MiB/s (487MB/s-487MB/s), io=4096MiB (4295MB), run=8819-8819msec
... output omitted ...
So you can see we've gotten back up to 487MB/s, and with the larger block size, direct IO is indeed as performant as buffered IO.
But shouldn't direct IO have better performance than buffered IO?
Let's focus on a case where it seems like direct IO might be better: When the file being written is larger than memory. In this case, there should be more frequent page cache flushes while the writing is occurring, in addition to fsyncing at the end.
So I reran the tests with a 64G file to see what happened:
Buffered IO gets 382MB/s:
# fio --randrepeat=1 --ioengine=io_uring --gtod_reduce=1 --name=test --filename=/data/test_file --bs=64k --iodepth=64 --size=64G --readwrite=write --numjobs=1 --group_reporting --fsync_on_close=1
... output omitted ...
WRITE: bw=364MiB/s (382MB/s), 364MiB/s-364MiB/s (382MB/s-382MB/s), io=64.0GiB (68.7GB), run=179834-179834msec
... output omitted ...
Direct IO gets 382MB/s:
# fio --randrepeat=1 --ioengine=io_uring --gtod_reduce=1 --name=test --filename=/data/test_file --bs=64k --iodepth=64 --size=64G --readwrite=write --numjobs=1 --group_reporting --fsync_on_close=1 --direct=1
... output omitted ...
WRITE: bw=364MiB/s (382MB/s), 364MiB/s-364MiB/s (382MB/s-382MB/s), io=64.0GiB (68.7GB), run=179935-179935msec
... output omitted ...
Turns out they get literally the exact same throughput (382MB/s).
So what's missing here? I don't actually know!
Maybe since nothing else is competing for the page cache, flushes to disk don't have to happen as frequently, and if something were competing for the page cache, buffered IO would show more overhead?
But still, I would expect direct IO to benefit more from avoiding the copy into the page cache. It might be that the copy is just much cheaper than one might think.
Setting aside why the performance isn't better: even if throughput were exactly the same, in cases where the data you're writing isn't going to be read again soon, direct IO skipping the page cache would still be useful, since it leaves more of the page cache available for reads.
But I feel like I'm missing something else here. If you know what it is, please send me an email.
In any case, it seems like if you want to use direct IO in a way that outperforms buffered IO, you'd better really know what you're doing!