The arrow of time

Ivan Voras' blog

UFS read-ahead

After 10 years of conservative tuning, I've recently increased the default read-ahead (vfs.read_max) in FreeBSD from 128 KiB to a whopping 512 KiB. And of course, I promptly received an e-mail from a concerned developer asking if that is perhaps too high :)

I'll try to illustrate how much impact read_max can have with an example from a machine I'm currently configuring.

The machine is a small but modern server with two 7200 RPM SATA drives, which I've configured to use AHCI and gmirror for software RAID 1. To jump straight to the point, look at these three bonnie++ results on this configuration and notice the rise in read bandwidth:

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
fkit.fer.hr      8G   830  99 71842   9 37219   6  1432  99 80352   9 331.2   5
Latency                 9948us     762ms    1056ms   15487us   49611us     293ms

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
fkit.fer.hr      8G   806  99 72148   8 37570   7  1410  99 140117  18 327.3   4
Latency                10976us     762ms     748ms    8429us   58998us    3352ms

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
fkit.fer.hr      8G   826  99 72233   8 37741   7  1417  98 141587  18 327.0   5
Latency                10348us     438ms     993ms   38911us   36932us    3330ms

The first run was performed with read_max=32 (the new default, equivalent to 512 KiB), the second with read_max=128 (equivalent to 2 MiB) and the third with read_max=256 (equivalent to 4 MiB, which has proven to be over the usable limit). The gmirror was configured with the default load balancing algorithm.
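Since vfs.read_max is an ordinary sysctl, the three settings can be switched on the fly between runs. A minimal sketch of the commands involved, assuming the 16 KiB UFS block size:

sysctl vfs.read_max=32     # new default: 512 KiB of read-ahead
sysctl vfs.read_max=128    # 2 MiB
sysctl vfs.read_max=256    # 4 MiB, already past the useful limit here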

Notice the rise from 80 MB/s (the average performance of a single drive) to 140 MB/s (roughly 1.75x the single-drive figure, delivered by the two drives together), achieved with nothing but an increase in the read-ahead setting.

The magic by which this happens is actually not in gmirror or GEOM or even UFS (up to a point), but in the NCQ facility of the hard drives in question. Looking at real-time IO statistics with gstat, it is easy to see how it works:

   11    479    479  61252   14.8      0      0    0.0   96.1| ada0
    5    567    567  72121   10.6      0      0    0.0   98.6| ada1
   16   1046   1046 133374   12.5      0      0    0.0   99.9| mirror/data

The first number is the length of the IO queue, the second is total IOPS, the third is read IOPS, the fourth is read bandwidth in KB/s and the fifth is the average read latency in ms. The next three columns are analogous for write operations, and the last one is the percentage of time the device was busy.
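A hedged example of how such a snapshot can be captured while the benchmark runs; the -f filter regex below is only an illustration:

# Show only the two disks and the mirror, refreshing in real time
gstat -f 'ada[01]|mirror/data'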

By setting read_max to 128 (2 MiB of read-ahead with 16 KiB blocks) we can have 2048 / 128 = 16 queued IO operations on the mirror device ("mirror/data"), which are spread across the individual drives. The drives then perform large reads on their own and return the data, so it is already available when UFS demands it.
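To make that arithmetic reproducible, here is a small sh sketch; the device path is an example, dumpfs typically needs root, and MAXPHYS is assumed to be the usual 128 KiB:

# UFS block size of the filesystem, e.g. 16384
bsize=$(dumpfs /dev/mirror/data | awk '$1 == "bsize" { print $2; exit }')
rmax=$(sysctl -n vfs.read_max)    # e.g. 128
echo "read-ahead window: $(( bsize * rmax / 1024 )) KiB"
echo "queued MAXPHYS requests: $(( bsize * rmax / 131072 ))"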

Unfortunately, I think vfs.read_max is one of the things ZFS ignores at the VFS level, so I don't know how to control the equivalent behaviour in ZFS.

The equivalent settings for the write side of the IO are vfs.hirunningspace and vfs.lorunningspace, which control the hard limit on the total size of queued write requests and the level this total is allowed to drain back down to (the difference between them is what gets flushed to the drives in the "wdrain" IO state). I've increased their defaults from 512 KiB / 1 MiB to values scaled by the amount of RAM in the machine, so they will be roughly 2.5 MiB / 4 MiB on a machine with 4 GiB of RAM. This translates to up to 4096 / 128 = 32 full-size (MAXPHYS) write requests being handed over to NCQ. The results are similarly interesting, but I'll demonstrate them some other time.
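As a quick check, the current limits can be read back and the arithmetic verified (the values are in bytes; 4 MiB is the hirunningspace figure mentioned above):

sysctl vfs.lorunningspace vfs.hirunningspace    # current limits, in bytes
echo $(( 4194304 / 131072 ))                    # 4 MiB / MAXPHYS = 32 requests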

All these variables are sysctls which can be set at run-time. I've only changed their defaults, not their implementations, and the change will first appear in 9.0-RELEASE. Until then, there is nothing stopping anyone from simply setting the new values manually in /etc/sysctl.conf :)
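For example, a /etc/sysctl.conf fragment with the values discussed in this post might look like this (a sketch only, adjust to taste and benchmark):

# Read-ahead: 128 blocks = 2 MiB with 16 KiB UFS blocks
vfs.read_max=128
# Write-side limits, roughly the values for a 4 GiB machine (in bytes)
vfs.lorunningspace=2621440
vfs.hirunningspace=4194304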

#1 Re: UFS read-ahead

Added on 2010-11-19T23:11 by randallmcm

Great article

Starting/stopping Akonadi Server (with mysql backend) would take ~6 secs on vfs.read_max=32.

Switching to vfs.read_max=128 results in ~2 sec start/stop.

#2 Re: UFS read-ahead

Added on 2010-11-20T06:13 by madtrader

I was recently working on some ZFS tuning for an HTPC (on FreeBSD 8.0) and looking at these kinds of values. Adjustments I looked at and tweaked:

vfs.zfs.txg.timeout
vfs.zfs.vdev.max_pending
vfs.zfs.zfetch.block_cap

The first you see a lot in forums.  It's not precisely what you're tweaking here, but is important if you're looking at ZFS.  The second is related to queue depth.  The default value in 8.0 is 35, which I found much too high for my workload.  I've read that some newer versions of ZFS default to 10.  The third is most closely related to vfs.read_max.  It's the number of blocks used in read-ahead when ZFS prefetch is enabled.  The default value in 8.0 is 256.  This could be too high as well.  There's one more tunable that isn't exposed in FreeBSD 8.0.  In Solaris it's zfs_write_limit_override.  It allows you to set the size of batches written to disk.  I'm told that adjusting this alone may negate the need to tweak the other values.
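For reference, all three can be set from /boot/loader.conf; a hedged sketch, with placeholder numbers rather than recommendations:

# /boot/loader.conf -- placeholder values, benchmark before adopting any of them
vfs.zfs.txg.timeout="5"
vfs.zfs.vdev.max_pending="10"
vfs.zfs.zfetch.block_cap="64"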

#3 Re: UFS read-ahead

Added on 2012-02-20T15:43 by Ray

Hi, thank you for this info. I am using an SSD drive and was wondering what setting I should use for vfs.read_max. Should I still set it to 128, or something higher?

#4 Re: UFS read-ahead

Added on 2012-02-20T15:52 by Ivan Voras

SSDs have much lower latencies than mechanical drives, so you don't need as much read-ahead to work around those latencies, but it may still be beneficial. There isn't a "one size fits all" answer and you should simply try different sizes and benchmark them, but I suspect that raising read_max above 128 will not yield noticeable benefits, because 128 corresponds to 2 MiB of read-ahead with 16 KiB blocks (the default in FreeBSD <= 8) or 4 MiB with 32 KiB blocks (the default in FreeBSD >= 9).

#5 Re: UFS read-ahead

Added on 2012-02-20T17:15 by Ray

Hi, thanks for the info. I am running an SSD and I was wondering if I should set vfs.read_max to 128 or something else?

Thank you

#6 Re: UFS read-ahead

Added on 2012-02-20T17:17 by Ray

Hi, sorry, I did not refresh my page so I did not see your answer. Thanks for the fast response; I will leave it at 128.
