vfs.hirunningspace and disk write latency performance

A while ago I increased the default value of the vfs.hirunningspace tunable. This greatly helps performance when the disk system supports tagged queueing (e.g. NCQ), because many more requests can be offloaded to the controller and/or the drive(s). But deep queues bring their own problems, especially in pathological cases.

vfs.hirunningspace controls how much write data (in bytes) can be outstanding in write IO requests at any one time. This is orthogonal to whether the writes are synchronous or asynchronous, and it covers all write IO happening in parallel. The historical default was 1 MiB, meaning at most 1 MiB of data could be queued globally for writing to devices. That is a very small amount of data these days.

The new default is autotuned, with an upper limit of 16 MiB. With the 16 MiB limit, up to 128 IO requests of 128 KiB each can be queued, which is enough to fill most RAID controllers' buffers. On larger systems, the value can be set manually to anything you like.
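The arithmetic behind those numbers is simple enough to check by hand, but here it is as a small Python sketch. The function name is made up for illustration, and the 128 KiB request size is just the figure used above, not something the kernel guarantees.

```python
KIB = 1024
MIB = 1024 * KIB

def max_outstanding_requests(hirunningspace_bytes, request_bytes):
    """How many write requests of one fixed size fit under the cap."""
    return hirunningspace_bytes // request_bytes

# New upper limit: 16 MiB of outstanding write data.
print(max_outstanding_requests(16 * MIB, 128 * KIB))  # 128
# Historical default: 1 MiB.
print(max_outstanding_requests(1 * MIB, 128 * KIB))   # 8
```

So the old default allowed only 8 such requests in flight across the whole system, which is less than a single NCQ-capable drive can accept.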

The downside is that big queues cost some latency and interactivity. This is especially visible in the pathological case where a high-speed drive is used in parallel with a slow (or broken) one. I have just such a thing: a USB flash drive which is both high-latency (it occasionally doesn't respond to IO for hundreds of ms) and low-bandwidth (around 2 MiB/s). Basically, it's broken. Writing to the regular disk drive (which can easily accept more than 80 MiB/s) and the flash drive at the same time reduces the apparent disk drive bandwidth to about 50 MiB/s and makes it look "bursty" instead of sustained.

What is happening is that the queue is shared. (It is a virtual queue, governed only by this global cap; there are no real global queues here.) The fast device completes its requests quickly and frees its share of the space, while requests for the slow device linger. Over time the outstanding space fills up almost entirely with "slow" requests, and the faster device is penalized.
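To make the mechanism concrete, here is a toy simulation in Python. It is only a sketch, and the assumptions are mine rather than the kernel's: requests are a fixed 128 KiB, each device services one request at a time, every device has a writer that always has more data, and freed space is handed to the writers in strict round-robin. The service times are derived from the rough bandwidths mentioned above (80 MiB/s and 2 MiB/s).

```python
import heapq

def simulate(cap_slots, service_us, duration_us):
    """Writers sharing one global cap on in-flight requests.

    service_us maps a device name to the time (in microseconds) it
    needs per request. Returns (completed, inflight) dicts per device.
    """
    names = list(service_us)
    inflight = {n: 0 for n in names}
    done = {n: 0 for n in names}
    busy_until = {n: 0 for n in names}  # when each device's queue drains
    events = []                         # heap of (completion_time, device)
    turn = 0
    now = 0
    while True:
        # Hand every free slot to the next writer in round-robin order.
        while sum(inflight.values()) < cap_slots:
            dev = names[turn % len(names)]
            turn += 1
            start = max(now, busy_until[dev])
            busy_until[dev] = start + service_us[dev]
            heapq.heappush(events, (busy_until[dev], dev))
            inflight[dev] += 1
        # Advance to the next completion.
        now, dev = heapq.heappop(events)
        if now > duration_us:
            break
        inflight[dev] -= 1
        done[dev] += 1
    return done, inflight

# 128 KiB per request: about 1.56 ms at 80 MiB/s, 62.5 ms at 2 MiB/s.
FAST_US, SLOW_US = 1563, 62_500
TEN_SECONDS = 10_000_000

alone_done, _ = simulate(128, {"disk": FAST_US}, TEN_SECONDS)
shared_done, shared_inflight = simulate(
    128, {"disk": FAST_US, "usb": SLOW_US}, TEN_SECONDS)

print("disk alone:    ", alone_done)
print("disk + usb:    ", shared_done)
print("slots at end:  ", shared_inflight)
```

In this model the in-flight slots drift to the slow device until it holds nearly all of them, and from then on the fast device only gets a slot when the slow one completes a request. That is harsher than what I saw in practice, since the real system does not hand out space in strict round-robin, but the direction is the same.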

This cannot be solved with drive-level IO scheduling; something like per-device bandwidth (or runningspace) reservations is needed.
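As a rough idea of what a runningspace reservation could look like, here is a minimal accounting sketch in Python. Nothing like this exists in the kernel as far as this post is concerned; the class, its methods and the reservation sizes are all hypothetical. Each device gets a guaranteed slice of the cap, and whatever is left over is a shared pool that any device may borrow from.

```python
KIB = 1024
MIB = 1024 * KIB

class RunningSpace:
    """Hypothetical global write cap with per-device reservations."""

    def __init__(self, total_bytes, reserved):
        if sum(reserved.values()) > total_bytes:
            raise ValueError("reservations exceed the global cap")
        self.reserved = dict(reserved)  # device -> guaranteed bytes
        self.shared = total_bytes - sum(reserved.values())
        self.inflight = {dev: 0 for dev in reserved}

    def _shared_in_use(self):
        # Only bytes beyond a device's own reservation use the pool.
        return sum(max(0, self.inflight[d] - self.reserved[d])
                   for d in self.inflight)

    def try_start(self, dev, nbytes):
        """Admit a write if it fits in the reservation or the pool."""
        self.inflight[dev] += nbytes
        if self._shared_in_use() > self.shared:
            self.inflight[dev] -= nbytes  # roll back, caller must wait
            return False
        return True

    def finish(self, dev, nbytes):
        """Account for a completed write."""
        self.inflight[dev] -= nbytes

# Made-up split of a 16 MiB cap: 8 MiB for the disk, 1 MiB for the
# USB stick, and the remaining 7 MiB shared.
rs = RunningSpace(16 * MIB, {"disk": 8 * MIB, "usb": 1 * MIB})

usb_admitted = 0
while rs.try_start("usb", 128 * KIB):
    usb_admitted += 1

disk_admitted = 0
while rs.try_start("disk", 128 * KIB):
    disk_admitted += 1

print(usb_admitted, disk_admitted)  # 64 64
```

Even when the slow device grabs everything it is allowed to (its own 1 MiB plus the whole shared pool), the disk can still keep its full reservation in flight. How to size the reservations sensibly is a separate question that this sketch does not answer.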

Of course, this behaviour is present with any hirunningspace size; a larger value just makes it a bit more visible.
