Further benchmarks, and a step back for consideration

So, when last I blogged about the ZFS RAID server, I may have ended on a down note, suggesting disappointment. I hope readers will understand that’s not the case.

When I started this project, I sat down and examined what was most important to me in the server.

My requirements:
  • Saturate an aggregated 2× GigE link for sustained reads and writes
  • Do it cheaply

My strong desires:
  • ZFS for its reliability, redundancy, flexibility, and ease of use
  • Maximise the amount of usable space

ZFS wasn’t a requirement. It couldn’t be: it’s a solution, and defining requirements in terms of pre-ordained solutions is, at best, compromised. Maximising IOPS wasn’t the first priority; sustained write performance was. Still, I want decent random seek performance, because there will always be cases where performance falls to that level.

I ran some additional Bonnie-64 tests to see the difference between the SATA controllers and their PCI-X buses. There was a small (a couple percent) but consistent difference between the two controllers. I believe one ran at 133MHz, and the other ran at 100MHz (but, to be honest, I don’t know what tools I would use to verify such a thing). So I moved a disk controller from the 100MHz bus to a PCI-X slot on the shared 133MHz bus, and ran the same tests as before.

The results are as follows:

Block Writes, MB/sec

I perceive a strong levelling-off of streaming write performance, at an even lower level than in the previous test. The peak for three 4+1 RAID-Z groups is 387.5 MB/sec, while the peak for five 2+1 groups is 354.5 MB/sec. The mirrored scenario’s limit is lower still, at 258.5 MB/sec.

Block Reads, MB/sec

The continuous read performance is even more interesting. It’s now clear that the two controllers are maxing out a single, contended PCI-X bus where they hadn’t before. The read limit is 520 MB/sec. That, to me, sounds very much like half the throughput of a 64-bit, 133MHz bus: 1064 MB/s in total, so 532 MB/s for half, and 520 is within 2.5% of that figure. One possible conclusion is that ZFS performs two reads for every block requested from disk, whether the pool is RAID-Z or mirrored.

Taking a step back, should we find significance in the fact that one-third of our PCI-X bus throughput is 354.67 MB/s, while the most we could squeeze out of the 2+1 RAID-Z configuration was 354.5? It would certainly square with what commenter “mrb” stated: for 2+1 RAID-Z sets, expect 50% higher throughput on reads than writes.
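The bus arithmetic behind those two comparisons is short enough to write out. This is only a sketch of the numbers quoted above, assuming the nominal 64-bit, 133MHz figures; the function and variable names are mine, not from any tool.

```python
# Peak theoretical PCI-X throughput: bus width in bytes times clock in MHz.
BUS_WIDTH_BITS = 64
BUS_CLOCK_MHZ = 133  # nominal clock used in the text


def pcix_peak_mb_s(width_bits: int, clock_mhz: float) -> float:
    """Theoretical peak in MB/s, ignoring all bus protocol overhead."""
    return width_bits / 8 * clock_mhz


peak = pcix_peak_mb_s(BUS_WIDTH_BITS, BUS_CLOCK_MHZ)
half = peak / 2    # candidate ceiling for reads
third = peak / 3   # candidate ceiling for 2+1 RAID-Z writes

measured_read = 520.0        # MB/sec, read plateau from the benchmark
measured_write_2p1 = 354.5   # MB/sec, peak write for five 2+1 groups

# How far each measurement falls short of its candidate ceiling, in percent.
read_gap_pct = (half - measured_read) / half * 100
write_gap_pct = (third - measured_write_2p1) / third * 100

# Ratio of the two ceilings: the "50% higher on reads" figure.
read_write_ratio = half / third

print(f"peak {peak:.0f}, half {half:.0f}, third {third:.2f} MB/s")
print(f"read gap {read_gap_pct:.2f}%, write gap {write_gap_pct:.2f}%")
print(f"read/write ceiling ratio {read_write_ratio:.2f}")
```

The fit is suggestive, but it rests on the bus really delivering its nominal peak, which I haven't measured.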

Random Seeks /sec

The random seek performance doesn’t yet tell me much, other than that the theory of IOPS scaling linearly with the number of vdevs or mirror disks simply does not hold on my system. Frankly, I’m stumped as to why it only increases logarithmically. Well, at least it increases monotonically.

Let’s pit theory against practice. I originally posted a crude, back-of-the-envelope model of read/write/random ZFS performance for 14 or 15 disks in 2+1, 4+1 or 2× mirror configurations. What happens to our experimental results when I factor out the (estimated) base performance of a single drive/vdev set?

Theory:

config            Random Reads   Sequential Read   Sequential Write   Capacity
RAIDZ: 3×(4+1)    3y             12z               12z                6.0TB
RAIDZ: 5×(2+1)    5y             10z               10z                5.0TB
mirror: 7×2       14y            14z               7z                 3.5TB

Practice:

config            Random Reads   Sequential Read   Sequential Write   Capacity
RAIDZ: 3×(4+1)    2.3y           7.2z              5.3z               6.0TB
RAIDZ: 5×(2+1)    3.3y           8.8z              5.2z               5.0TB
mirror: 7×2       3.8y           10.2z             4.8z               3.5TB
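For anyone who wants to play with the comparison, here is a rough sketch of one reading of the crude model that reproduces the theoretical figures above, along with the fraction of each prediction the measurements actually reached. The scaling rules are my interpretation of the table, the 0.5TB drive size is an assumption inferred from the capacity column, and the class and function names are made up for illustration.

```python
from dataclasses import dataclass

DRIVE_TB = 0.5  # assumption: drive size implied by the Capacity column


@dataclass(frozen=True)
class Layout:
    name: str
    vdevs: int
    data_disks: int  # data disks per vdev; 1 for a two-way mirror
    mirror: bool = False


def model(layout: Layout) -> tuple:
    """Return (random reads in y, seq. read in z, seq. write in z, capacity in TB)."""
    if layout.mirror:
        spindles = layout.vdevs * 2
        # Either side of a mirror can serve a read; a write lands on both sides.
        return (spindles, spindles, layout.vdevs, layout.vdevs * DRIVE_TB)
    data = layout.vdevs * layout.data_disks
    # A RAID-Z vdev seeks as one unit; streaming scales with its data disks.
    return (layout.vdevs, data, data, data * DRIVE_TB)


LAYOUTS = [
    Layout("RAIDZ: 3x(4+1)", vdevs=3, data_disks=4),
    Layout("RAIDZ: 5x(2+1)", vdevs=5, data_disks=2),
    Layout("mirror: 7x2", vdevs=7, data_disks=1, mirror=True),
]

# Measured multiples from the experimental table: (random y, read z, write z).
MEASURED = {
    "RAIDZ: 3x(4+1)": (2.3, 7.2, 5.3),
    "RAIDZ: 5x(2+1)": (3.3, 8.8, 5.2),
    "mirror: 7x2": (3.8, 10.2, 4.8),
}


def achieved(layout: Layout) -> tuple:
    """Fraction of the model's prediction that was actually measured."""
    predicted = model(layout)[:3]
    return tuple(round(m / p, 2) for m, p in zip(MEASURED[layout.name], predicted))


for layout in LAYOUTS:
    print(layout.name, model(layout), achieved(layout))
```

Seen this way, the mirror’s random reads fall furthest short of the model, while the 2+1 layout comes closest on sequential reads.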

It’s clear that ZFS is demanding enough that it can hit the limits of the PCI-X bus on a poorly thought-out system. I can sketch out those limits on my own system, in some cases. It’s also true that I could have chosen another motherboard with 2 independent 133MHz PCI-X buses, or gone with a PCIe solution that would have eliminated any concerns about bus bandwidth. In theory, with this many disks, I could be seeing twice the performance in some situations. However, I should look at the numbers: 390MB/s far exceeds my ability to get data into or out of the machine via the network.
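To put that last point in numbers, here is a small sketch comparing the peak write figures from this round of tests against the raw line rate of the aggregated link. It assumes the nominal 1000 Mbit/s per port and ignores protocol overhead, so the real network ceiling is lower still; the names are hypothetical.

```python
GIGE_MBIT_S = 1000  # nominal line rate of one Gigabit Ethernet port
LINKS = 2           # the aggregated pair from my requirements


def link_ceiling_mb_s(links: int, mbit_per_link: int = GIGE_MBIT_S) -> float:
    """Raw line rate in MB/s; protocol overhead would reduce this further."""
    return links * mbit_per_link / 8


ceiling = link_ceiling_mb_s(LINKS)

# Peak streaming writes from the benchmark, in MB/sec.
peak_writes = {
    "3x(4+1) RAID-Z": 387.5,
    "5x(2+1) RAID-Z": 354.5,
    "7x2 mirror": 258.5,
}

# Disk throughput as a multiple of what the network could ever deliver.
headroom = {name: round(mb_s / ceiling, 2) for name, mb_s in peak_writes.items()}

print(f"network ceiling: {ceiling:.0f} MB/s")
for name, factor in headroom.items():
    print(f"{name}: {factor}x the link")
```

Even the slowest layout for streaming writes, the mirror, stays just ahead of what two GigE ports can carry.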

The machine does what it is supposed to, and surprisingly affordably, too. Any “disappointment” I have is purely theoretical.

As a postscript, I should make a call out to anyone who would like further data with the 133+100MHz controller configuration. The server is leaving the workshop and going into the rack now, but the system will be under test for a few weeks more. Contact me via the comments, the contact form on this site, or the zfs-discuss list if you have a particular scenario you’d like me to run.