The arrow of time

Ivan Voras' blog

Progress in CPU architectures

I got access to a Sandy Bridge Xeon server for a while and decided to run some benchmarks on it. Usually I'd run unixbench from the FreeBSD ports, but it is very, very old, and the only reason I still use it is convenience. So, having some time, I started my own multiprocessor-friendly benchmark suite. Yup, I could have named it "yet another benchmark suite", since there are so many of them. Unfortunately, most are bad, and I will try to make this one into something reasonable. Anyway, out of curiosity, I did a comparison between some Xeons I have access to.

Without much introduction, here are the results for three systems: a dual Xeon 5430 (2.66 GHz), a single Xeon 3440 (2.53 GHz) and the new Sandy Bridge Xeon E3-1230 (3.2 GHz). It is immediately visible that the Sandy Bridge CPUs have noticeably higher clock rates, but at the same time a much smaller clock rate span (from 3.1 to 3.5 GHz, and the 3.5 GHz one is almost 6x more expensive than the 3.1 GHz one). Each system has a CPU architecture from a different generation.

The scores are scaled performance measurements of the operations: in the case of Hash-SHA256 and Zlib-Decompress, the score is in MB/s; for Zlib-Compress, it is MB/s scaled up by 10x (to get within the same order of magnitude as the other operations). Each score is the sum of the per-thread scores for completely parallel operation on all logical processors in the system (i.e. 8 for all these systems).
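As a rough illustration of the scoring arithmetic described above (not the suite's actual code), the sketch below sums per-thread throughput and applies the 10x display scale for compression. The function name `aggregate_score` and the per-thread numbers are made up for the example; they are not measured results.

```python
def aggregate_score(per_thread_mb_s, scale=1.0):
    """Sum per-thread throughput (MB/s) and apply an optional display scale."""
    return sum(per_thread_mb_s) * scale


# Hypothetical per-thread readings for a machine with 8 logical processors.
# These are placeholder values for illustration, not benchmark data.
hash_threads = [50.0] * 8
zlib_compress_threads = [1.5] * 8

# Hash-SHA256 and Zlib-Decompress are reported as plain MB/s.
hash_score = aggregate_score(hash_threads)

# Zlib-Compress is MB/s scaled up by 10x so it plots next to the others.
zlib_compress_score = aggregate_score(zlib_compress_threads, scale=10)

print("Hash-SHA256:", hash_score)
print("Zlib-Compress:", zlib_compress_score)
```

Because the score is a sum over threads, a system with more physical cores can post a higher total even when each individual thread is slower.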

Here we have two single-socket systems: the Sandy Bridge one (newest) and the Nehalem-class one, versus a dual-socket Core2-class system (the oldest). Several things immediately stand out:

  1. The dual-socket system (with 8 cores total) is usually not twice as fast as the single-socket ones (with 4 cores + hyperthreading).
  2. The Core2-class CPU is apparently very slow at zlib compression. My educated guess is that this is due to a combination of the sockets sharing the FSB and the relatively large penalty (almost 3x) that Intel CPUs earlier than the Nehalem generation have for unaligned memory write access. Another large advantage of the Nehalem (and its successor) comes from having an integrated memory controller and very low-latency memory access.
  3. The Sandy Bridge CPU has a performance curve similar to the Nehalem's. Presumably there is not much more that can be improved once the memory controller has moved into the CPU, so its performance gains come from frequency scaling and new SIMD instructions (which are not used in these benchmarks).

All of these benchmarks were run on FreeBSD 8-stable amd64.

If anyone wants to repeat the benchmarks, the code can be checked out from the hg repository:

hg clone http://cosmos.boldlygoingnowhere.org:81/~ivoras/hg/pysysbench

I have since added some more benchmark types. Python 3.1 is required to run the benchmark.

I am comfortable writing this in Python, as I am only benchmarking operations which are already optimized and implemented in C (hashes, compression) or are basically a series of syscalls. The impact of the Python interpreter is minimal.
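To make that concrete, here is a minimal single-threaded sketch of timing such operations from Python. It is not taken from pysysbench: the helper `throughput_mb_s`, the 1 MiB payload and the repeat count are assumptions chosen for illustration. The timed loop makes one Python call per iteration, and the hashing or compression itself runs in the C implementations behind `hashlib` and `zlib`.

```python
import hashlib
import time
import zlib


def throughput_mb_s(operation, payload, repeats=20):
    """Return MB/s achieved by calling operation(payload) `repeats` times."""
    start = time.perf_counter()
    for _ in range(repeats):
        operation(payload)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on a very coarse clock.
    elapsed = max(elapsed, 1e-9)
    return (len(payload) * repeats) / (1024 * 1024) / elapsed


def sha256_op(data):
    # One interpreter-level call; the hashing loop runs in C.
    hashlib.sha256(data).digest()


def zlib_compress_op(data):
    # Likewise, the compression work happens inside the zlib C library.
    zlib.compress(data)


# Hypothetical 1 MiB payload (an assumption, not the suite's test data).
PAYLOAD = bytes(range(256)) * 4096

if __name__ == "__main__":
    print("Hash-SHA256   MB/s:", throughput_mb_s(sha256_op, PAYLOAD))
    print("Zlib-Compress MB/s:", throughput_mb_s(zlib_compress_op, PAYLOAD))
```

With a payload this large, the per-call interpreter overhead is a handful of bytecode operations against a megabyte of C-level work, which is why the choice of Python should barely show up in the numbers.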

All content is © Ivan Voras.