The arrow of time

Ivan Voras' blog

Impact of malloc debugging in CURRENT

Before CURRENT FreeBSD code is blessed into a STABLE branch, it has many debugging options enabled that help developers track down and solve problems. The best known are WITNESS and INVARIANTS in the kernel and malloc debugging in the userland (libc), which is controlled by MALLOC_PRODUCTION. Each of these has a significant impact on the system's performance when configured for debugging.

WITNESS and INVARIANTS usually have the biggest impact on performance because they directly affect core functionality of the kernel. They are used so often that it has long been my habit to switch them off when I'm using the system for Real Work (tm). Malloc debugging is less straightforward to disable, since doing so usually requires recompiling and reinstalling libc, so I usually leave it on. It is turned off by defining MALLOC_PRODUCTION in src/lib/libc/stdlib/malloc.c at line 102. With debugging enabled, malloc(3) writes over every block of memory it allocates and runs additional sanity checks in its algorithms.

It turns out that the performance impact of malloc debugging can be extremely noticeable on some workloads. Here's a run of unixbench with malloc debugging on (but with WITNESS and INVARIANTS off, on a quad-core 2.4 GHz desktop CPU):

                             INDEX VALUES
TEST                                    BASELINE       RESULT    INDEX

Dhrystone 2 using register variables    116700.0   14033110.0   1202.5
Double-Precision Whetstone                  55.0       3047.8    554.1
Execl Throughput                            43.0       1838.0    427.4
File Copy 1024 bufsize 2000 maxblocks     3960.0      87304.0    220.5
File Copy 256 bufsize 500 maxblocks       1655.0     104302.0    630.2
File Copy 4096 bufsize 8000 maxblocks     5800.0      69077.0    119.1
Pipe Throughput                          12440.0    1141396.8    917.5
Pipe-based Context Switching              4000.0     134318.4    335.8
Process Creation                           126.0       4600.3    365.1
Shell Scripts (8 concurrent)                 6.0        120.8    201.3
System Call Overhead                     15000.0     827474.0    551.6
                                                             =========
FINAL SCORE                                                      412.5
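
As a reading aid for the table: each INDEX value appears to be ten times RESULT divided by BASELINE, and the final score is consistent with the geometric mean of the eleven indices. The short Python sketch below checks both relationships against the numbers above. The function names are made up for illustration and are not part of unixbench, and the two formulas are inferred from the figures rather than taken from the benchmark's source.

```python
import math

# (baseline, result, index) copied from the run with malloc debugging on
ROWS_DEBUG_ON = {
    "Dhrystone 2 using register variables": (116700.0, 14033110.0, 1202.5),
    "Double-Precision Whetstone": (55.0, 3047.8, 554.1),
    "Execl Throughput": (43.0, 1838.0, 427.4),
    "File Copy 1024 bufsize 2000 maxblocks": (3960.0, 87304.0, 220.5),
    "File Copy 256 bufsize 500 maxblocks": (1655.0, 104302.0, 630.2),
    "File Copy 4096 bufsize 8000 maxblocks": (5800.0, 69077.0, 119.1),
    "Pipe Throughput": (12440.0, 1141396.8, 917.5),
    "Pipe-based Context Switching": (4000.0, 134318.4, 335.8),
    "Process Creation": (126.0, 4600.3, 365.1),
    "Shell Scripts (8 concurrent)": (6.0, 120.8, 201.3),
    "System Call Overhead": (15000.0, 827474.0, 551.6),
}
FINAL_SCORE_DEBUG_ON = 412.5


def index_value(baseline, result):
    """Assumed relationship: index = 10 * result / baseline."""
    return 10.0 * result / baseline


def geometric_mean(values):
    """Geometric mean computed through logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


if __name__ == "__main__":
    for name, (baseline, result, index) in ROWS_DEBUG_ON.items():
        computed = index_value(baseline, result)
        print(f"{name:38s} table={index:7.1f} computed={computed:7.1f}")
    score = geometric_mean(index for _, _, index in ROWS_DEBUG_ON.values())
    print(f"final score: table={FINAL_SCORE_DEBUG_ON} computed={score:.1f}")
```

Because the final score is a geometric mean, a single test that changes by a large factor moves the overall score far less than it would move an arithmetic mean.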

And here's the same with malloc debugging off:

                             INDEX VALUES
TEST                                    BASELINE       RESULT    INDEX

Dhrystone 2 using register variables    116700.0   14012195.1   1200.7
Double-Precision Whetstone                  55.0       3051.6    554.8
Execl Throughput                            43.0       1824.3    424.3
File Copy 1024 bufsize 2000 maxblocks     3960.0      86489.0    218.4
File Copy 256 bufsize 500 maxblocks       1655.0     105133.0    635.2
File Copy 4096 bufsize 8000 maxblocks     5800.0      69004.0    119.0
Pipe Throughput                          12440.0    1141960.6    918.0
Pipe-based Context Switching              4000.0     128159.7    320.4
Process Creation                           126.0       4663.1    370.1
Shell Scripts (8 concurrent)                 6.0        924.5   1540.8
System Call Overhead                     15000.0     827695.6    551.8
                                                             =========
FINAL SCORE                                                      494.4

The biggest influence is on the "Shell Scripts (8 concurrent)" benchmark. This benchmark simulates shell scripts that do a lot of text processing with grep, sed, sort and similar utilities. I don't know exactly why those utilities would be so malloc-intensive, but apparently they are. Without malloc debugging this benchmark runs more than 7 times faster (an index of 1540.8 versus 201.3)!
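
To make the comparison explicit, here is a small Python sketch that divides each INDEX value from the run without malloc debugging by the matching value from the run with it. The numbers are copied from the two tables; the helper name `speedup` is hypothetical. It shows that the shell scripts test is the only one that moves substantially: every other test stays within about 5% either way, while the final score rises by roughly 20%.

```python
# INDEX values copied from the two tables: (malloc debugging on, off)
INDEX = {
    "Dhrystone 2 using register variables": (1202.5, 1200.7),
    "Double-Precision Whetstone": (554.1, 554.8),
    "Execl Throughput": (427.4, 424.3),
    "File Copy 1024 bufsize 2000 maxblocks": (220.5, 218.4),
    "File Copy 256 bufsize 500 maxblocks": (630.2, 635.2),
    "File Copy 4096 bufsize 8000 maxblocks": (119.1, 119.0),
    "Pipe Throughput": (917.5, 918.0),
    "Pipe-based Context Switching": (335.8, 320.4),
    "Process Creation": (365.1, 370.1),
    "Shell Scripts (8 concurrent)": (201.3, 1540.8),
    "System Call Overhead": (551.6, 551.8),
}
FINAL_SCORE = (412.5, 494.4)


def speedup(debug_on, debug_off):
    """How many times higher the score is without malloc debugging."""
    return debug_off / debug_on


# Ratio per test; values above 1.0 mean the run without debugging scored higher.
ratios = {name: speedup(on, off) for name, (on, off) in INDEX.items()}

if __name__ == "__main__":
    # Print the tests from the largest change to the smallest.
    for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
        print(f"{name:38s} {ratio:5.2f}x")
    print(f"{'FINAL SCORE':38s} {speedup(*FINAL_SCORE):5.2f}x")
```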

I'd also like to mention that so far, 8-CURRENT has probably been the most stable (as in: reliable, non-crashing) CURRENT ever. Though it is by definition full of recent and untested code, it's extremely stable and usable. I hope it stays that way after the pending big chunks of code (VIMAGE, Xen, new USB stack, new ZFS) get integrated.
