The arrow of time

Ivan Voras' blog

pbzip2 and tar

I have several large nightly backups that run almost the whole night, and I was searching for ways to improve both the speed and the archive size. The difference in archive size between bzip2 and gzip is large enough to justify using bzip2, and fortunately there's something that can improve its horrible performance: pbzip2. I also thought that sorting the files by their extension, like Windows compressors do, would reduce the final archive size, but that didn't work out very well.

Parallel bzip2 (pbzip2) is a project that uses libbzip2 to compress the data, but splits the input into chunks that are compressed individually by different threads. This leads to very large performance improvements (practically linear in the number of CPUs) on multi-CPU systems. Unfortunately, it doesn't implement compressing from stdin (only files are supported), so I needed to modify it. The patch slightly modifies the compression code so that it no longer depends on the input file size, which allows it to read data from arbitrary streams. I've contacted the author and the patch will be included in the next version of pbzip2 (whenever it gets released). The patched pbzip2 needs two arguments to act as a compression filter: "-c -", with the "-" being the magic filename for stdin (as is the convention in many Unix utilities).
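To make the chunking idea concrete, here is a minimal Python sketch of it. This is not pbzip2's code: the function names, the chunk size and the worker count are all made up for illustration, and it relies on the assumption that CPython's bz2 module releases the interpreter lock while compressing, so the threads can really run in parallel. Each chunk becomes a complete bzip2 stream of its own, and the streams are written out in their original order.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor

# Assumption: 900,000 bytes, roughly one bzip2 block at the default level 9.
CHUNK_SIZE = 900 * 1000


def read_chunks(stream, chunk_size=CHUNK_SIZE):
    """Yield chunks from any readable binary stream (file, pipe, stdin)."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


def parallel_bzip2(src, dst, workers=4, chunk_size=CHUNK_SIZE):
    """Compress src into dst as a series of independent bzip2 streams."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() hands results back in input order, so the output stays in
        # sequence. Note that it submits every chunk up front, so this
        # sketch buffers the whole input; a real tool would bound the queue.
        for compressed in pool.map(bz2.compress, read_chunks(src, chunk_size)):
            dst.write(compressed)
```

Because the reader never asks for the input size, the same function works on a pipe, for example parallel_bzip2(sys.stdin.buffer, sys.stdout.buffer).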

Archives produced by pbzip2 are compatible with regular bzip2, but apparently bsdtar (or more correctly libarchive) doesn't understand multi-stream bzip2 files so it can't be used to decompress them directly. Piping the data through bzip2 or pbzip2 works. I've contacted bsdtar's author and this should be fixed soon.
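What "multi-stream" means for a reader can be shown with Python's bz2 module. I don't know how libarchive's reader is written, so this only illustrates the general issue: a decoder that stops at the first end-of-stream marker sees just the first chunk, and has to be restarted on the remaining bytes. The function name below is made up for the example.

```python
import bz2


def decompress_multistream(data):
    """Decompress a buffer holding one or more concatenated bzip2 streams."""
    parts = []
    while data:
        # One BZ2Decompressor object decodes exactly one stream.
        decoder = bz2.BZ2Decompressor()
        parts.append(decoder.decompress(data))
        if not decoder.eof:
            raise ValueError("truncated bzip2 stream")
        # Whatever follows the end-of-stream marker is the next stream.
        data = decoder.unused_data
    return b"".join(parts)
```

Without the loop, only the output of the first stream would be returned and the rest of the archive would be silently ignored.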

Finally, I tested whether sorting the files by their extension before compressing them would noticeably improve the compression ratio. The idea came from 7-zip, which among its many compression modes implements bzip2 for its internal file format, in a way that also supports multithreading, with excellent performance (but it is useless for backups since it doesn't archive Unix attributes: ownership and modes). Unfortunately, bzip2 produces larger archives than 7-zip's own LZMA algorithm, which isn't very MP-scalable. Still, it gave me the idea to at least try the sorting, so I created a script that collects the filenames, sorts them by extension and then passes them as a list to tar. This didn't work out: the archives produced this way are smaller by only a very small fraction of their size. It's an improvement, but not a great one. Anyway, if someone's interested, here's the sorting script.
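The original script is not reproduced here. As a rough idea of what such a script does, the following Python sketch (not the author's code; the function names are invented for the example) walks a directory, sorts the file paths by extension and writes them one per line, which is the form a tar file-list option expects.

```python
import os


def extension_key(path):
    """Sort key: lower-cased extension first, then the full path."""
    ext = os.path.splitext(path)[1].lower()
    return (ext, path)


def files_sorted_by_extension(root):
    """Return every path os.walk reports as a file under root, grouped by extension."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths, key=extension_key)


def write_file_list(root, out):
    """Write the sorted paths to a text stream, one per line."""
    for path in files_sorted_by_extension(root):
        out.write(path + "\n")
```

The resulting list could then be handed to tar through its file-list option (-T in both GNU tar and bsdtar, as far as I know). One caveat of this approach is that only files are listed, so empty directories and the attributes of the directories themselves would not end up in the archive.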

great

Added on 2008-10-26T07:14 by rorya

Awesome work, Ivan, keep it up.

Comments are subject to moderation and will be deleted if deemed inappropriate. All content is © Ivan Voras. Comments are owned by their authors (who agree to basically surrender all rights by publishing them :) ).