Sequence Compression Benchmark

Method

Benchmark machine

CPU: dual Xeon E5-2643v3 (3.4 GHz, 6 cores), hyperthreading: off
RAM: 128 GB DDR4-2133 ECC Registered
Storage: 4 x 2 TB SSD, in RAID 0, XFS filesystem, block size: 4096 bytes (blockdev --getbsz)
OS: Ubuntu 18.04.1 LTS, kernel: 4.15.0
GCC: 7.4.0

What compressor/dataset combinations were tested?

Each setting of each compressor is tested on every test dataset, except when it's difficult or impossible due to compressor limitations:

Due to their extreme slowness, these compressors are not tested on any data larger than 10 MB: cmix, DNA-COMPACT, FQSqueezer, GeCo, JARVIS, Leon, UHT and XM.
BLAST, 2bit and Pufferfish don't support alignments.
2bit, DELIMINATE, MFCompress and Pufferfish don't support protein sequences.
Some settings of XM crash and/or produce wrong decompressed output on some data - such results are not included.

Benchmark process

The entire benchmark is orchestrated by a Perl script. This script loads the lists of compressor settings and test data, then tests each combination whose measurements are still missing from the output directory. For each such combination (of compressor setting and test dataset), the following steps are performed:

1. Compression is performed by piping the test data into the compressor. Compressed size and compression time are recorded. For compressed formats consisting of multiple files, the sizes of all files are summed.
2. If compression time did not exceed 10 seconds, 9 more compression runs are performed and timed. Compressed data from the previous run is deleted before each subsequent run.
3. The next set of compression runs is performed to measure peak memory consumption. This set consists of the same number of runs as in steps 1-2 (either 1 or 10 runs). That is, for fast compressors and for small data the measurement is repeated 10 times.
4. A decompression test run is performed. In this run the decompressed data is piped to the md5sum -b - command. The resulting MD5 signature is compared with that of the original file. In case of any mismatch, this combination of compressor setting and dataset is disqualified and its measurements are discarded.
5. Decompression time is measured. This time the decompressed data is piped to /dev/null.
6. If decompression completed within 10 seconds, 9 more decompression runs are performed and timed.
7. Peak decompression memory is measured. The number of runs is the same as in steps 5-6.
8. The measurements are stored to a file. All compressed and temporary files are removed.
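
The repetition rule used for the timed runs (one run, then 9 more only if the first did not exceed 10 seconds) can be sketched as follows. This is a minimal Python illustration, not the benchmark's Perl code: the function name timed_runs, its parameters, and the use of time.perf_counter as the clock are assumptions made for this example.

```python
import time

THRESHOLD_S = 10.0  # from the text: repeat only if the first run took at most 10 s
EXTRA_RUNS = 9      # from the text: 9 more runs, 10 in total


def timed_runs(task, threshold_s=THRESHOLD_S, extra_runs=EXTRA_RUNS,
               clock=time.perf_counter):
    """Run task() once; if it did not exceed threshold_s, run it extra_runs more times.

    Returns the list of wall-clock durations in seconds (1 or 1 + extra_runs values).
    `task` is a hypothetical callable standing in for one compression or
    decompression run; `clock` is injectable so the logic can be tested.
    """
    def one_run():
        start = clock()
        task()
        return clock() - start

    durations = [one_run()]
    if durations[0] <= threshold_s:
        durations.extend(one_run() for _ in range(extra_runs))
    return durations
```

A caller would pass a function that launches the compressor and waits for it to finish, then average the returned list when it holds 10 values.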

How was time measured?

Wall clock time was measured using Perl's Time::HiRes module (gettimeofday and tv_interval subroutines). The resulting time was recorded with millisecond precision.

How was the peak memory measured?

First, the compression command is stored in a temporary shell script file. Then it is executed via GNU Time, as /usr/bin/time -v cmd.sh >output.txt. "Maximum resident set size" value is extracted from the output. 1638 is then subtracted from this value and the result is stored as peak memory measurement. 1638 is the average "Maximum resident set size" measured by GNU Time in the same way for an empty script.
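
The extraction and baseline correction described above can be sketched in Python. The function name peak_memory_kb and its parameters are hypothetical; the only details taken from the text are the "Maximum resident set size" label and the baseline of 1638. The sketch assumes the report produced by /usr/bin/time -v has already been read into a string.

```python
import re

BASELINE_KB = 1638  # from the text: average max RSS measured for an empty script


def peak_memory_kb(time_output, baseline_kb=BASELINE_KB):
    """Extract 'Maximum resident set size' from a GNU Time -v report
    and subtract the empty-script baseline."""
    match = re.search(r"Maximum resident set size \(kbytes\):\s*(\d+)", time_output)
    if match is None:
        raise ValueError("no 'Maximum resident set size' line found")
    return int(match.group(1)) - baseline_kb
```

For a report containing "Maximum resident set size (kbytes): 5000", the function returns 3362.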

Why not measure memory consumption and time simultaneously?

Because measuring memory makes the task noticeably slower, especially for very fast tasks. Of course, the downside of separate measurement is that the benchmark takes twice as long, but we decided that accurate timing results are worth it.

What measurements are collected for each test?

Compressed size (in bytes)
Compression time (in milliseconds)
Decompression time (in milliseconds)
Peak compression memory (in GNU Time's "Kbytes")
Peak decompression memory (in GNU Time's "Kbytes")

In cases where 10 values are collected, the benchmark website uses their average.

How are the other numbers computed?

Compressed size relative to original (%) = Compressed size / Uncompressed size * 100
Compression ratio (times) = Uncompressed size / Compressed size
Compression speed (MB/s) = Uncompressed size in MB / Compression time
Decompression speed (MB/s) = Uncompressed size in MB / Decompression time
Compression + decompression time (s) = Compression time + Decompression time
Compression + decompression speed (MB/s) = Uncompressed size in MB / (Compression time + Decompression time)
Transfer time (s) = Compressed size / Link speed in B/s
Transfer speed (MB/s) = Uncompressed size in MB / Transfer time
Transfer + decompression time (s) = Transfer time + Decompression time
Transfer + decompression speed (MB/s) = Uncompressed size in MB / (Transfer time + Decompression time)
Compression + transfer + decompression time (s) = Compression time + Transfer time + Decompression time
Compression + transfer + decompression speed (MB/s) = Uncompressed size in MB / (Compression time + Transfer time + Decompression time)
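
As a worked illustration of these formulas, here is a small Python sketch. The function name, the dictionary keys, and the definition of 1 MB as 1,000,000 bytes are assumptions for this example. It also assumes that the compressed file is what travels over the link, and that the stored millisecond timings are converted to seconds first.

```python
def derived_metrics(uncompressed_bytes, compressed_bytes,
                    compression_ms, decompression_ms,
                    link_bytes_per_s, bytes_per_mb=1_000_000):
    """Compute the derived numbers from the raw measurements.

    bytes_per_mb is an assumption; the text does not say which MB is meant.
    """
    size_mb = uncompressed_bytes / bytes_per_mb
    c_s = compression_ms / 1000    # stored times are in milliseconds
    d_s = decompression_ms / 1000
    t_s = compressed_bytes / link_bytes_per_s  # assumes compressed data is sent
    return {
        "relative_size_percent": compressed_bytes / uncompressed_bytes * 100,
        "ratio": uncompressed_bytes / compressed_bytes,
        "compression_speed_mb_s": size_mb / c_s,
        "decompression_speed_mb_s": size_mb / d_s,
        "cd_time_s": c_s + d_s,
        "cd_speed_mb_s": size_mb / (c_s + d_s),
        "transfer_time_s": t_s,
        "transfer_speed_mb_s": size_mb / t_s,
        "td_time_s": t_s + d_s,
        "td_speed_mb_s": size_mb / (t_s + d_s),
        "ctd_time_s": c_s + t_s + d_s,
        "ctd_speed_mb_s": size_mb / (c_s + t_s + d_s),
    }
```

For example, a 100 MB dataset compressed to 25 MB in 4 s, decompressed in 1 s, and sent over a 12.5 MB/s link gives a compression ratio of 4, a transfer time of 2 s, and a compression + transfer + decompression time of 7 s.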

Why not always perform the same number of runs in all cases?

A variable number of runs is the only way to have both accurate measurements and large test data (under the constraints of using one test machine and running the benchmark within a reasonable time).

On one hand, the benchmark takes a lot of time, so much that some compressors can't be tested at all on datasets larger than 10 MB in a reasonable time. Repeating every measurement 10 times is therefore impractical, or it would mean restricting the test data to small datasets only.

On the other hand, measurements are slightly noisy. The shorter the measured time, the noisier the measurement. Thus for very quick runs, repeating the measurement substantially suppresses noise. For longer runs it does not make much difference, because the relative error is already small.

A threshold of 10 seconds seems a reasonable compromise between suppressing noise and including larger test data (and slow compressors).

Are there other ways to reduce measurement noise?

Other ways that we are using:

Disabling hyperthreading.
Not running any other tasks while benchmark is running.
Running only one compression or decompression task at a time (which unfortunately means that most cores of the machine are idle while the benchmark is running).
Running the benchmark with high priority (nice -n -20 and ionice -c1).
Having enough RAM so that the data being compressed or decompressed is always already cached in memory when running compression or decompression tasks.
Piping decompressed data to /dev/null during measurements.
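
The priority settings above could be applied to a command as in the following Python sketch. The helper name with_high_priority is hypothetical, and how the actual benchmark script applies these settings is not stated here. Note that negative nice values and the realtime I/O class normally require elevated privileges.

```python
def with_high_priority(command):
    """Prefix a command (list of arguments) with the nice and ionice
    settings mentioned in the text: nice -n -20 and ionice -c1."""
    return ["nice", "-n", "-20", "ionice", "-c1"] + list(command)
```

The resulting list can be passed to a process launcher such as subprocess.run.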

Further improvement could be achieved by using multiple machines to collect a larger sample. We may explore this in the future.

Is the benchmark script available?

Yes, here:

benchmark-script.zip

It's provided for reference only; use at your own risk.