Sequence Compression Benchmark

Examples

Each of these examples uses the benchmark data to answer a specific question. After clicking each link, feel free to adjust the options to produce additional outputs or explore the data in more detail.

If you can think of other interesting questions to explore using this benchmark data, please send them to us for inclusion here.

General questions

Q: What is the entire dataset visualized by this benchmark?

Data:

Note that this data includes only the numbers, and does not contain extra information such as which compressors are free or open source.

Q: Which compressor provides the strongest compression?

Q: And how much space can be saved by switching from gzip to a better compressor?

Data:

Analysis: cmix probably gives the best compression. However, it is very slow, so it seems usable only with small data, whereas compression matters most for large data. For DNA sequences there are other strong but extremely slow compressors: XM, GeCo, DNA-COMPACT and JARVIS.

Of the remaining compressors, mfc-3, naf-22 and dlim seem to offer the strongest compression for genome data.

For non-genome, more repetitive data, naf-22, xz-e9 and brotli-11w30 offer strong compression.

The advantage of better compressors over gzip-9 is about 1.4 times for genome data and over 3 times for non-genome data.
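To translate such an advantage into disk space saved, the arithmetic can be sketched as follows. This is only an illustration: the function name is ours, and the inputs are the rounded figures quoted in the analysis, not exact benchmark values.

```python
def space_saved_fraction(advantage: float) -> float:
    """Fraction of space saved when the output is `advantage` times
    smaller than the baseline compressor's output."""
    if advantage <= 0:
        raise ValueError("advantage must be positive")
    return 1.0 - 1.0 / advantage


# Rounded figures from the analysis: ~1.4x (genome), 3x (non-genome).
for label, advantage in [("genome", 1.4), ("non-genome", 3.0)]:
    saved = space_saved_fraction(advantage)
    print(f"{label}: about {saved:.0%} smaller than gzip-9 output")
```

With these inputs, a 1.4 times advantage corresponds to files about 29% smaller than gzip-9 output, and a 3 times advantage to files about 67% smaller.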

Q: Which compressor provides the best Transfer+Decompression speed?

Q: In other words, which compressor is most suitable for compressing data in a public database?

Data:

Note: Link speed of 100 Mbit/sec is used for calculating the transfer time.

Interpretation: naf-22, zstd-22 and brotli-11 seem to perform the best overall.
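As a rough illustration of how a Transfer+Decompression figure can be derived from a compressed size and a decompression time, here is a minimal sketch. It assumes that 1 Mbit means 10^6 bits and that transfer and decompression run one after the other (so their times add up); the function names and the example numbers are hypothetical and not taken from the benchmark.

```python
# 100 Mbit/sec link, as stated in the note (assuming 1 Mbit = 10**6 bits).
LINK_SPEED_BITS_PER_SEC = 100e6


def transfer_time_sec(compressed_bytes: int,
                      link_bits_per_sec: float = LINK_SPEED_BITS_PER_SEC) -> float:
    """Time to send the compressed file over the link."""
    return compressed_bytes * 8 / link_bits_per_sec


def transfer_plus_decompression_sec(compressed_bytes: int,
                                    decompression_sec: float) -> float:
    """Total time, assuming transfer and decompression do not overlap."""
    return transfer_time_sec(compressed_bytes) + decompression_sec


# Hypothetical example: a 125 MB compressed file that decompresses in 2.5 s.
total = transfer_plus_decompression_sec(125_000_000, 2.5)
print(f"{total:.1f} s")  # 10.0 s transfer + 2.5 s decompression
```

Under this model a stronger but slower compressor wins only when the transfer time it saves exceeds the extra decompression time it costs.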

Q: Which compressor provides the best Compression+Transfer+Decompression speed?

Data:

Note: Link speed of 100 Mbit/sec is used for calculating the transfer time.

Interpretation: naf-1, zstd-3-4t and brotli-1 perform well.

Q: How much memory does the strongest setting of each compressor use?

Data:

Apparent memory leaks: Pufferfish, NUHT.

Q: How much memory does the fastest setting of each compressor use?

Data:

Q: What proportion of calculation time is spent in the wrappers?

We can understand this by comparing timing of an entire wrapped compressor with that of the corresponding "wrapper-only" run.
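The comparison described above amounts to a simple ratio. The sketch below is illustrative only: the function name and the timings are hypothetical, and it assumes the "wrapper-only" run does the same wrapper work as the full wrapped run.

```python
def wrapper_time_fraction(wrapped_sec: float, wrapper_only_sec: float) -> float:
    """Share of the wrapped compressor's run time attributable to the wrapper."""
    if wrapped_sec <= 0:
        raise ValueError("wrapped_sec must be positive")
    if wrapper_only_sec < 0:
        raise ValueError("wrapper_only_sec must not be negative")
    # Timing noise can make the wrapper-only run appear slower than the
    # full run, so the result is capped at 100%.
    return min(wrapper_only_sec / wrapped_sec, 1.0)


# Hypothetical timings: full wrapped run 8.0 s, wrapper-only run 2.0 s.
fraction = wrapper_time_fraction(8.0, 2.0)
print(f"wrapper share: {fraction:.0%}")
```

A share close to 100% would mean the measured time mostly reflects the wrapper rather than the compressor itself.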

Data for sequence compressors:

2bit Compression time Decompression time
ac (ac-fa) Compression time Decompression time
ac (ac-seq) Compression time Decompression time
blast Compression time Decompression time
dcom Compression time Decompression time
dnax Compression time Decompression time
geco Compression time Decompression time
jarvis Compression time Decompression time
nuht Compression time Decompression time
pfish Compression time Decompression time
uht Compression time Decompression time
xm Compression time Decompression time

Data for FASTQ compressors:

alapy Compression time Decompression time
beetl Compression time Decompression time
dsrc Compression time Decompression time
fastqz Compression time Decompression time
fqs Compression time Decompression time
fqzcomp Compression time Decompression time
gtz Compression time Decompression time
harc Compression time Decompression time
kic Compression time Decompression time
leon Compression time Decompression time
lfastqc Compression time Decompression time
lfqc Compression time Decompression time
minicom Compression time Decompression time
quip Compression time Decompression time
spring (spring-l) Compression time Decompression time
spring (spring-s) Compression time Decompression time

(Note that GTZ is disproportionately slow on small data, most likely because it phones home.)


Questions about specific compressors

Q: Does zpaq benefit from multiple CPU cores?

Data:

Analysis: Yes, starting from a certain data size: ~20 MB for level 1, and ~70 MB for levels 2-5. A speedup of up to 3.7 times can be reached on our data. With level 1, however, the 4-core advantage is sensitive to the redundancy of the data: it disappears for highly repetitive data.
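For reference, speedup and parallel efficiency can be computed as in the sketch below. The function names and the timings are hypothetical, chosen only so that the result matches the 3.7 times figure quoted above for 4 cores.

```python
def speedup(single_core_sec: float, multi_core_sec: float) -> float:
    """How many times faster the multi-core run is than the single-core run."""
    if single_core_sec <= 0 or multi_core_sec <= 0:
        raise ValueError("timings must be positive")
    return single_core_sec / multi_core_sec


def parallel_efficiency(speedup_factor: float, cores: int) -> float:
    """Speedup divided by core count; 1.0 would be perfect scaling."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return speedup_factor / cores


# Hypothetical timings: 37.0 s on one core, 10.0 s on four cores.
s = speedup(37.0, 10.0)
print(f"speedup: {s:.1f}x, efficiency: {parallel_efficiency(s, 4):.0%}")
```

A 3.7 times speedup on 4 cores corresponds to an efficiency of roughly 93%, which is close to ideal scaling.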

Q: How does single-core pbzip2 compare to bzip2 (level 9)?

Data:

Analysis: At compression level 9, pbzip2 has the same compression strength and compression speed as bzip2. Interestingly, its decompression is 10 times faster. It uses 2 times more memory during compression and 20 times more memory during decompression.