Sequence Compression Benchmark

About

Why?

Sequence databases store massive amounts of data and keep growing rapidly. Storing and using such data efficiently is a major challenge. Most databases still depend on gzip for sequence compression, even though more efficient compressors are now available.

We work with massive sequence data every day, and waiting for data transfers is a headache. Decompression time also often slows down our data analysis pipelines. We would like to minimize such costs by choosing the best compressor for each application.

We decided to benchmark available compressors and to explore these questions:

How do different compressors perform on common sequence datasets (DNA, RNA and protein)?
Which compressor is best for sequence archival (has the highest compression ratio)?
Which compressor is best for databases (gives the shortest transfer + decompression time)?
Which compressor is best for one-time data transfer (gives the shortest compression + transfer + decompression time)?
How much can be gained by switching from gzip to a more efficient compressor?
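
The three criteria in these questions can be written as simple formulas. The following is a minimal sketch, not the benchmark's actual code: the `Measurement` class, the function names, and all numbers are hypothetical and chosen only for illustration. It assumes transfer time is the compressed size divided by a fixed link speed.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    """Hypothetical record of one compressor run on one dataset."""
    original_bytes: int
    compressed_bytes: int
    compress_seconds: float
    decompress_seconds: float


def compression_ratio(m: Measurement) -> float:
    # Archival criterion: original size divided by compressed size.
    return m.original_bytes / m.compressed_bytes


def transfer_seconds(m: Measurement, link_bytes_per_second: float) -> float:
    # Assumption: transfer time depends only on compressed size and link speed.
    return m.compressed_bytes / link_bytes_per_second


def database_cost(m: Measurement, link_bytes_per_second: float) -> float:
    # Database criterion: data is compressed once and downloaded many times,
    # so only transfer + decompression time counts.
    return transfer_seconds(m, link_bytes_per_second) + m.decompress_seconds


def one_time_transfer_cost(m: Measurement, link_bytes_per_second: float) -> float:
    # One-time transfer criterion: compression time counts as well.
    return (m.compress_seconds
            + transfer_seconds(m, link_bytes_per_second)
            + m.decompress_seconds)


# Made-up example: a 1 GB file compressed to 250 MB, sent over 100 Mbit/s.
example = Measurement(
    original_bytes=1_000_000_000,
    compressed_bytes=250_000_000,
    compress_seconds=40.0,
    decompress_seconds=5.0,
)
link_speed = 12_500_000  # bytes per second (100 Mbit/s)

print(compression_ratio(example))                   # 4.0
print(database_cost(example, link_speed))           # 25.0
print(one_time_transfer_cost(example, link_speed))  # 65.0
```

Because the link speed enters both time-based criteria, the best compressor can differ between a fast and a slow connection even when the measurements themselves are identical.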

Scope

We benchmark lossless compression of FASTA files, without using reference sequences. Compression of FASTQ files or other data types, as well as compression with reference, are currently outside the scope of this project. Please see our Links page for other projects.
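
To illustrate what "lossless" and "without reference" mean here, the sketch below compresses a small FASTA record using only the data itself and checks that decompression returns exactly the same bytes. It is an illustration under stated assumptions, not part of the benchmark: the FASTA record is a hypothetical toy example, and Python's standard `gzip` and `lzma` modules stand in for any pair of general-purpose compressors.

```python
import gzip
import lzma

# Hypothetical toy FASTA record; highly repetitive, so not representative
# of real sequence data.
fasta = (">seq1 hypothetical example record\n" + "ACGT" * 2500 + "\n").encode("ascii")


def round_trip(compress, decompress, data: bytes) -> float:
    """Compress, verify the round trip is lossless, and return the ratio."""
    packed = compress(data)
    # Lossless: decompression must reproduce the input byte for byte.
    assert decompress(packed) == data
    return len(data) / len(packed)


gzip_ratio = round_trip(gzip.compress, gzip.decompress, fasta)
xz_ratio = round_trip(lzma.compress, lzma.decompress, fasta)

print(f"gzip ratio: {gzip_ratio:.1f}")
print(f"xz ratio:   {xz_ratio:.1f}")
```

The ratios printed for this toy input say nothing about performance on real datasets; the point is only the byte-for-byte round-trip check.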

Note that some of the included compressors, particularly compressors for short reads in FASTQ format, are designed for a different task and under different assumptions than those used in our benchmark. Performance shown by such compressors in our benchmark should not be taken as indicative of their normal performance on the data they were designed for.

Disclaimer

Although we made our best effort to conduct a fair and relevant benchmark, we provide no guarantees of any kind. Our results may be influenced by our hardware, test data and methodology, including the design of wrappers for compressors that need them. We recommend doing your own tests on your own machine and with your own data before deciding which compressor to use.

Disclosure

Our own compressor (NAF) is included in this benchmark. It did not receive any special treatment or unfair advantage.

Latest changes

Added protein sequence compressor AC.
Added two viral datasets: "NCBI Virus RefSeq Protein" and "NCBI Virus Complete Nucleotide Human".
Added general purpose compressor BriefLZ.
Added specialized compressors: Minicom, FQSqueezer.
Updated zstd, GTZ, fqzcomp, Nakamichi.
Added "Missing Compressors" page.

Next steps

Benchmark GeCo3.

Future plans

We will continue adding and updating compressors and datasets, and improving the presentation, as time permits. Suggestions are welcome!

Eventually we may also add FASTQ data.

Citation

Kirill Kryukov, Mahoko Takahashi Ueda, So Nakagawa, Tadashi Imanishi (2020) "Sequence Compression Benchmark (SCB) database — A comprehensive evaluation of reference-free compressors for FASTA-formatted sequences" GigaScience, Volume 9, Issue 7, July 2020, giaa072, doi:10.1093/gigascience/giaa072.