Sequence Compression Benchmark



Sequence databases store massive amounts of data and keep growing rapidly. Storing and using such data efficiently is a major challenge. Most databases still rely on gzip for sequence compression, even though more efficient compressors are now available.

We work with massive sequence data every day, and waiting for data transfers is a headache. Decompression time also often slows down our data analysis pipelines. We'd like to minimize these costs by choosing the best compressor for each application.

We decided to benchmark available compressors and to explore these questions:


We benchmark lossless compression of FASTA files, without using reference sequences. Compression of FASTQ files or other data types, as well as compression with reference, are currently outside the scope of this project. Please see our Links page for other projects.

Note that some of the included compressors, particularly compressors for short reads in FASTQ format, are designed for a different task and under different assumptions than those used in our benchmark. Performance shown by such compressors in our benchmark should not be taken as indicative of their normal performance on the data they were designed for.


Although we made our best effort to conduct a fair and relevant benchmark, we provide no guarantees of any kind. Our results may be influenced by our hardware, test data and methodology, including the design of wrappers for compressors that need them. We recommend running your own tests on your own machine and with your own data before deciding which compressor to use.
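For readers who want to run such a test, here is a minimal sketch in Python. It is not the method used in this benchmark: it only covers the three general-purpose compressors bundled with the Python standard library (gzip, bz2, xz) at their default settings, it holds all data in memory, and the function name `benchmark` and the synthetic demo sequence are made up for illustration. Specialized sequence compressors would have to be run through their own command-line tools.

```python
import bz2
import gzip
import lzma
import time


def benchmark(data):
    """Compress and decompress `data` (bytes) with each stdlib codec.

    Returns one dict per codec with sizes, compression ratio and timings.
    Hypothetical helper for illustration; not part of any benchmark suite.
    """
    codecs = {
        "gzip": (gzip.compress, gzip.decompress),
        "bz2": (bz2.compress, bz2.decompress),
        "xz": (lzma.compress, lzma.decompress),
    }
    results = []
    for name, (compress, decompress) in codecs.items():
        t0 = time.perf_counter()
        packed = compress(data)
        t1 = time.perf_counter()
        restored = decompress(packed)
        t2 = time.perf_counter()
        # Lossless compression must give back the exact input.
        if restored != data:
            raise ValueError(name + " round trip did not restore the input")
        results.append({
            "name": name,
            "original_bytes": len(data),
            "compressed_bytes": len(packed),
            "ratio": len(data) / len(packed),
            "compress_seconds": t1 - t0,
            "decompress_seconds": t2 - t1,
        })
    return results


# Synthetic, highly repetitive FASTA record used only as a stand-in.
# Real sequences compress far less well, so use your own file instead,
# e.g. data = open("your_file.fa", "rb").read()  (hypothetical path).
demo = b">example_sequence\n" + b"ACGTTGCA" * 5000 + b"\n"
for row in benchmark(demo):
    print(f"{row['name']:>5}  ratio {row['ratio']:.1f}  "
          f"compress {row['compress_seconds']:.4f}s  "
          f"decompress {row['decompress_seconds']:.4f}s")
```

Timings from a single run like this are noisy. Repeating each measurement several times and testing files of realistic size gives a more reliable picture.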


Our own compressor (NAF) is included in this benchmark. It did not receive any special treatment or unfair advantage.

Latest changes

Next steps

Future plans

We will continue adding and updating compressors and datasets, and improving the presentation, as time permits. Suggestions are welcome!

Eventually we may also add FASTQ data.