Sequence Compression Benchmark

Criteria

We include compressors suitable for practical tasks of sequence compression. Specifically, we benchmark lossless reference-free compression of well-formed FASTA files. We use all specialized sequence compressors that we could find and get to work. For general-purpose compressors, we use only the major ones in terms of performance, historical importance, or popularity. Suggestions for adding more compressors are welcome!

Any included compressors must:

Below is the list of all tested compressors with brief comments. However, please check the benchmark data for a more complete picture. Better yet, install and evaluate any promising compressors on your own machine and with your own data.
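For evaluating compressors on your own data, a minimal sketch of a round-trip check is shown below. It is not the harness used for this benchmark. The codecs are stand-ins from the Python standard library, and the function name evaluate and the synthetic sample are assumptions made for illustration only.

```python
import bz2
import gzip
import lzma
import time

# Stand-in codecs from the Python standard library (an assumption for this
# sketch); swap in whichever compressors you actually want to evaluate.
CODECS = {
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
    "xz": (lzma.compress, lzma.decompress),
}


def evaluate(data: bytes) -> list:
    """Compress and decompress `data` with every codec and verify the round trip."""
    results = []
    for name, (compress, decompress) in CODECS.items():
        start = time.perf_counter()
        packed = compress(data)
        middle = time.perf_counter()
        restored = decompress(packed)
        end = time.perf_counter()
        results.append({
            "name": name,
            "ratio": len(data) / len(packed),  # higher means smaller output
            "compress_s": middle - start,
            "decompress_s": end - middle,
            "lossless": restored == data,      # must be True for a valid run
        })
    return results


if __name__ == "__main__":
    # Synthetic FASTA used only for illustration; replace with your own data.
    sample = b">seq1\n" + b"ACGTTGCAACGTAGCT" * 4096 + b"\n"
    for row in evaluate(sample):
        print(row)
```

Timings from a single in-memory run like this are noisy, so repeating the measurement and checking memory use separately is advisable before drawing conclusions.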

The Missing Compressors page lists compressors that are not included.


Jump to:

Specialized compressors: 2bit ac alapy beetl blast dcom dlim dnax dsrc fastqz fqs fqzcomp geco gtz harc jarvis kic leon lfastqc lfqc mfc minicom naf nuht pfish quip spring uht xm

General-purpose compressors: bcm brieflz brotli bsc bzip2 cmix copy gzip lizard lz4 lzop lzturbo nakamichi pbzip2 pigz snzip xz zpaq zpipe zstd


Specialized compressors

2bit

2bit is a database format used by BLAT: https://genome.ucsc.edu/FAQ/FAQblat. It used to be limited to 4 GB input, but support for longer input was recently added with the "-long" switch.

Version tested: "faToTwoBit" and "twoBitToFa" binaries dated 2018-11-07, from UCSC: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/.

Comments: No support for RNA or protein sequences. Requires a wrapper to preserve sequence names, line lengths and IUPAC characters. Non-free to use: "Blat source and executables are freely available for academic, nonprofit and personal use. Commercial licensing information is available on the Kent Informatics website." - from FAQ: https://genome.ucsc.edu/FAQ/FAQblat
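To illustrate what such a wrapper has to do, here is a minimal sketch that separates a FASTA file into names, per-record lengths, line width and a bare sequence stream, and then rebuilds it. This is not the wrapper used in the benchmark. The function names split_fasta and join_fasta are hypothetical, and the sketch assumes well-formed FASTA with one uniform line width. IUPAC characters would need a further side channel that is not shown here.

```python
def split_fasta(text: str):
    """Split FASTA text into (names, lengths, line_width, concatenated sequence).

    Assumes well-formed input: every record starts with '>' and all sequence
    lines except the last one of each record share the same width.
    """
    names, lengths, chunks = [], [], []
    line_width = 0
    for line in text.splitlines():
        if line.startswith(">"):
            names.append(line[1:])
            lengths.append(0)
        elif names:
            line_width = max(line_width, len(line))
            lengths[-1] += len(line)
            chunks.append(line)
    return names, lengths, line_width, "".join(chunks)


def join_fasta(names, lengths, line_width, sequence: str) -> str:
    """Rebuild FASTA text from the parts produced by split_fasta."""
    width = line_width or 1  # avoid a zero step when all records are empty
    out, pos = [], 0
    for name, length in zip(names, lengths):
        out.append(">" + name)
        record = sequence[pos:pos + length]
        pos += length
        for i in range(0, length, width):
            out.append(record[i:i + width])
    return "".join(line + "\n" for line in out)
```

In this scheme only the bare sequence stream would be handed to the sequence-only tool, while names, lengths and line width are stored separately alongside its output.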

ac

Version tested: 1.1, 2020-01-29, commit fc136fc, built from source.

alapy

"ALAPY Compressor - is a cross-platform software tool used for efficient compression of NGS data. Latest version utilizes lossless compression algorithm developed by our data scientists for fastq files and optimized for the latest sequencing machines from Illumina."

Version tested: 1.3.0, 2017-07-25, binary from GitHub.

Comments: Closed source and non-free. Limited to one instance at a time. alapy-b (alapy_arc -l b) performs nearly identically to fastqz-slow (fastqz c).

beetl

BEETL: Burrows-Wheeler Extended Tool Library, from Illumina.

Version tested: commit 327cc65, 2019-11-14, built from source.

Comments: Requires sequences of identical length. Works only on short sequences. Not a complete compressor - it only computes BWT which then has to be compressed with another compressor (zstd in this benchmark).
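The two-stage idea (compute a BWT, then hand it to a general compressor) can be sketched as follows. This is a naive single-string transform for illustration, not BEETL's algorithm, which works on collections of reads. zlib stands in for zstd because it ships with Python, and all function names are hypothetical.

```python
import zlib


def bwt(read: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler transform via sorted rotations (illustration only).

    The sentinel must not occur in `read` and must sort before its characters.
    """
    s = read + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, sentinel: str = "$") -> str:
    """Invert the transform by repeatedly prepending and sorting (slow but simple)."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(c + row for c, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


def compress_via_bwt(read: str) -> bytes:
    """Stage 1: BWT. Stage 2: a general-purpose compressor (zlib as a stand-in)."""
    return zlib.compress(bwt(read).encode("ascii"))


def decompress_via_bwt(blob: bytes) -> str:
    """Undo both stages in reverse order."""
    return inverse_bwt(zlib.decompress(blob).decode("ascii"))
```

The transform itself does not shrink the data; it only groups similar characters so that the second-stage compressor has an easier job.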

blast

Database format of BLAST, the most popular homology search tool.

Version tested: "convert2blastmask", "makeblastdb" and "blastdbcmd" binaries from BLAST 2.8.1+, 2018-11-26, 64-bit Linux binaries from FTP: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.8.1/ncbi-blast-2.8.1+-x64-linux.tar.gz

Comments: Does not preserve line length and does not support RNA. It is a multi-file format: even tiny data is represented by several files when converted to BLAST format.

dcom

DNA-COMPACT is a DNA-only compressor. It can compress with and without reference. (Only compression without reference was tested here).

Version tested: Built from the latest public source on Sourceforge (https://sourceforge.net/projects/dnacompact/files/), dated 2013-08-29.

Comments: Does not support FASTA format, expects a single nameless DNA sequence as input. Basic functionality has to be added via a wrapper. This includes support for FASTA input and output, 'N' and IUPAC codes, and masked sequence. Creates a temporary file 'tmphuff.txt' in the current directory. This causes problems when running multiple DNA-COMPACT compression tasks in parallel: some of them crash and produce corrupted compressed files. Needs 2 decompression commands. Fails with "Segmentation fault" unless "ulimit -s unlimited" is run beforehand. (All these issues are also worked around in the wrapper.)
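One common way to work around tools that write fixed-name temporary files into the current directory is to run each task in its own private directory. The sketch below shows that idea only. It is an assumption about how such a workaround could look, not the benchmark's actual wrapper, and the function name run_in_private_dir and the stand-in command are hypothetical.

```python
import os
import subprocess
import sys
import tempfile


def run_in_private_dir(command: list, produced_file: str) -> bytes:
    """Run `command` inside a fresh temporary directory and return the bytes of
    `produced_file`, which the command is expected to create there.

    Scratch files such as 'tmphuff.txt' land in the private directory, so
    parallel tasks cannot overwrite each other's files. The directory is
    removed when the function returns.
    """
    with tempfile.TemporaryDirectory() as workdir:
        subprocess.run(command, cwd=workdir, check=True)
        with open(os.path.join(workdir, produced_file), "rb") as handle:
            return handle.read()


if __name__ == "__main__":
    # Stand-in for a real compressor: writes a scratch file and an output file
    # into its current directory, like the tools described above.
    fake_tool = [
        sys.executable, "-c",
        "open('tmphuff.txt', 'w').write('scratch');"
        "open('out.bin', 'wb').write(b'ok')",
    ]
    print(run_in_private_dir(fake_tool, "out.bin"))
```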

dlim

Version tested: Version 1.3c, binary received from authors by email.

Comments: No website - the only way to obtain this compressor is by contacting the authors. Closed source. Not free to use: "Kindly note that this tool is free for academic use. In case you plan to use it commercially, kindly get in touch with [authors]" - email from one of the authors. Creates temporary files in the current directory, which causes crashes when running parallel compression tasks. Has no streaming mode. Relies on "7za" binary.

dnax

Version tested: dnaX 0.1.0, source received from authors by email and built using bundled makefile.

Comments: No website - the only way to obtain this compressor is by contacting the authors. Creates temporary files in the current directory. Has no streaming mode. Does not support FASTA format, N, IUPAC codes, mask, RNA and protein sequences. Memory consumption grows proportionally to data size. Does not support data larger than 2 GB. Crashes on some data (reported to author). dnax-0 (dna0) freezes while decompressing a poly-A repeat of length 100. dnax-1 (dna1) freezes while decompressing a poly-A repeat of length 333. dnax-1 and dnax-2 corrupt data when compressing a poly-A repeat of length 333.

dsrc

Version tested: "2.02 @ 30.09.2014", commit 5eda82c, 2015-06-04, built with make -f Makefile.c++11 bin.

Comments: Corrupts data if input contains non-ACGT characters (such as "H") (Issue #24). Crashes on input containing single read (Issue #26).

fastqz

Version tested: 1.5, 2012-03-15, obtained from GitHub mirror, commit 39b2bbc, built after changing -lpthread to -pthread in Makefile.

Comments: It's tuned to specific distribution of qualities.

fqs

"We present FQSqueezer, a novel compression algorithm for sequencing data able to process single- and paired-end reads of variable lengths. It is based on the ideas from the famous prediction by partial matching and dynamic Markov coder algorithms known from the general-purpose-compressors world."

Version tested: FQSqueezer 0.1, commit 5741fc5, 2019-05-17.

fqzcomp

Version tested: 4.6, commit 96f2f61, 2019-12-02.

Comments: fqzcomp -s9 crashes during compression (Issue #2).

geco

GeCo version tested: v.2.1, 2016-12-24, built from source, commit 5569304.

GeCo2 version tested: v.1.1, 2019-02-02, built from source, commit 062a8c0.

Comments: Compresses only DNA sequence. Does not support FASTA format, N, IUPAC codes, mask.
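For sequence-only tools like this, a case mask can be kept outside the compressor as a list of lowercase runs. The following is a minimal sketch of that idea, not the code used in this benchmark. The function names extract_mask and apply_mask are hypothetical.

```python
def extract_mask(sequence: str):
    """Return the upper-cased sequence and a list of (start, length) lowercase runs."""
    runs = []
    start = None
    for i, ch in enumerate(sequence):
        if ch.islower():
            if start is None:
                start = i          # a lowercase run begins here
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:          # run extends to the end of the sequence
        runs.append((start, len(sequence) - start))
    return sequence.upper(), runs


def apply_mask(sequence: str, runs) -> str:
    """Restore lowercase runs recorded by extract_mask."""
    chars = list(sequence)
    for start, length in runs:
        chars[start:start + length] = [c.lower() for c in chars[start:start + length]]
    return "".join(chars)
```

The upper-cased sequence goes to the compressor, and the run list is stored next to its output so that the original case can be restored after decompression.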

gtz

Version tested: GTX.Zip PROFESSIONAL-2.1.3-V-2020-03-18 07:11:20, binary from https://gtz.io/gtz_latest.run

Comments: Does not accept data from standard input during compression. Refuses to compress if the output file name does not end with ".gtz". According to the EULA, it phones home. GTX.Zip Professional expires and stops working 6 months after installation, rendering it useless for reproducible experiments (Issue #20). Installation script modifies the user's .bashrc file without asking or notifying the user (Issue #19). Closed source and non-free.

harc

Version tested: HARC commit cf35caf, 2019-10-04, built from source.

Comments: Has to be run from its own source directory. Recompiles itself on every run, which would be problematic when running multiple harc compression tasks in parallel. Uses 7z and bsc binaries. GitHub repository has no license.

jarvis

JARVIS appears to be a further development of GeCo and GeCo2.

Version tested: JARVIS v.1.1, 2019-04-30, built from source, commit d7daef5

Comments: Compresses only DNA sequence. Does not support FASTA format, N, IUPAC codes, mask.

kic

Version tested: KIC 0.2, 2015-11-25, binary from homepage: http://www.ysunlab.org/dist/kic.V0.2.zip.

Comments: Closed source. Uses 4 cores by default; trying to change to 1 core with "-n 1" always crashes.

leon

Version tested: Leon 1.0.0, 2016-02-27, Linux binary from GitHub: https://github.com/GATB/leon/releases.

Comments: Slows down massively as read length increases. Does not support IUPAC codes. Has no streaming mode. Crashes when compressing a sequence with a 1-character name (Issue #6). Uses current directory for temporary files (Issue #7). Generates broken read numbers when decompressing an archive with no headers (Issue #8). Crashes on some large data (Issue #9). Does not allow specifying output file name.

lfastqc

Version tested: LFastqC commit 60e5fda, 2019-02-28, with necessary fixes.

Comments: Works only when executed from its directory. Expects input in its directory. Uses current directory for temporary files, instead of TMPDIR. Fails during tar step. Expects sequence names to have identical length. During decompression it attempts to read incomplete sequence data while it's still being written by MFCompressD. Uses " | grep Hello" to silence console output of compressors that it uses. Not free since it depends on non-free MFCompress. Compression fails on 2.76 GB Helicobacter dataset and on larger data.

lfqc

Version tested: LFQC commit 59f56e0, 2016-01-06, with added fix from Issue #4. Also, parallel processing of names, sequences and qualities in lfqc.rb is changed to sequential to fix compression failures.

Comments: Corrupts data with irregularly formatted read names (Issue #5). Critical Issue #4 is closed but not fixed. Compression fails due to race condition (fixed by disabling parallel compression of names, sequences and qualities). Has to be run from its source directory. Uses zpaq with 4 threads, with no option to disable multithreading.

mfc

Version tested: MFCompress 1.01, 2013-09-03, 64-bit Linux binary from homepage: http://bioinformatics.ua.pt/software/mfcompress/.

Comments: Supports only DNA data. Has no streaming mode. Not free to use: "available for non-commercial use. For other uses, please send an email to [author's email]" - homepage.

minicom

"Minicom is a tool for compressing short reads in FASTQ. The minicom program is written in C++11 and works on Linux. It is availble under an open-source license."

The main minicom program is a shell script. It calls other tools, including bsc, 7z, md5sum, head, cp, mv, mkdir, tar, rm, make, as well as their own C++ code, which is recompiled on every run (this is where "make" comes in).

Version tested: commit 2360dd9, 2019-09-09.

Comments: Does not reproduce the original FASTQ file during decompression, only the sequence. Corrupts data with 5.6 GB input (Issue #3). Recompiles its C++ code for every run. Has to be run from within its directory. Automatically names output files. Has no streaming mode.

naf

Version tested: 1.1.0, 2019-10-01, built from source, obtained from GitHub: https://github.com/KirillKryukov/naf.

nuht

Version tested: commit 08a42a8, 2018-09-26, Linux binary.

Comments: Paper is not open access. Closed source. Uses 30 times as much memory as the input size. Auto-names output files.

pfish

Version tested: Pufferfish v.1.0 alpha, 2012-04-11, built from source, commit f1ddc4a.

Comments: Does not support FASTA format. Leaks memory during decompression (Issue #2). Fails on large data, such as 33 GB salamander genome (Issue #1).

quip

Version tested: Quip 1.1.8-8-g9165bb5, 2017-12-17, built from source, commit 9165bb5. Only compression without assembly is tested here.

Comments: Does not support non-standard sequence characters. Crashes during decompression if compressed file name does not end with ".quip", also if the compressed file is not in current directory.

spring

Version tested: SPRING commit 6536b1b, 2019-11-28, built from source.

Comments: Paper is not open access. GitHub repository has no license.

uht

Version tested: UHT binaries from 2016-12-27, downloaded from GitHub: https://github.com/aalokaily/Unbalanced-Huffman-Tree.

Comments: Closed source. Does not support masked sequence. Fails on 245 MB dataset and larger datasets.

xm

Version tested: 3.0, commit 9b9ea57, 2019-01-07.

Comments: May corrupt data (Issue #29).


General-purpose compressors

bcm

Version tested: BCM 1.30, 2018-01-21, commit 24b6017, built with: g++ -o bcm -O3 -march=native -ffast-math -s bcm.cpp divsufsort.c.

Comments: No streaming mode.

brieflz

Version tested: BriefLZ 1.3.0, 2020-02-15, commit 0ab07a5, built from source using: mkdir build; cd build; cmake -DCMAKE_BUILD_TYPE=Release ..; cmake --build . --config Release.

Comments: No streaming mode. Unpredictable compression speed when using "--optimal" setting (Issue #11).

brotli

Version tested: 1.0.7, 2018-10-23.

bsc

Version tested: 3.1.0, 2016-01-01, commit 3dea347, built from source using bundled makefile.

Comments: No streaming mode.

bzip2

Version tested: 1.0.6, 2010-09-06

cmix

Version tested: 17, 2019-03-24.

Comments: No streaming mode. Because of compression speed of less than 1 kB/s, it is currently benchmarked only on data smaller than 10 MB.

copy

"Copy" compressors don't compress the data, but make its exact uncompressed duplicate. Such processes tested here include the "cat" command, and the "-0" mode of pigz. They are included for control.

gzip

Version tested: 1.6, 2013-06-09, default install that came with the OS (Ubuntu).

lizard

Version tested: 1.0.0, commit dda3b33, 2019-03-08.

lz4

Version tested: LZ4 1.9.1, 2019-04-24.

lzop

Version tested: 1.04, 2017-08-10.

lzturbo

Version tested: 1.2, 2014-08-11.

Comments: Closed source.

nakamichi

Version tested: Nakamichi 2020-May-09 (archived), built from source, using command: gcc -O3 -static -msse4.1 -fomit-frame-pointer Nakamichi_Ryuugan-ditto-1TB_btree.c -o nakamichi -D_N_XMM -D_N_prefetch_4096 -D_N_alone -DHashInBITS=24 -DHashChunkSizeInBITS=24 -DRAMpoolInKB=5120 -DBtreeHEURISTIC -D_POSIX_ENVIRONMENT_ -DLongestLineInclusive=128 -DSpeedUpBuilding=32 -DLITE.

Comments: Requires a massive amount of memory. Does not support streaming for compression. Fills console with ASCII art and irrelevant text. Creates multiple log files. Refuses to decompress any files with names not ending with ".Nakamichi". Version 2020-May-09 is already unavailable on both homepages, so I mirror it here in minimal form. Due to slowness it is currently only tested on datasets smaller than 200 MB.

pbzip2

Version tested: 1.1.13, 2015-12-18.

pigz

Version tested: 2.4, 2017-12-26.

snzip

Based on the Snappy compression library.

Version tested: 1.0.4, 2016-10-02.

xz

Based on the LZMA algorithm.

Version tested: 5.2.2, 2015-09-29.

zpaq

Version tested: 7.15, 2016-08-17.

Comments: No streaming mode.

zpipe

Version tested: 2.01, 2010-12-23, built from source (http://mattmahoney.net/dc/zpipe.201) with libzpaq 4.00, 2011-11-13 (http://mattmahoney.net/dc/libzpaq400.zip), built with: g++ -o zpipe -O3 -march=native -ffast-math -s zpipe.cpp libzpaq.cpp.

zstd

Version tested: 1.4.5, 2020-05-22, built from source using bundled makefile.