Sequence Compression Benchmark

Missing Compressors

This page lists some sequence compressors that we did not benchmark (yet).

We would appreciate any help in getting these compressors to work.

AFRESh

Paper: Tom Paridaens, Glenn Van Wallendael, Wesley De Neve, Peter Lambert (2017) "AFRESh: an adaptive framework for compression of reads and assembled sequences with random access functionality" Bioinformatics, 33(10), 1464-1472, https://doi.org/10.1093/bioinformatics/btx001
GitHub: https://github.com/tparidae/AFresh

"We propose AFRESh, an adaptive framework for no-reference compression of genomic data with random access functionality, targeting the effective representation of the raw genomic symbol streams of both reads and assembled sequences. AFRESh makes use of a configurable set of prediction and encoding tools, extended by a Context-Adaptive Binary Arithmetic Coding scheme (CABAC), to compress raw genetic codes."

Only a Windows executable is available, so we can't include it in its current form: this benchmark runs under Linux.

To do: Contact the authors about releasing the source or a Linux binary; possibly try running it under Wine.

AQUa

Paper: Tom Paridaens, Glenn Van Wallendael, Wesley De Neve, Peter Lambert (2018) "AQUa: an adaptive framework for compression of sequencing quality scores with random access functionality" Bioinformatics, 34(3), 425-433, https://doi.org/10.1093/bioinformatics/btx607
GitHub: https://github.com/tparidae/AQUa

"This article proposes AQUa, an adaptive framework for lossless compression of quality scores. To compress these quality scores, AQUa makes use of a configurable set of coding tools, extended with a Context-Adaptive Binary Arithmetic Coding scheme."

It's a compressor for quality scores only.

Assembltrie

Paper: Antonio A. Ginart, Joseph Hui, Kaiyuan Zhu, Ibrahim Numanagic, Thomas A. Courtade, S. Cenk Sahinalp, David N. Tse (2018) "Optimal compressed representation of high throughput sequence data via light assembly" Nature Communications, 9, 566, https://doi.org/10.1038/s41467-017-02480-6
GitHub: https://github.com/kyzhu/assembltrie

A compressor for FASTQ data: "Assembltrie is a software tool for compressing collections of (fixed length) Illumina reads". Therefore it will need a wrapper.

"Currently, Assembltrie is the only FASTQ compressor that approaches the information theory limit for a given short read collection uniformly sampled from an underlying reference genome. Assembltrie becomes the first FASTQ compressor that achieves both combinatorial optimality and information theoretic optimality under fair assumptions."

To do: Download, build, test, make wrapper, benchmark.

BdBG

Paper: Rongjie Wang, Junyi Li, Yang Bai, Tianyi Zang, Yadong Wang (2018) "BdBG: a bucket-based method for compressing genome sequencing data with dynamic de Bruijn graphs" PeerJ, 6, e5611, https://www.doi.org/10.7717/peerj.5611
GitHub: https://github.com/rongjiewang/BdBG

"BdBG: a bucket-based method for compressing genome sequencing data with dynamic de Bruijn graphs."

It seems that it always reorders reads. This makes it incompatible with our requirements, since we only benchmark lossless compression.
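Whether a tool reorders reads can be checked by comparing the reads before compression with the reads after decompression. The snippet below is a minimal sketch of such a check, not part of the benchmark itself. The function name `classify_roundtrip` and the three labels are hypothetical and chosen only for illustration.

```python
from collections import Counter

def classify_roundtrip(original, restored):
    """Compare reads before compression with reads after decompression."""
    if original == restored:
        return "lossless"    # same reads in the same order
    if Counter(original) == Counter(restored):
        return "reordered"   # same multiset of reads, different order
    return "lossy"           # reads were changed, added or dropped

reads = ["ACGT", "TTGA", "ACGT", "GGCC"]
print(classify_roundtrip(reads, list(reads)))    # identical copy
print(classify_roundtrip(reads, sorted(reads)))  # same reads, new order
print(classify_roundtrip(reads, reads[:-1]))     # one read is missing
```

Under the requirements stated above, only the first outcome would qualify a compressor for the benchmark.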

BIND

Paper: Tungadri Bose, Monzoorul Haque Mohammed, Anirban Dutta, Sharmila S. Mande (2012) "BIND - An algorithm for loss-less compression of nucleotide sequence data" Journal of Biosciences, 37, 785-789, https://www.doi.org/10.1007/s12038-012-9230-6

"By adopting a unique 'block-length' encoding for representing binary data (as a key step), BIND achieves significant compression gains as compared to the widely used general purpose compression algorithms (gzip, bzip2 and lzma)."

Link from the paper is dead: http://metagenomics.atc.tcs.com/compression/BIND

To do: Try contacting the authors to obtain a copy of this compressor.

biocompress

Paper: Stephane Grumbach, Fariza Tahi (1993) "Compression of DNA sequences" Data Compression Conference, 1993 (DCC'93), 340-350, https://www.doi.org/10.1109/DCC.1993.253115

"We propose a lossless algorithm to compress the information contained in DNA sequences. None of the available universal algorithms compress such data."

Not sure where to find this compressor.

biocompress-2

Paper: Stephane Grumbach, Fariza Tahi (1994) "A New Challenge for Compression Algorithms: Genetic Sequences" Information Processing & Management, 30(6), 875-886, https://www.doi.org/10.1016/0306-4573(94)90014-0
Homepage: https://who.rocq.inria.fr/Stephane.Grumbach/biocompress.html

"We then present a lossless algorithm, biocompress-2, to compress the information contained in DNA and RNA sequences, based on the detection of regularities, such as the presence of palindromes. The algorithm combines substitutional and statistical methods, and to the best of our knowledge, leads to the highest compression of DNA."

Does not seem to compile with recent GCC.

To do: Try to investigate and fix the issues.

CDNA

Paper: David Loewenstern, Peter N. Yianilos (1999) "Significantly Lower Entropy Estimates for Natural DNA Sequences" Journal of Computational Biology, 6(1), 125-142, https://www.doi.org/10.1089/cmb.1999.6.125

We introduce a new statistical model (compression algorithm), the strongest reported to date, for naturally occurring DNA sequences.

No software is shared in the paper.

Cfact

Paper: Eric Rivals, Jean-Paul Delahaye, Max Dauchet, Oliver Delgrange (1995) "A Guaranteed Compression Scheme for Repetitive DNA Sequences" Technical Report IT-95-285, LIFL, Universite des Sciences et Technologies de Lille
Paper: E. Rivals, J-P. Delahaye, M. Dauchet, O. Delgrange (1996) "A Guaranteed Compression Scheme for Repetitive DNA Sequences" Data Compression Conference (DCC '96), https://www.doi.org/10.1109/DCC.1996.488385

In the parsing phase, the suffix tree is built to select repeats for the dictionary. In the encoding phase, selected repetitions for which a guarantee of gain is established, are encoded.

Link from the paper is dead.

DARRC

Paper: Guillaume Holley, Roland Wittler, Jens Stoye, Faraz Hach (2018) "Dynamic Alignment-Free and Reference-Free Read Compression" Journal of Computational Biology, 25(7), 825-836, https://doi.org/10.1089/cmb.2018.0068

"In this article, we present dynamic alignment-free and reference-free read compression (DARRC), a new alignment-free and reference-free compression method. It addresses the problem of pangenome compression by encoding the sequences of a pangenome as a guided de Bruijn graph."

Paper is not open access. Abstract has no link to software.

DNABIT Compress

Paper: Pothuraju Rajarajeswari, Allam Apparao (2011) "DNABIT Compress - Genome compression algorithm" Bioinformation, 5(8): 350-360, https://doi.org/10.6026/97320630005350

Our proposed algorithm achieves the best compression ratio for DNA sequences for larger genome. Significantly better compression results show that "DNABIT Compress" algorithm is the best among the remaining compression algorithms. While achieving the best compression ratios for DNA sequences (Genomes), our new DNABIT Compress algorithm significantly improves the running time of all previous DNA compression programs.

No software is shared in the paper.

DNACompress

Paper: Xin Chen, Ming Li, Bin Ma, John Tromp (2002) "DNACompress: fast and effective DNA sequence compression" Bioinformatics, 18(12), 1696-1698, https://doi.org/10.1093/bioinformatics/18.12.1696

"While achieving the best compression ratios for DNA sequences, our new DNACompress program significantly improves the running time of all previous DNA compression programs."

Not available anymore, link from paper is dead.

Requires PatternHunter. Therefore it's an officially dead project, because PatternHunter is not available anymore: "Unfortunately PatternHunter is an old product and is not distributed anymore." - email from its vendor.

DNAC-SBE

Paper: Deloula Mansouri, Xiaohui Yuan, Abdeldjalil Saidani (2020) "A New Lossless DNA Compression Algorithm Based on A Single-Block Encoding Scheme" Algorithms, 13, 99, https://doi.org/10.3390/a13040099

"DNAC-SBE is a lossless hybrid compressor that consists of three phases. First, starting from the largest base (Bi), the positions of each Bi are replaced with ones and the positions of other bases that have smaller frequencies than Bi are replaced with zeros. Second, to encode the generated streams, we propose a new single-block encoding scheme (SEB) based on the exploitation of the position of neighboring bits within the block using two different techniques. Finally, the proposed algorithm dynamically assigns the shorter length code to each block. Results show that DNAC-SBE outperforms state-of-the-art compressors and proves its efficiency in terms of special conditions imposed on compressed data, storage space and data transfer rate regardless of the file format or the size of the data."

No software is shared in the paper.

DualFqz

Paper: Guillermo Dufort y Alvarez, Gadiel Seroussi, Pablo Smircich, Jose Sotelo, Idoia Ochoa, Alvaro Martin (2019) "Compression of Nanopore FASTQ Files" In: Rojas I., Valenzuela O., Rojas F., Ortuno F. (eds) Bioinformatics and Biomedical Engineering, IWBBIO 2019, Lecture Notes in Computer Science, vol 11465. Springer, Cham, https://doi.org/10.1007/978-3-030-17938-0_4
GitHub: https://github.com/guidufort/DualFqz

We propose a nanopore quality scores compressor, called DualCtx, which yields significant improvements in compression performance with respect to the state-of-the-art. We also extend DualCtx to a full FASTQ compressor, termed DualFqz, by substituting DualCtx for the quality score compression module in a variant of Fqzcomp.

fqzcomp is already included in the benchmark, and currently we don't use any FASTQ test data. Therefore adding this compressor would be redundant.

FaStore

GitHub: https://github.com/refresh-bio/FaStore

"FaStore is a high-performance short FASTQ sequencing reads compressor."

Always reorders the reads, making it unsuitable for lossless compression.

GABAC

Paper: Jan Voges, Tom Paridaens, Fabian Muntefering, Liudmila S. Mainzer, Brian Bliss, Mingyu Yang, Idoia Ochoa, Jan Fostier, Jorn Ostermann, Mikel Hernaez (2020) "GABAC: an arithmetic coding solution for genomic data" Bioinformatics, 36(7), 2275-2277, https://doi.org/10.1093/bioinformatics/btz922
GitHub: https://github.com/mitogen/gabac

As far as I understand, this is not a complete standalone compressor but an "entropy codec" meant to be used inside other compressors. Therefore it is those other compressors that should be benchmarked, not this codec by itself.

GenCompress

Paper: Xin Chen, Sam Kwong, Ming Li (1999) "A Compression Algorithm for DNA Sequences and Its Applications in Genome Comparison" Genome Informatics Workshop 1999 (GIW99), https://pubmed.ncbi.nlm.nih.gov/11072342/, PDF: https://www.jsbi.org/pdfs/journal1/GIW99/GIW99F06.pdf
Homepage: https://www.cs.cityu.edu.hk/~cssamk/gencomp/GenCompress1.htm

"We present a lossless compression algorithm, GenCompress, for genetic sequences, based on searching for approximate repeats. Our algorithm achieves the best compression ratios for benchmark DNA sequences"

Only executable files are offered at the download page (https://www.cs.cityu.edu.hk/~cssamk/gencomp/downGen.htm). They are most likely Windows binaries, so we can't easily use them in their current form.

To do: Try running the binaries, possibly using Wine.

Genie

GitHub: https://github.com/mitogen/genie

A compressor for FASTQ data.

To do: Download, build, test, make wrapper, benchmark.

G-SQZ

Paper: Waibhav Tembe, James Lowey, Edward Suh (2010) "G-SQZ: compact encoding of genomic sequence and quality data" Bioinformatics, 26(17), 2192-2194, https://doi.org/10.1093/bioinformatics/btq346

"We present G-SQZ, a Huffman coding-based sequencing-reads-specific representation scheme that compresses data without altering the relative order."

Link from the paper is dead: http://public.tgen.org/sqz

Does not seem to be available anymore. Email from the paper is dead too.

KungFQ

Paper: Elena Grassi, Federico Di Gregorio, Ivan Molineris (2012) "KungFQ: A Simple and Powerful Approach to Compress fastq Files" IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9(6), 1837-1842, https://doi.org/10.1109/TCBB.2012.123
Homepage: http://quicktsaf.sourceforge.net/

"We developed a tool that takes advantages of fastq characteristics and encodes them in a binary format optimized in order to be further compressed with standard tools (such as gzip or lzma)."

Written in C# for .NET.

To do: Try running it on Linux, possibly using Mono.

LCTD

Paper: Jiabing Fu, Yacong Ma, Bixin Ke, Shoubin Dong (2016) "LCTD: a Lossless Compression Tool of FASTQ File Based on Transformation of Original File Distribution" 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 864-869, https://doi.org/10.1109/BIBM.2016.7822639

"Instead of elaborating excellent data structure and compression technique based on the original FASTQ file, we try to change the distribution of original FASTQ file so as to make it better for further compression by existing compression tools."

"The source program is available by sending email to us."

No website. The author does not return emails.

LW-FQZip

Paper: Yongpeng Zhang, Linsen Li, Yanli Yang, Xiao Yang, Shan He, Zexuan Zhu (2015) "Light-weight reference-based compression of FASTQ data" BMC Bioinformatics, 16, 188, https://doi.org/10.1186/s12859-015-0628-7
Homepage: http://csse.szu.edu.cn/staff/zhuzx/LWFQZip/

"This paper presents a lossless light-weight reference-based compression algorithm namely LW-FQZip to compress FASTQ data."

This is a reference-based compressor, so it can't participate in our benchmark.

LW-FQZip 2

Paper: Zhi-An Huang, Zhenkun Wen, Qingjin Deng, Ying Chu, Yiwen Sun, Zexuan Zhu (2017) "LW-FQZip 2: a parallelized reference-based compression of FASTQ files" BMC Bioinformatics, 18, 179, https://doi.org/10.1186/s12859-017-1588-x
Homepage: http://csse.szu.edu.cn/staff/zhuzx/lwfqzip2/

"LW-FQZip 2 is improved from LW-FQZip 1 by introducing more efficient coding scheme and parallelism. Particularly, LW-FQZip 2 is equipped with a light-weight mapping model, bitwise prediction by partial matching model, arithmetic coding, and multi-threading parallelism."

This is a reference-based compressor, so it can't participate in our benchmark.

MINCE

Paper: Rob Patro, Carl Kingsford (2015) "Data-dependent Bucketing Improves Reference-Free Compression of Sequencing Reads" Bioinformatics, 31(17), 2770-2777, https://doi.org/10.1093/bioinformatics/btv248
Homepage: http://www.cs.cmu.edu/~ckingsf/software/mince

"We present a novel technique to boost the compression of sequencing that is based on the concept of bucketing similar reads so that they appear nearby in the file."

Looks like it is based on reordering reads, therefore it can't take part in our lossless compression benchmark.

MZPAQ

Paper: Achraf El Allali, Mariam Arshad (2019) "MZPAQ: a FASTQ data compression tool" Source Code for Biology and Medicine, 14, 3, https://doi.org/10.1186/s13029-019-0073-5

"Our tool is a hybrid of MFCompress (v 1.01) and ZPAQ (v 7.15), hence the name MZPAQ. In order to compress a FASTQ file, MZPAQ scans the input file and divides it into the four streams of FASTQ format. The first two streams (i.e. read identifier and read sequence) are compressed using MFCompress after the identifier stream is pre-processed to comply with the format restrictions of MFCompress. The third stream is discarded [...]. The fourth stream is compressed using the strong context-mixing algorithm ZPAQ."

The paper has no link to software. Despite the journal's name, no source code is shared in the paper.

To do: possibly re-implement and benchmark, if anyone has interest.
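As a possible starting point for such a re-implementation, the stream split described in the quote above could look roughly like the sketch below. It assumes strict four-line FASTQ records (no wrapped sequence lines). The function name `split_fastq_streams` is hypothetical, and the actual MFCompress and ZPAQ compression steps are not shown.

```python
def split_fastq_streams(text):
    """Split four-line FASTQ records into identifier, sequence and quality streams."""
    lines = text.splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    return {
        "identifiers": lines[0::4],  # first line of each record
        "sequences": lines[1::4],    # second line of each record
        # the third line ("+") is discarded, as in the quoted description
        "qualities": lines[3::4],    # fourth line of each record
    }

example = "@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nFFFF\n"
streams = split_fastq_streams(example)
print(streams["identifiers"])
print(streams["sequences"])
print(streams["qualities"])
```

Each stream could then be handed to a separate compressor, which is the design the paper describes.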

OBComp

Paper: Deloula Mansouri, Xiaohui Yuan (2018) "One-Bit DNA Compression Algorithm" International Conference on Neural Information Processing (ICONIP 2018), 378-386, https://doi.org/10.1007/978-3-030-04239-4_34

Unlike direct coding technique where two bits are assigned to each nucleotide resulting compression ratio of 2 bits per byte (bpb), OBComp used just a single bit 0 or 1 to code the two highest occurrence nucleotides. The positions of the two others are saved.

No software is shared in the paper.
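Since no implementation is available, the following is only a rough sketch of the idea as summarized above: the two most frequent nucleotides get one bit each, and the remaining symbols are stored separately with their positions. It is an interpretation for illustration, not the authors' algorithm. The function names and the returned layout are assumptions, and the bits are kept as a plain list rather than packed.

```python
from collections import Counter

def one_bit_encode(seq):
    """Code the two most frequent symbols as bits 0/1; save the others by position."""
    top = [base for base, _ in Counter(seq).most_common(2)]
    bit_of = {base: bit for bit, base in enumerate(top)}
    bits, exceptions = [], []
    for pos, base in enumerate(seq):
        if base in bit_of:
            bits.append(bit_of[base])
        else:
            exceptions.append((pos, base))  # position and symbol of a rarer base
    return top, bits, exceptions

def one_bit_decode(top, bits, exceptions):
    """Rebuild the sequence from the bit stream and the saved exceptions."""
    out = [None] * (len(bits) + len(exceptions))
    for pos, base in exceptions:
        out[pos] = base
    remaining = iter(bits)
    for i, symbol in enumerate(out):
        if symbol is None:
            out[i] = top[next(remaining)]
    return "".join(out)

seq = "AACAAGTACA"
top, bits, exceptions = one_bit_encode(seq)
print(top, bits, exceptions)
print(one_bit_decode(top, bits, exceptions) == seq)
```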

Off-Line

Paper: Alberto Apostolico, Stefano Lonardi (2000) "Off-line Compression by Greedy Textual Substitution" Proceedings of the IEEE, 88(11), November 2000, 1733-1744
Homepage: http://www.cs.ucr.edu/~stelo/Offline/

"Compression of texts via greedy off-line textual substitution refers to the possibility to identify a particularly redundant word in the text and to replace all of its non-overlapped occurrences (but one) with pointers, in order to get the highest possible compression; the process is then iterated on the compressed text until a word capable of producing further compression can no longer be found."

So far we can't build it with modern GCC.

To do: Try to investigate and fix compilation issues.

ORCOM

Paper: Szymon Grabowski, Sebastian Deorowicz, Lukasz Roguski (2015) "Disk-based compression of data from genome sequencing" Bioinformatics, 31(9), 1389-1395, https://doi.org/10.1093/bioinformatics/btu844
Homepage: http://sun.aei.polsl.pl/REFRESH/index.php?page=projects&project=orcom&subpage=about

"Overlapping Reads COmpression with Minimizers is a compressor of sequencing reads. It takes as an input FASTQ files (possibly gzipped) and stores the DNA symbols of each read in a highly-compressed form. Id and quality fields are not stored. Thus, ORCOM cannot be treated as a full-fledged FASTQ compressor."

It seems to always re-order the reads. This makes it impossible to use it for lossless compression.

PgRC

Paper: Tomasz M. Kowalski, Szymon Grabowski (2020) "PgRC: pseudogenome-based read compressor" Bioinformatics, 36(7), 2082-2089, https://doi.org/10.1093/bioinformatics/btz919
GitHub: https://github.com/kowallus/PgRC

"Pseudogenome-based Read Compressor (PgRC) is an in-memory algorithm for compressing the DNA stream of FASTQ datasets, based on the idea of building an approximation of the shortest common superstring over high-quality reads."

To do: Download, build, test, make wrapper, benchmark.

QuickTsaf

Former name for KungFQ.

ReCoil

Paper: Vladimir Yanovsky (2011) "ReCoil - an algorithm for compression of extremely large datasets of dna data" Algorithms for Molecular Biology, 6, 23, https://doi.org/10.1186/1748-7188-6-23

"In this work we present ReCoil - an I/O efficient external memory algorithm designed for compression of very large collections of short reads DNA data."

Paper has no link to software. Author does not return emails.

RP/GP²

Paper: Syed Mahamud Hossein, Debashis De, Pradeep Kumar Das Mohapatra (2019) "DNA sequence compression using RP/GP² method with information storage and security" Microsystem Technologies, https://doi.org/10.1007/s00542-019-04481-5

"Proposed algorithm will be based on combinations of reverse and palindrome (RP) technique or genetic palindrome and palindrome (GP²) technique substring substitution. The variable substring will be replaced by the corresponding American Standard Code for Information Interchange (ASCII) code which is extracted RP/GP²"

"The DNA sequence security is ensured by signature which depends on ASCII code and dynamic library file acting as a key. This approach shows that 95% of original file is modified when 44-45% is encrypted. The experimental results shows that the compression rate is 3.7750 bits/base."

No software is shared in the paper.

SCALCE

Paper: Faraz Hach, Ibrahim Numanagic, Can Alkan, S. Cenk Sahinalp (2012) "SCALCE: boosting sequence compression algorithms using locally consistent encoding" Bioinformatics, 28(23), 3051-3057, https://doi.org/10.1093/bioinformatics/bts593
GitHub: https://github.com/sfu-compbio/scalce
Homepage: http://sfu-compbio.github.io/scalce/

"Here we present SCALCE, a 'boosting' scheme based on Locally Consistent Parsing technique, which reorganizes the reads in a way that results in a higher compression speed and compression rate, independent of the compression algorithm in use and without using a reference genome."

Can't use it because of "Segmentation fault" crash (Issue #8).

In addition, it seems that it always reorders reads, which makes it lossy and therefore incompatible with our requirements.

SeqCompress

Paper: Muhammad Sardaraz, Muhammad Tahir, Ataul Aziz Ikram, Hassan Bajwa (2014) "SeqCompress: An algorithm for biological sequence compression" Genomics, 104, 225-228, https://doi.org/10.1016/j.ygeno.2014.08.007

"This article presents a DNA sequence compression algorithm SeqCompress that copes with the space complexity of biological sequences. The algorithm is based on lossless data compression and uses statistical model as well as arithmetic coding to compress DNA sequences."

No software is shared in the paper. None of the authors return emails.

SOLiDzipper

Paper: Young Jun Jeon, Sang Hyun Park, Sung Min Ahn, Hee Joung Hwang (2011) "SOLiDzipper: A High Speed Encoding Method for the Next-Generation Sequencing Data" Evolutionary Bioinformatics Online, 7, 1-6, https://doi.org/10.4137/EBO.S6618

"In SOLiDzipper, the non-sequence information including the sequence IDs and number in plain text format is encoded by a general purpose compression algorithm (ie, gzip, bzip2, lzma(LZMA SDK)), whereas the sequence information consisting of '0123' in csfasta format, which has random patterns and thus a low encoding efficiency, is encoded by bitwise and shift operations."

Link from the paper is dead: http://szipper.dinfree.com/

Author's email response: "Unfortunately we couldn't continue the research and there is no left software materials."
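Because the original software is lost, the snippet below is only a generic illustration of what encoding a '0123' string "by bitwise and shift operations" can mean: each symbol fits in two bits, so four symbols fit in one byte. The function names are hypothetical, and the sketch handles only the digit part of a csfasta read, not a leading base letter or other characters.

```python
def pack_colors(colors):
    """Pack a string over '0123' into bytes, four symbols per byte."""
    packed = bytearray()
    for start in range(0, len(colors), 4):
        chunk = colors[start:start + 4]
        byte = 0
        for ch in chunk:
            if ch not in "0123":
                raise ValueError(f"unexpected symbol: {ch!r}")
            byte = (byte << 2) | int(ch)   # append two bits per symbol
        byte <<= 2 * (4 - len(chunk))      # left-align a short final chunk
        packed.append(byte)
    return len(colors), bytes(packed)

def unpack_colors(length, packed):
    """Reverse pack_colors; `length` trims the padding in the last byte."""
    symbols = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            symbols.append(str((byte >> shift) & 0b11))
    return "".join(symbols[:length])

length, data = pack_colors("0123012")
print(length, data.hex())
print(unpack_colors(length, data))
```

Storing the symbol count alongside the bytes is one simple way to handle inputs whose length is not a multiple of four.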

SRComp

Paper: Jeremy John Selva, Xin Chen (2013) "SRComp: Short Read Sequence Compression Using Burstsort and Elias Omega Coding" PLoS ONE, 8(12), e81414, https://doi.org/10.1371/journal.pone.0081414

"In this paper, we introduce a new non-reference based read sequence compression tool called SRComp. It works by first employing a fast string-sorting algorithm called burstsort to sort read sequences in lexicographical order and then Elias omega-based integer coding to encode the sorted read sequences."

Links from the paper are dead: http://www1.spms.ntu.edu.sg/~chenxin/SRComp, http://www.cs.mu.oz.au/~rsinha/resources/source/sort/allsorts/allsorts.zip.

Email from paper is dead as well.
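For readers unfamiliar with the integer code named in the quote above, here is a textbook sketch of Elias omega coding for positive integers, using bit strings for readability. It is not SRComp's code, and how SRComp maps sorted reads to integers is not covered. The function names are hypothetical.

```python
def elias_omega_encode(n):
    """Return the Elias omega code of a positive integer as a bit string."""
    if n < 1:
        raise ValueError("Elias omega is defined for positive integers only")
    code = "0"                    # terminating zero
    while n > 1:
        binary = bin(n)[2:]       # binary form, always starts with '1'
        code = binary + code      # prepend the current group
        n = len(binary) - 1       # next, encode the group length minus one
    return code

def elias_omega_decode(bits, pos=0):
    """Decode one integer starting at `pos`; return (value, next position)."""
    n = 1
    while bits[pos] == "1":
        group = bits[pos:pos + n + 1]   # a group has n + 1 bits
        pos += n + 1
        n = int(group, 2)
    return n, pos + 1                   # skip the terminating zero

for value in (1, 2, 4, 16):
    print(value, elias_omega_encode(value))

stream = "".join(elias_omega_encode(v) for v in (5, 1, 100))
first, pos = elias_omega_decode(stream)
second, pos = elias_omega_decode(stream, pos)
third, pos = elias_omega_decode(stream, pos)
print(first, second, third)
```

The code is self-delimiting, so consecutive integers can be concatenated without separators, as the last lines show.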

slimfastq

GitHub: https://github.com/Infinidat/slimfastq
SourceForge: https://sourceforge.net/projects/slimfastq/

"slimfastq would efficiently compresses/decompresses fastq files."

To do: Download, build, test, make wrapper, benchmark.

WBFQC

Paper: Sanjeev Kumar, Suneeta Agarwal, Ranvijay (2018) "WBFQC: A New Approach for Compressing Next-Generation Sequencing Data Splitting Into Homogeneous Streams" Journal of Bioinformatics and Computational Biology, 16(5), 1850018, https://doi.org/10.1142/S021972001850018X
Homepage: http://www.algorithm-skg.com/wbfqc/home.html

"This paper presents a lossless non-reference-based FastQ file compression approach, segregating the data into three different streams and then applying appropriate and efficient compression algorithms on each. Experiments show that the proposed approach (WBFQC) outperforms other state-of-the-art approaches for compressing NGS data in terms of compression ratio (CR), and compression and decompression time."

To do: Download, build, test, make wrapper, benchmark.
