Sequence Compression Benchmark

Missing Compressors

This page lists some sequence compressors that we did not benchmark (yet).

We would appreciate any help getting these compressors to work.

AFRESh

"We propose AFRESh, an adaptive framework for no-reference compression of genomic data with random access functionality, targeting the effective representation of the raw genomic symbol streams of both reads and assembled sequences. AFRESh makes use of a configurable set of prediction and encoding tools, extended by a Context-Adaptive Binary Arithmetic Coding scheme (CABAC), to compress raw genetic codes."

Only a Windows executable is available, so we can't include it in its current form: this benchmark runs under Linux.

To do: Contact authors about releasing source or Linux binary, possibly try running it using Wine.

AQUa

"This article proposes AQUa, an adaptive framework for lossless compression of quality scores. To compress these quality scores, AQUa makes use of a configurable set of coding tools, extended with a Context-Adaptive Binary Arithmetic Coding scheme."

It's a compressor for quality scores only.

Assembltrie

A compressor for FASTQ data: "Assembltrie is a software tool for compressing collections of (fixed length) Illumina reads". Therefore it will need a wrapper.

"Currently, Assembltrie is the only FASTQ compressor that approaches the information theory limit for a given short read collection uniformly sampled from an underlying reference genome. Assembltrie becomes the first FASTQ compressor that achieves both combinatorial optimality and information theoretic optimality under fair assumptions."

To do: Download, build, test, make wrapper, benchmark.

BdBG

"BdBG: a bucket-based method for compressing genome sequencing data with dynamic de Bruijn graphs."

It seems that it always reorders reads. This makes it incompatible with our requirements, since we only benchmark lossless compression.
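The difference between a strictly lossless round trip and one that only preserves the set of reads can be sketched as follows. This is a minimal illustration, not the check used by the benchmark: the function names and the one-read-per-line toy data are assumptions made for the example.

```python
from collections import Counter


def is_lossless(original: bytes, restored: bytes) -> bool:
    """Strict check: the restored data must match the input byte for byte."""
    return original == restored


def same_reads_any_order(original: bytes, restored: bytes) -> bool:
    """Weaker check: the same reads with the same multiplicity, order ignored."""
    return Counter(original.splitlines()) == Counter(restored.splitlines())


# Toy data, one read per line (hypothetical).
original = b"ACGT\nTTGA\nACGT\nGGCC\n"
reordered = b"ACGT\nACGT\nGGCC\nTTGA\n"

print(is_lossless(original, original))
print(is_lossless(original, reordered))
print(same_reads_any_order(original, reordered))
```

A compressor that reorders reads passes only the weaker check, which is why it does not fit a benchmark that requires the strict one.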

BIND

"By adopting a unique 'block-length' encoding for representing binary data (as a key step), BIND achieves significant compression gains as compared to the widely used general purpose compression algorithms (gzip, bzip2 and lzma)."

Link from the paper is dead: http://metagenomics.atc.tcs.com/compression/BIND

To do: Try contacting the authors to obtain a copy of this compressor.

biocompress

"We propose a lossless algorithm to compress the information contained in DNA sequences. None of the available universal algorithms compress such data."

Not sure where to find this compressor.

biocompress-2

"We then present a lossless algorithm, biocompress-2, to compress the information contained in DNA and RNA sequences, based on the detection of regularities, such as the presence of palindromes. The algorithm combines substitutional and statistical methods, and to the best of our knowledge, leads to the highest compression of DNA."

Does not seem to compile with recent GCC.

To do: Try to investigate and fix the issues.

DARRC

"In this article, we present dynamic alignment-free and reference-free read compression (DARRC), a new alignment-free and reference-free compression method. It addresses the problem of pangenome compression by encoding the sequences of a pangenome as a guided de Bruijn graph."

Paper is not open access. Abstract has no link to software.

DNACompress

"While achieving the best compression ratios for DNA sequences, our new DNACompress program significantly improves the running time of all previous DNA compression programs."

Not available anymore: the link from the paper is dead.

Requires PatternHunter. Therefore it's an officially dead project, because PatternHunter is not available anymore: "Unfortunately PatternHunter is an old product and is not distributed anymore." - email from its vendor.

DNAC-SBE

"DNAC-SBE is a lossless hybrid compressor that consists of three phases. First, starting from the largest base (Bi), the positions of each Bi are replaced with ones and the positions of other bases that have smaller frequencies than Bi are replaced with zeros. Second, to encode the generated streams, we propose a new single-block encoding scheme (SEB) based on the exploitation of the position of neighboring bits within the block using two different techniques. Finally, the proposed algorithm dynamically assigns the shorter length code to each block. Results show that DNAC-SBE outperforms state-of-the-art compressors and proves its efficiency in terms of special conditions imposed on compressed data, storage space and data transfer rate regardless of the file format or the size of the data."

No software is shared in the paper.
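Since no software is available, the first phase described in the quote can only be sketched from its wording. The code below is our reading of that phase alone: bases are processed from most to least frequent, and each base gets a bit stream over the positions not yet claimed by a more frequent base. The function names, the alphabetical tie-breaking and the string-of-bits representation are assumptions; the block encoding and code assignment phases are not modeled.

```python
from collections import Counter


def binary_streams(seq):
    """Return (base order, one bit string per base except the rarest)."""
    counts = Counter(seq)
    # Most frequent base first; ties broken alphabetically (an assumption).
    order = sorted(counts, key=lambda b: (-counts[b], b))
    streams = []
    remaining = seq
    for base in order[:-1]:  # the rarest base is implied by what is left
        streams.append("".join("1" if c == base else "0" for c in remaining))
        remaining = "".join(c for c in remaining if c != base)
    return order, streams


def rebuild(order, streams, length):
    """Invert binary_streams, given the original sequence length."""
    if not order:
        return ""
    n_last = length - sum(bits.count("1") for bits in streams)
    seq = order[-1] * n_last
    # Re-insert bases from the second rarest up to the most frequent.
    for base, bits in zip(reversed(order[:-1]), reversed(streams)):
        rest = iter(seq)
        seq = "".join(base if bit == "1" else next(rest) for bit in bits)
    return seq


seq = "AACAGTAC"
order, streams = binary_streams(seq)
print(order, streams)
print(rebuild(order, streams, len(seq)) == seq)
```

Each stream is shorter than the previous one, because positions already assigned to a more frequent base are dropped before the next base is processed.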

DualFqz

"We propose a nanopore quality scores compressor, called DualCtx, which yields significant improvements in compression performance with respect to the state-of-the-art. We also extend DualCtx to a full FASTQ compressor, termed DualFqz, by substituting DualCtx for the quality score compression module in a variant of Fqzcomp."

fqzcomp is already included in the benchmark, and currently we don't use any FASTQ test data. Therefore adding this compressor would be redundant.

FaStore

"FaStore is a high-performance short FASTQ sequencing reads compressor."

Always reorders the reads, making it unsuitable for lossless compression.

GABAC

As far as I understand, this is not a complete standalone compressor but an "entropy codec", meant to be used inside other compressors. Therefore it's those other compressors that should be benchmarked, not this codec by itself.

GenCompress

"We present a lossless compression algorithm, GenCompress, for genetic sequences, based on searching for approximate repeats. Our algorithm achieves the best compression ratios for benchmark DNA sequences"

Executable files are offered on the download page (https://www.cs.cityu.edu.hk/~cssamk/gencomp/downGen.htm). They are most likely Windows binaries, so we can't easily use them in their current form.

To do: Try running the binaries, possibly using Wine.

Genie

A compressor for FASTQ data.

To do: Download, build, test, make wrapper, benchmark.

G-SQZ

"We present G-SQZ, a Huffman coding-based sequencing-reads-specific representation scheme that compresses data without altering the relative order."

Link from the paper is dead: http://public.tgen.org/sqz

Does not seem to be available anymore. Email from the paper is dead too.

KungFQ

"We developed a tool that takes advantages of fastq characteristics and encodes them in a binary format optimized in order to be further compressed with standard tools (such as gzip or lzma)."

Written in C# for .NET.

To do: Try running it on Linux, possibly using Mono.

LCTD

"Instead of elaborating excellent data structure and compression technique based on the original FASTQ file, we try to change the distribution of original FASTQ file so as to make it better for further compression by existing compression tools."

"The source program is available by sending email to us."

No website. The author does not return emails.

LW-FQZip

"This paper presents a lossless light-weight reference-based compression algorithm namely LW-FQZip to compress FASTQ data."

This is a reference-based compressor, therefore it can't participate in our benchmark.

LW-FQZip 2

"LW-FQZip 2 is improved from LW-FQZip 1 by introducing more efficient coding scheme and parallelism. Particularly, LW-FQZip 2 is equipped with a light-weight mapping model, bitwise prediction by partial matching model, arithmetic coding, and multi-threading parallelism."

This is a reference-based compressor, therefore it can't participate in our benchmark.

MINCE

"We present a novel technique to boost the compression of sequencing that is based on the concept of bucketing similar reads so that they appear nearby in the file."

Looks like it is based on reordering reads, therefore it can't take part in our lossless compression benchmark.

MZPAQ

"Our tool is a hybrid of MFCompress (v 1.01) and ZPAQ (v 7.15), hence the name MZPAQ. In order to compress a FASTQ file, MZPAQ scans the input file and divides it into the four streams of FASTQ format. The first two streams (i.e. read identifier and read sequence) are compressed using MFCompress after the identifier stream is pre-processed to comply with the format restrictions of MFCompress. The third stream is discarded [...]. The fourth stream is compressed using the strong context-mixing algorithm ZPAQ."

The paper has no link to software. Despite the journal's name, no source code is shared in the paper.

To do: possibly re-implement and benchmark, if anyone has interest.
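As a starting point for a re-implementation, the stream split described in the quote might look like the sketch below. It assumes strict four-line FASTQ records (no wrapped sequences) and only separates the streams; the MFCompress and ZPAQ stages, and the identifier pre-processing, are not modeled. The function name is hypothetical.

```python
def split_fastq_streams(text):
    """Split four-line FASTQ records into (ids, seqs, plus_lines, quals)."""
    lines = text.splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected exactly 4 lines per FASTQ record")
    ids = lines[0::4]         # first stream: read identifiers
    seqs = lines[1::4]        # second stream: read sequences
    plus_lines = lines[2::4]  # third stream: '+' separator lines
    quals = lines[3::4]       # fourth stream: quality scores
    return ids, seqs, plus_lines, quals


fastq = "@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nHHHH\n"
ids, seqs, plus_lines, quals = split_fastq_streams(fastq)
print(ids, seqs, plus_lines, quals)
```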

Off-Line

"Compression of texts via greedy off-line textual substitution refers to the possibility to identify a particularly redundant word in the text and to replace all of its non-overlapped occurrences (but one) with pointers, in order to get the highest possible compression; the process is then iterated on the compressed text until a word capable of producing further compression can no longer be found."

Can't build it so far with modern GCC.

To do: Try to investigate and fix compilation issues.

ORCOM

"Overlaping Reads COmpression with Minimizers is a compressor of sequencing reads. It takes as an input FASTQ files (possibly gzipped) and stores the DNA symbols of each read in a highly-compressed form. Id and quality fields are not stored. Thus, ORCOM cannot be treated as a full-fledged FASTQ compressor."

It seems to always reorder the reads, which makes it impossible to use for lossless compression.

PgRC

"Pseudogenome-based Read Compressor (PgRC) is an in-memory algorithm for compressing the DNA stream of FASTQ datasets, based on the idea of building an approximation of the shortest common superstring over high-quality reads."

To do: Download, build, test, make wrapper, benchmark.

QuickTsaf

Former name for KungFQ.

ReCoil

"In this work we present ReCoil - an I/O efficient external memory algorithm designed for compression of very large collections of short reads DNA data."

Paper has no link to software. Author does not return emails.

RP/GP2

"Proposed algorithm will be based on combinations of reverse and palindrome (RP) technique or genetic palindrome and palindrome (GP2) technique substring substitution. The variable substring will be replaced by the corresponding American Standard Code for Information Interchange (ASCII) code which is extracted RP/GP2"

"The DNA sequence security is ensured by signature which depends on ASCII code and dynamic library file acting as a key. This approach shows that 95% of original file is modified when 44-45% is encrypted. The experimental results shows that the compression rate is 3.7750 bits/base."

No software is shared in the paper.

SCALCE

"Here we present SCALCE, a 'boosting' scheme based on Locally Consistent Parsing technique, which reorganizes the reads in a way that results in a higher compression speed and compression rate, independent of the compression algorithm in use and without using a reference genome."

Can't use it because of "Segmentation fault" crash (Issue #8).

In addition, it seems that it always reorders reads, which makes it lossy and therefore incompatible with our requirements.

SeqCompress

"This article presents a DNA sequence compression algorithm SeqCompress that copes with the space complexity of biological sequences. The algorithm is based on lossless data compression and uses statistical model as well as arithmetic coding to compress DNA sequences."

No software is shared in the paper. None of the authors return emails.

SOLiDzipper

"In SOLiDzipper, the non-sequence information including the sequence IDs and number in plain text format is encoded by a general purpose compression algorithm (ie, gzip, bzip2, lzma(LZMA SDK)), whereas the sequence information consisting of '0123' in csfasta format, which has random patterns and thus a low encoding efficiency, is encoded by bitwise and shift operations."

Link from the paper is dead: http://szipper.dinfree.com/

Author's email response: "Unfortunately we couldn't continue the research and there is no left software materials."
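With the original software gone, the "bitwise and shift operations" in the quote can only be illustrated generically. The sketch below packs a string of color digits '0' to '3' at two bits per digit, four digits per byte. The bit layout, the function names and the handling of a short final group are assumptions, not SOLiDzipper's actual format; the leading base letter and '.' symbols found in real csfasta reads are not handled.

```python
def pack_colors(digits):
    """Pack a string of '0'..'3' into bytes, four digits per byte."""
    out = bytearray()
    for i in range(0, len(digits), 4):
        chunk = digits[i:i + 4]
        byte = 0
        for ch in chunk:
            value = ord(ch) - ord("0")
            if not 0 <= value <= 3:
                raise ValueError(f"not a color digit: {ch!r}")
            byte = (byte << 2) | value
        # Left-align a short final chunk so unpacking is uniform.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_colors(data, count):
    """Recover the first `count` digits from packed bytes."""
    digits = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            digits.append(str((byte >> shift) & 3))
    return "".join(digits[:count])


colors = "0123012"
packed = pack_colors(colors)
print(len(colors), "digits ->", len(packed), "bytes")
print(unpack_colors(packed, len(colors)) == colors)
```

The digit count has to be stored separately, since a final byte padded with zero bits is otherwise indistinguishable from one that ends in real '0' digits.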

SRComp

"In this paper, we introduce a new non-reference based read sequence compression tool called SRComp. It works by first employing a fast string-sorting algorithm called burstsort to sort read sequences in lexicographical order and then Elias omega-based integer coding to encode the sorted read sequences."

Links from the paper are dead: http://www1.spms.ntu.edu.sg/~chenxin/SRComp, http://www.cs.mu.oz.au/~rsinha/resources/source/sort/allsorts/allsorts.zip.

Email from paper is dead as well.
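The integer code named in the quote, Elias omega coding, is a standard scheme and can be sketched independently of the lost software. The code below encodes positive integers as bit strings and decodes a concatenation of such codes. How SRComp turns sorted reads into integers is not stated in the quote, so that step is not modeled; the function names are ours.

```python
def omega_encode(n):
    """Elias omega code of a positive integer, as a string of bits."""
    if n < 1:
        raise ValueError("Elias omega codes are defined for n >= 1")
    code = "0"  # terminating bit
    while n > 1:
        binary = bin(n)[2:]
        code = binary + code  # prepend the current group
        n = len(binary) - 1   # next group encodes this length minus one
    return code


def omega_decode(bits):
    """Decode a concatenation of Elias omega codes into a list of integers."""
    values = []
    pos = 0
    while pos < len(bits):
        n = 1
        while bits[pos] == "1":
            width = n + 1  # a group starting with 1 is n + 1 bits long
            n = int(bits[pos:pos + width], 2)
            pos += width
        pos += 1  # skip the terminating 0
        values.append(n)
    return values


numbers = [1, 2, 4, 16, 100]
encoded = "".join(omega_encode(n) for n in numbers)
print([omega_encode(n) for n in numbers[:4]])
print(omega_decode(encoded) == numbers)
```

Small integers get short codes, which is why the scheme is paired with sorting: neighbouring sorted reads tend to differ by small amounts.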

slimfastq

"slimfastq would efficiently compresses/decompresses fastq files."

To do: Download, build, test, make wrapper, benchmark.

WBFQC

"This paper presents a lossless non-reference-based FastQ file compression approach, segregating the data into three different streams and then applying appropriate and efficient compression algorithms on each. Experiments show that the proposed approach (WBFQC) outperforms other state-of-the-art approaches for compressing NGS data in terms of compression ratio (CR), and compression and decompression time."

To do: Download, build, test, make wrapper, benchmark.