Sequence Compression Benchmark

Test datasets

Every test dataset is a single FASTA-formatted file. We tried to include a broad variety of commonly used DNA, RNA and protein datasets. Suggestions for additional interesting datasets are welcome!
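
Since every dataset is a single FASTA file, the number of sequences in a decompressed dataset can be checked by counting header lines. The following is a minimal sketch, assuming a plain-text FASTA stream in which each record starts with a `>` line; the function name `fasta_stats` is hypothetical and not part of the benchmark.

```python
import io


def fasta_stats(handle):
    """Return (number_of_sequences, total_residues) for a FASTA text stream.

    Assumption: every record starts with a '>' header line and all other
    lines hold sequence data.
    """
    n_seqs = 0
    n_residues = 0
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            n_seqs += 1          # one header line per sequence
        else:
            n_residues += len(line)
    return n_seqs, n_residues


# Tiny in-memory example; a real dataset would be opened with open(path).
example = io.StringIO(">seq1 demo\nACGT\nAC\n>seq2\nGGG\n")
print(fasta_stats(example))  # prints (2, 9)
```

Reading line by line keeps memory use constant, which matters for the multi-gigabyte datasets.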

All data is available for download at the ↓ links, in NAF format, which was chosen for its good combination of compactness and decompression speed.

Genomes

Genomes are DNA datasets. They typically contain long sequences with relatively few repeats, and therefore compress poorly. All genomes we used are from NCBI Assembly. They were selected more or less at random, with the aim of covering a variety of sizes.

Other DNA datasets

These datasets are highly repetitive, giving the compressors plenty of redundancy to work with.

| Dataset | Size | Number of sequences | Date | ↓ | Source |
|---|---|---|---|---|---|
| Mitochondrion | 245 MB | 9,402 | 2019-03-15 | ↓ (36.1 MB) | Collection of mostly complete mitochondrial genomes from NCBI (1, 2). |
| NCBI Virus Complete Nucleotide Human | 482 MB | 36,745 | 2020-05-11 | ↓ (9.27 MB) | From NCBI Virus |
| Influenza | 1.22 GB | 700,001 | 2019-04-27 | ↓ (13.8 MB) | Entire set of sequences from the Influenza Virus Database (source). |
| Helicobacter | 2.76 GB | 108,292 | 2019-04-24 | ↓ (130 MB) | 1,622 genomes of Helicobacter (including H. pylori), all available at GenBank and RefSeq as of 2019-04-24, obtained from NCBI Assembly. |
| NCBI SARS-CoV-2 random-100k | 3.05 GB | 373,332 | 2022-01-17 | ↓ (5.82 MB) | 100,000 SARS-CoV-2 genomes randomly selected out of the entire set of SARS-CoV-2 genomes downloaded from GenBank on 2022-01-17. Only sequences of at least 25 kbp were used for sampling. |
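
The SARS-CoV-2 subset above was built by filtering on length and then sampling at random. The exact procedure and random seed are not stated here, so the following is only a sketch of that kind of selection: the function name `sample_long_records`, the `seed` parameter and the in-memory list of `(name, sequence)` pairs are all assumptions for illustration.

```python
import random


def sample_long_records(records, min_length, k, seed=0):
    """Pick k records at random among those with at least min_length residues.

    records: iterable of (name, sequence) pairs.
    Hypothetical helper; the benchmark's real sampling code is not shown here.
    """
    eligible = [(name, seq) for name, seq in records if len(seq) >= min_length]
    if k > len(eligible):
        raise ValueError("not enough sequences pass the length filter")
    rng = random.Random(seed)       # fixed seed makes the sample reproducible
    return rng.sample(eligible, k)  # sampling without replacement


# Toy data; the real selection would use min_length=25_000 and k=100_000.
toy = [("a", "A" * 10), ("b", "A" * 3), ("c", "A" * 12), ("d", "A" * 20)]
picked = sample_long_records(toy, min_length=10, k=2)
print([name for name, _ in picked])
```

For inputs too large to hold in memory, reservoir sampling over a streamed FASTA file would be the usual alternative.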

RNA datasets

These are highly repetitive single-gene RNA databases.

| Dataset | Size | Number of sequences | Date | ↓ | Source |
|---|---|---|---|---|---|
| SILVA 132 LSURef | 610 MB | 198,843 | 2017-12-11 | ↓ (12.6 MB) | source link |
| SILVA 132 SSURef Nr99 | 1.11 GB | 695,171 | 2017-12-11 | ↓ (41.7 MB) | source link |
| SILVA 132 SSURef | 3.28 GB | 2,090,668 | 2017-12-11 | ↓ (78.1 MB) | source link |
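
To get a feel for how repetitive these datasets are, the listed sizes can be turned into rough compression ratios. The sketch below assumes that the size in parentheses next to each ↓ is the size of the NAF download and that MB and GB are decimal units (1 GB = 1000 MB); neither is stated explicitly here, and the helper names are hypothetical.

```python
# Assumption: decimal units, 1 GB = 1000 MB.
UNIT_IN_MB = {"MB": 1.0, "GB": 1000.0}


def size_in_mb(text):
    """Convert a size string such as '1.11 GB' to megabytes."""
    value, unit = text.split()
    return float(value) * UNIT_IN_MB[unit]


def compression_ratio(original, compressed):
    """Ratio of uncompressed size to compressed size."""
    return size_in_mb(original) / size_in_mb(compressed)


# (dataset, uncompressed size, download size) as listed for the RNA datasets.
rna_datasets = [
    ("SILVA 132 LSURef", "610 MB", "12.6 MB"),
    ("SILVA 132 SSURef Nr99", "1.11 GB", "41.7 MB"),
    ("SILVA 132 SSURef", "3.28 GB", "78.1 MB"),
]

for name, original, compressed in rna_datasets:
    print(f"{name}: about {compression_ratio(original, compressed):.0f}x")
```

Because the listed sizes are rounded to three significant digits, the resulting ratios are approximate.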

Multiple Sequence Alignments

Aligned DNA sequences stored in FASTA format.

| Dataset | Size | Number of sequences | Date | ↓ | Source |
|---|---|---|---|---|---|
| UCSC hg38 7way knownCanonical-exonNuc | 340 MB | 1,470,154 | 2014-06-06 | ↓ (44.5 MB) | Alignments of 6 vertebrate genomes with human for CDS regions. Downloaded from UCSC, source (removed empty lines to normalize the FASTA format). |
| UCSC hg38 20way knownCanonical-exonNuc | 969 MB | 4,211,940 | 2015-06-30 | ↓ (75.9 MB) | Alignments of 19 mammalian genomes with human for CDS regions. Downloaded from UCSC, source (removed empty lines to normalize the FASTA format). |
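
The normalization mentioned for the UCSC alignments, removing empty lines, can be sketched as a simple streaming filter. This is an illustration only: the function name `drop_empty_lines` is hypothetical, and treating whitespace-only lines as empty is an assumption rather than a documented detail of how the datasets were prepared.

```python
import io


def drop_empty_lines(src, dst):
    """Copy FASTA text from src to dst, skipping empty lines.

    Returns the number of lines removed. Lines containing only whitespace
    are treated as empty (an assumption of this sketch).
    """
    removed = 0
    for line in src:
        if line.strip() == "":
            removed += 1
            continue
        dst.write(line)
    return removed


# In-memory example; real use would pass two open file handles.
raw = io.StringIO(">seq1\nATG---\n\n>seq2\nATGCCC\n\n")
clean = io.StringIO()
print(drop_empty_lines(raw, clean))  # prints 2
print(clean.getvalue(), end="")
```

Alignment gap characters such as `-` are left untouched, so aligned sequences keep their columns.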

Protein databases

Protein data compresses more poorly than DNA/RNA, and is sometimes even called incompressible. Nevertheless, protein databases are important datasets too. Unfortunately, most sequence compressors don't support protein data.

Dataset

Size

Number of
sequences

Date

↓

Source

PDB

67.6 MB

109,914

2019-04-09

↓

(10.1 MB)