Sequence Compression Benchmark

Test datasets

Every test dataset is a single FASTA-formatted file. We tried to include a broad variety of commonly used DNA, RNA and protein datasets. Suggestions for additional interesting datasets are welcome!
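As a rough illustration of what "a single FASTA-formatted file" means in practice, the sketch below counts the sequences and residues in a FASTA stream. It assumes the common convention that header lines start with `>` and every other line is sequence data. The function name `fasta_summary` and the toy records are made up for this example and are not part of the benchmark.

```python
import io

def fasta_summary(handle):
    """Return (number of sequences, total residues) for a FASTA stream."""
    n_sequences = 0
    n_residues = 0
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A header line opens a new record.
            n_sequences += 1
        else:
            # Any other line is counted as sequence data.
            n_residues += len(line)
    return n_sequences, n_residues

# Hypothetical two-record input, only for demonstration.
example = io.StringIO(">seq1 demo\nACGT\nAC\n>seq2\nGGG\n")
print(fasta_summary(example))  # (2, 9)
```

For a file on disk, the same function can be called on an open text handle, for example `fasta_summary(open(path))`, where `path` points to a decompressed dataset.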

All datasets are available for download via the links, in the naf format, which was chosen for its good combination of compactness and decompression speed.

Genomes

Genomes are DNA datasets. They typically contain long sequences with relatively few repetitions, so they compress poorly. All genomes we used are from NCBI Assembly. They were selected more or less randomly, with the aim of covering a variety of sizes.

| Category | Organism | Accession | Size | Source |
|---|---|---|---|---|
| Virus | Gordonia phage GAL1 | GCF_001884535.1 | 50.7 kB (12.4 kB) | NCBI FTP |
| Bacteria | WS1 bacterium JGI 0000059-K21 | GCA_000398605.1 | 522 kB (125 kB) | NCBI FTP |
| Protist | Astrammina rara | GCA_000211355.2 | 1.71 MB (384 kB) | NCBI FTP |
| Fungi | Nosema ceranae | GCA_000988165.1 | 5.81 MB (1.37 MB) | NCBI FTP |
| Protist | Cryptosporidium parvum Iowa II | GCA_000165345.1 | 9.22 MB (2.29 MB) | NCBI FTP |
| Protist | Spironucleus salmonicida | GCA_000497125.1 | 13.1 MB (3.15 MB) | NCBI FTP |
| Protist | Tieghemostelium lacteum | GCA_001606155.1 | 23.7 MB (5.71 MB) | NCBI FTP |
| Fungus | Fusarium graminearum PH-1 | GCF_000240135.3 | 36.9 MB (9.32 MB) | NCBI FTP |
| Protist | Salpingoeca rosetta | GCA_000188695.1 | 56.2 MB (12.3 MB) | NCBI FTP |
| Algae | Chondrus crispus | GCA_000350225.2 | 106 MB (16.0 MB) | NCBI FTP |
| Algae | Kappaphycus alvarezii | GCA_002205965.2 | 341 MB (66.1 MB) | NCBI FTP |
| Animal | Strongylocentrotus purpuratus | GCF_000002235.4 | 1.01 GB (193 MB) | NCBI FTP |
| Animal | Homo sapiens | GCA_000001405.28 | 3.31 GB (691 MB) | NCBI FTP |
| Plant | Picea abies | GCA_900067695.1 | 13.4 GB (2.31 GB) | NCBI FTP |
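Each row of the table above lists two sizes, the second in parentheses. The sketch below computes the ratio between them for a few rows. It rests on two assumptions that the page does not state explicitly: that the parenthesized figure is the size of the naf download, and that the units are decimal (1 kB = 1000 bytes). The helper names `parse_size` and `size_ratio` are made up for this example.

```python
# Decimal (SI) multipliers are an assumption; the page does not say
# whether "kB" means 1000 or 1024 bytes.
UNITS = {"B": 1, "kB": 10**3, "MB": 10**6, "GB": 10**9}

def parse_size(text):
    """Convert a size such as '50.7 kB' or '(12.4 kB)' to bytes."""
    value, unit = text.strip("() ").split()
    return float(value) * UNITS[unit]

def size_ratio(listed, in_parentheses):
    """Ratio of the listed size to the parenthesized size."""
    return parse_size(listed) / parse_size(in_parentheses)

# Three rows copied from the genome table.
rows = [
    ("Gordonia phage GAL1", "50.7 kB", "(12.4 kB)"),
    ("Homo sapiens", "3.31 GB", "(691 MB)"),
    ("Picea abies", "13.4 GB", "(2.31 GB)"),
]
for organism, listed, smaller in rows:
    print(f"{organism}: {size_ratio(listed, smaller):.2f}x")
```

Because the table rounds sizes to three significant digits, the ratios are only approximate.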

Other DNA datasets

These datasets are highly repetitive, giving the compressors plenty of redundancy to work with.

| Dataset | Size | Number of sequences | Date | Source |
|---|---|---|---|---|
| Mitochondrion | 245 MB (36.1 MB) | 9,402 | 2019-03-15 | Collection of mostly complete mitochondrial genomes from NCBI (1, 2). |
| NCBI Virus Complete Nucleotide Human | 482 MB (9.27 MB) | 36,745 | 2020-05-11 | From NCBI Virus |
| Influenza | 1.22 GB (13.8 MB) | 700,001 | 2019-04-27 | Entire set of sequences from the Influenza Virus Database (source). |
| Helicobacter | 2.76 GB (130 MB) | 108,292 | 2019-04-24 | 1,622 genomes of Helicobacter (including H. pylori), all available at GenBank and RefSeq as of 2019-04-24, obtained from NCBI Assembly. |
| NCBI SARS-CoV-2 random-100k | 3.05 GB (5.82 MB) | 100,000 | 2022-01-17 | 100,000 SARS-CoV-2 genomes randomly selected out of the entire set of SARS-CoV-2 genomes downloaded from GenBank on 2022-01-17. Only sequences of at least 25 kbp were used for sampling. |
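The SARS-CoV-2 entry describes a two-step selection: keep only sequences of at least 25 kbp, then draw 100,000 of them at random. A minimal sketch of that idea follows. It assumes sampling without replacement and that the records fit in memory as (name, sequence) pairs. The source names neither the tool nor the exact procedure, so `sample_long_sequences` and the toy records are purely illustrative.

```python
import random

def sample_long_sequences(records, min_length, sample_size, seed=None):
    """Filter (name, sequence) pairs by length, then sample without replacement."""
    eligible = [(name, seq) for name, seq in records if len(seq) >= min_length]
    if sample_size > len(eligible):
        raise ValueError(
            f"only {len(eligible)} sequences are at least {min_length} long"
        )
    # A private Random instance keeps the draw reproducible for a given seed.
    return random.Random(seed).sample(eligible, sample_size)

# Toy stand-ins for real genomes, with lengths 10, 30, 40, 5 and 35.
toy_records = [
    ("a", "A" * 10),
    ("b", "C" * 30),
    ("c", "G" * 40),
    ("d", "T" * 5),
    ("e", "A" * 35),
]
picked = sample_long_sequences(toy_records, min_length=25, sample_size=2, seed=0)
print(sorted(name for name, _ in picked))
```

For the dataset described above, the corresponding call would use `min_length=25_000` and `sample_size=100_000`.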

RNA datasets

These are highly repetitive single-gene RNA databases.

| Dataset | Size | Number of sequences | Date | Source |
|---|---|---|---|---|
| SILVA 132 LSURef | 610 MB (12.6 MB) | 198,843 | 2017-12-11 | source link |
| SILVA 132 SSURef Nr99 | 1.11 GB (41.7 MB) | 695,171 | 2017-12-11 | source link |
| SILVA 132 SSURef | 3.28 GB (78.1 MB) | 2,090,668 | 2017-12-11 | source link |

Multiple Sequence Alignments

Aligned DNA sequences stored in FASTA format.

| Dataset | Size | Number of sequences | Date | Source |
|---|---|---|---|---|
| UCSC hg38 7way knownCanonical-exonNuc | 340 MB (44.5 MB) | 1,470,154 | 2014-06-06 | Alignments of 6 vertebrate genomes with human for CDS regions. Downloaded from UCSC, source (removed empty lines to normalize the FASTA format). |
| UCSC hg38 20way knownCanonical-exonNuc | 969 MB (75.9 MB) | 4,211,940 | 2015-06-30 | Alignments of 19 mammalian genomes with human for CDS regions. Downloaded from UCSC, source (removed empty lines to normalize the FASTA format). |
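Both alignment datasets were normalized by removing empty lines. The sketch below shows one way such a step could look. It treats whitespace-only lines as empty, which is an assumption, and the function name `drop_empty_lines` and the record names in the demo are made up. The source does not say which tool was actually used.

```python
import io

def drop_empty_lines(src, dst):
    """Copy src to dst, skipping empty lines; return how many were skipped."""
    skipped = 0
    for line in src:
        if line.strip():
            dst.write(line)
        else:
            # Whitespace-only lines are treated as empty (an assumption).
            skipped += 1
    return skipped

# Hypothetical alignment snippet with blank lines between records.
raw = io.StringIO(">hg38_exon1\nATGGCC\n\n>mm10_exon1\nATGGCT\n\n")
clean = io.StringIO()
print(drop_empty_lines(raw, clean))  # 2
print(clean.getvalue(), end="")
```

The function streams line by line, so it should also work on files of several hundred megabytes when given open file handles instead of `io.StringIO` objects.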

Protein databases

Protein data compresses more poorly than DNA/RNA (and is sometimes even called incompressible). Protein datasets are nonetheless important. Unfortunately, most sequence compressors don't support protein data.

| Dataset | Size | Number of sequences | Date | Source |
|---|---|---|---|---|
| PDB | 67.6 MB (10.1 MB) | 109,914 | 2019-04-09 | source link |
| Homo sapiens GRCh38 peptides all | 73.2 MB (8.47 MB) | 105,961 | 2019-03-12 | source link |
| NCBI Virus RefSeq Protein | 122 MB (29.9 MB) | 373,332 | 2020-05-10 | From NCBI Virus |
| UniProtKB Reviewed (Swiss-Prot) | 277 MB (57.4 MB) | 560,118 | 2019-04-02 | From UniProt, release 2019_04, source link |