November 2, 2022

Data-structures for sets of k-mer sets: what’s new since 2020

Our paper Data structures based on k-mers for querying large collections of sequencing data sets is two years old, and although the core concepts we described are still relevant, new ideas and methods have emerged or strengthened since. Thus I wanted to keep you up with the literature in a short post.

These structures store collections of datasets (often samples, can also be sets of contigs or genomes) using their k-mers. They are used to perform k-mer queries in all samples at once, thus provide an interesting pre-filter for large collections before more costly methods such as mapping are applied. The presented methods are designed for short read samples or relatively small/medium k-mer sizes.

The landscape of sets of k-mer sets

In the first figure of the survey, we had chosen to present the different implemented methods in a timeline. I’ve been trying to keep it updated. At the time, we narrowed the paper to a defined set of tools, although it is clear that frontiers are blurred between some data-structures. Here I try to present a larger view. Despite trying my best, these things are never exhaustive, it is possible that I’ve forgot to mention tools.

Please note that tools for pangenomics graphs such as VG, minigraph and others do not appear here. They differ mainly because here the expected input can be unordered, while pangenomic representations include coordinates.

drawing

The paper was also organized around a dichotomy between two types of structures. First, color-aggregative methods, that index the union set of all distinct k-mers of the collection, before associating colors with these k-mers (colored de Bruijn graphs usually fall in this category). Second, k-mer aggregative methods index each data-set separately, then build another data-structure on top to distribute queries.

Today, I would rather present the methods in a 2 dimensional fashion:

drawing

There are more exact methods than inexact ones. SeqOthello remains an alien (the single red point in its category).

Main novelties

Beyond presence/absence

Following Reindeer, other structures started to allow the recording of k-mer features that are different from presence/absence of a dataset, e.g., abundance, contig id… Needle is the first inexact data-structure with this goal, and was released in 2022.
Metagraph appeared shortly after the survey, followed by an additional paper for recording other k-mer features in the data-structure. It was a breakthrough because of the amount of data that they could index. I would say the approach is a bit different from e.g. BiFrost or Reindeer: Metagraph’s team have computed numerous indexes that are available for the public on their website, while other tools are maybe more meant to be installed and run locally.

More to the inexact structures landscape

We’ve briefly mentioned Dream-Yara’s interleaved Bloom filter (IBF) technique back in 2020. To me, they are very close to what COBS proposed with the inverted matrix of Bloom filters, the difference being the organization of bit vectors and some optimization choices. The team has developped this technique and several papers came out: Raptor, Needle and a preprint this year.
With PAC, we proposed an attempt to conciliate the interesting query complexity properties of the SBTs and the efficient inverted index inspired by document retrieval, that COBS introduced.
The inverted index has started to hybridize with other techniques, such as sketching in this work.
Satellite algorithms and data-structures for inexact structures emerged, such as quick Bloom filter construction and filtering or false positive reductions in Bloom filters.

Improvements in inner components

We could say that the spectrum preserving string set (SPSS) concept matured, with many novel contributions, some of which help with storing efficiently the k-mer sets in our data-structures.
SPSS also inspire new developments, such as reference sequence representation in Pufferfish2
Some of the exact methods/colored de Bruijn graphs relied on efficient hashing and notably minimal perfect hashing, which has seen many improvements in two years (see here for the latest). Along with these work, the question of a better compression of the features that are associated with k-mers starts to emerge (see here). Compression is certainly one of the next big challenge for these structures.
BiFrost has been one of the most successful datastructures for buildling compressed colored de Bruijn graphs in the last years, with notably an innovative unitig construction step. GGCAT appears a nice improvement. Bentley-Saxe Mantis brought more scalability and dynamicity.
The field is composed of many combinations of fast evolving data-structures and ideas. A great idea is to enable mixing and matching different modules to see which combination works well. Well good news is, this is a WIP. You can learn more on the Piscem initiative in this thread