November 2, 2022

Data-structures for sets of k-mer sets: what’s new since 2020

Our paper Data structures based on k-mers for querying large collections of sequencing data sets is two years old, and although the core concepts we described are still relevant, new ideas and methods have emerged or strengthened since. Thus I wanted to keep you up with the literature in a short post.

These structures store collections of datasets (often samples, can also be sets of contigs or genomes) using their k-mers. They are used to perform k-mer queries in all samples at once, thus provide an interesting pre-filter for large collections before more costly methods such as mapping are applied. The presented methods are designed for short read samples or relatively small/medium k-mer sizes.

The landscape of sets of k-mer sets

In the first figure of the survey, we had chosen to present the different implemented methods in a timeline. I’ve been trying to keep it updated. At the time, we narrowed the paper to a defined set of tools, although it is clear that frontiers are blurred between some data-structures. Here I try to present a larger view. Despite trying my best, these things are never exhaustive, it is possible that I’ve forgot to mention tools.

Please note that tools for pangenomics graphs such as VG, minigraph and others do not appear here. They differ mainly because here the expected input can be unordered, while pangenomic representations include coordinates.

drawing

The paper was also organized around a dichotomy between two types of structures. First, color-aggregative methods, that index the union set of all distinct k-mers of the collection, before associating colors with these k-mers (colored de Bruijn graphs usually fall in this category). Second, k-mer aggregative methods index each data-set separately, then build another data-structure on top to distribute queries.

Today, I would rather present the methods in a 2 dimensional fashion:

drawing

There are more exact methods than inexact ones. SeqOthello remains an alien (the single red point in its category).

Main novelties

Beyond presence/absence

More to the inexact structures landscape

Improvements in inner components