March 27, 2024

DOI

Data-structures for sets of sequences: a third update

This post starts with the exploration initiated in an earlier paper, and blog post, and today I’ll present the fresh contributions of 2023-2024. Inspired by a moment of reflection shared on by Lior Pachter about his blog, I’ve decided to incorporate DOIs in my posts so they can be cited.

For those diving into this post, a prerequisite is the understanding of sequence bioinformatics concepts, such as the BWT, minimizers, de Bruijn graphs (you can start with the previous citations).

Today’s discussion revolves around the following figure. I adopted a solution that lays out data structures on twin discs, because honeslty, a standard tree or a table takes just too much space to be correctly organised. The left disc shows structures enabling k-mer set representation. Representation is a vague word, I make an attempt at categorizing what the structures actually do using colors. I utilize several disc radius to illustrate how fundamental concepts such as hashing or bit vectors are derived and associated with useful tools (eg minimizers) to obtain the final structures, that appear on the outermost radius.

drawing

In the right disc, I show colored k-mer set collections. “Colored” in this context means each k-mer set bears metadata, typically denoting its dataset of origin (it is totally different from the colors I use in the legend). It quite natural that colored k-mer set collections integrate structures from the left disc, hence some of the methods in the inner circles of the right disc already appear in the left one.

Each disc comes with a color legend, to detail features:

Insights

There are, in fact, many ways to categorize the structures, this is one attempt. The disc organization revolves around filiation between data structures, while the colors reflect more end user experience on features. But features are numerous. For instance, the capacity to extract other superstrings of k-mer of interest, such as simplitigs or eulertigs, is not represented, but for instance possible in GGCAT.

Similarly, the filiation can be discussed. Many contributions are a patchwork of ideas. Some are hybrids (eg PAC or BQF). Navigating the BWT landscape is difficult (at least for me), and I hope it will be the next topic of a future post. For instance here, I omit relatedness to Wheeler graphs. It is noteworthy that some of the presented indexes based on the BWT can represent sequences longer than the k-mers, that is the case for MOVI and SPUMONI 1 and 2 (other structures are specialized towards k-mers).

Finally, note that some concepts are disseminated in so many works that I could not totally factorize them, this is especially the case for minimizers.

Detailing the left disc:

On the right disc:

Global notes:

References of newest papers:

Structures that can handle k-mer sets: 

Structures for colored sets: 

Edit 03/28: Simon Puglisi reminded me to add Bahar Alipanahi to the list of contributing women.