An invertible transform for efficient string matching in labeled directed graphs

‍

Abstract:

The typical software pipeline for processing a human DNA sequencing sample begins by mapping reads to a previously assembled single linear reference---that is, a collection of long strings of nucleotides approximating one genome. This permits determining variation in the sample with respect to the reference, including the single nucleotide variants (SNVs), insertions, and deletions that contribute to phenotypes. However, when a read differs substantially from the reference due the presence of many SNVs, very long insertions or deletions, repeats, or structural variation such as inversions and translocations, it is often left unmapped, systematically biasing downstream analyses.

Pangenomics mitigates this issue by mapping reads to a representation of a diverse array of genomes rather than just one genome. This representation can take the form of an edge-labeled directed graph in which walks capture possible nucleotide sequences. A graph is potentially also useful for helping secure highly identifying variants in encoded genomes through obfuscation, where a variant is easily discerned only if a read in a new sample maps to it.

In this talk, I discuss a novel compressed index of an arbitrary edge-labeled directed graph for efficiently locating the terminal vertices of all walks matching a query sequence of labels. The index is a precise analog to the Burrows-Wheeler transform-based FM-index, which I also review, and which has for over a decade served as the workhorse for software mapping reads to single linear references. A component of the framework I propose is a random-like sequence that is the arbitrary-length analog to a de Bruijn sequence, which is of interest in combinatorics independently of pattern matching.

Bio:

Abhinav Nellore is an assistant professor specializing in computational biology between the Department of Biomedical Engineering and the Department of Surgery at Oregon Health & Science University. His group has worked on a variety of problems such as improving the accessibility and queryability of publicly available genomics data through uniform reanalysis, uncovering molecular biomarkers to aid in early cancer detection, understanding genetic correlates of susceptibility to and severity of viruses including SARS-CoV-2, and developing tools for aligning DNA sequencing reads to graph representations of genomes. He received a PhD in physics from Princeton University and a BS in physics from the University of Maryland, College Park. He further held positions as an operations research consultant, a mobile app developer, and a postdoctoral scholar in UCSF's Institute for Human Genetics and subsequently between Johns Hopkins University's Department of Computer Science and Department of Biostatistics.

Galois was pleased to host this tech talk for the public on June 21, 2023. See the video recording above.

An invertible transform for efficient string matching in labeled directed graphs

May 31, 2023

Abstract:

RELATED ARTICLES