This article originally appeared in the 2022 issue of Rice Engineering Magazine.
Imagine it’s a rainy Sunday afternoon and you are working on a 5,000-piece jigsaw puzzle. The average puzzle aficionado might be intimidated, but you’re a seasoned pro who breezes through 1,000-piece puzzles. You’re ready to go to work when you notice the cover of the puzzle box is blank. Then you see that many pieces are a single solid color, and some have frayed, broken edges. You say to yourself, “There’s no way I can put this together.” As you return the pieces to the box, you find a note on which is written: “Welcome to the impossible puzzle! You have three hours to finish it.”
COVID-19 started out as such a puzzle, which underscored the need for computational approaches and tools capable of detecting pathogens directly from environmental microbiomes — communities of bacteria and viruses found in soil, rivers and wastewater. A metagenome is simply the collection of genomes specific to a given ecosystem. Active monitoring for potential pandemic-causing pathogens directly from metagenomes is a key tool in any early warning system for future pandemics. Microbes have always been present among humans but remain a largely untapped predictor of future pandemics. Given its vast potential, wastewater monitoring is already being leveraged across the globe to track SARS-CoV-2 variants.
The puzzle analogy captures many of the computational and public health challenges of identifying SARS-CoV-2 in wastewater metagenomes. The vast number of puzzle pieces represents the relatively short stretches of DNA that sequencing machines can affordably and efficiently read from a given metagenome. The pieces in fact belong to many puzzles mixed together, which represent the thousands of distinct and often closely related bacteria and viruses. The blank box cover represents the many unknown and previously unseen pathogens in environmental microbiomes. The monochromatic pieces are the closely related species and strains inhabiting a community. The frayed edges represent errors introduced by DNA sequencing machines, which make it difficult to find overlapping edges. The time limit for assembling the puzzle represents the urgency of obtaining results.
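To make the frayed-edge problem concrete, here is a minimal sketch of how two short reads might be joined by their overlapping ends, and how a single sequencing error can hide that overlap unless mismatches are tolerated. The function name `best_overlap`, its parameters and the toy sequences are illustrative assumptions, not part of any published assembler; real tools rely on far more efficient indexing than this brute-force comparison.

```python
def best_overlap(left, right, min_len=4, max_mismatches=1):
    """Length of the longest suffix of `left` that matches a prefix of
    `right` with at most `max_mismatches` mismatches (0 if none found).

    Hypothetical helper for illustration only.
    """
    longest = min(len(left), len(right))
    # Try the longest candidate overlap first, then shrink.
    for size in range(longest, min_len - 1, -1):
        suffix = left[-size:]
        prefix = right[:size]
        mismatches = sum(a != b for a, b in zip(suffix, prefix))
        if mismatches <= max_mismatches:
            return size
    return 0


# Toy reads (made up): the second read starts where the first one ends.
clean = best_overlap("ACGTTGCA", "TTGCAGGT")

# The same pair with one simulated sequencing error in the second read.
noisy = best_overlap("ACGTTGCA", "TTGAAGGT", max_mismatches=0)
tolerant = best_overlap("ACGTTGCA", "TTGAAGGT", max_mismatches=1)

print(clean, noisy, tolerant)
```

With exact matching the damaged pair no longer fits together, while allowing one mismatch recovers the overlap. The trade-off is that a looser tolerance also makes it easier to join pieces from closely related genomes by mistake, which is the monochromatic-piece problem.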
Computer science plays a critical role in addressing such complications. Advances in computational methods for genome assembly resulted in the completion of the human genome in 2022, 20 years after the first assembly of the genome was announced. Advances in DNA sequencing, coupled with improved computational approaches for piecing together the puzzle, yielded the first truly complete version. However, limitations in reading long stretches of individual genomes within complex microbial communities leave the task of assembling individual SARS-CoV-2 variants from wastewater computationally intractable. Computer scientists have often employed data-driven probabilistic approaches and heuristics to puzzle together individual variants, or to delineate conditions that help make the impossible possible. One such example is using the active prevalence of specific SARS-CoV-2 variants of concern to narrow the search space. Another is identifying signature mutations of a specific lineage, or including sequence data to reduce the number of puzzle pieces.
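The signature-mutation idea can be sketched in a few lines: rather than assembling whole genomes, score each candidate lineage by the fraction of its signature mutations seen in a sample. This is a simplified illustration under stated assumptions, not the method used by any particular tool. The lineage names, mutation labels and the `signature_support` helper are all hypothetical placeholders.

```python
def signature_support(observed, signatures):
    """Fraction of each lineage's signature mutations found in the sample.

    `observed` is a set of mutation labels detected in a sample;
    `signatures` maps a lineage name to its set of signature mutations.
    Hypothetical helper for illustration only.
    """
    return {
        lineage: len(observed & muts) / len(muts)
        for lineage, muts in signatures.items()
        if muts  # skip lineages with no signature mutations defined
    }


# Made-up signature sets and sample; the labels carry no biological meaning.
signatures = {
    "lineage_X": {"m1", "m2", "m3", "m4"},
    "lineage_Y": {"m3", "m5"},
}
observed = {"m1", "m2", "m3", "m9"}

support = signature_support(observed, signatures)
ranked = sorted(support, key=support.get, reverse=True)
print(support, ranked)
```

Restricting the candidate lineages to those currently circulating would shrink `signatures` before scoring, which is one way to narrow the search space. A realistic approach would also need to weigh read depth and mutations shared between lineages, which this sketch ignores.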
In collaboration with Rice environmental engineer Lauren Stadler, statistics professor Kathy Ensor and Houston Health Department scientist Loren Hopkins, my group has created tools capable of completing SARS-CoV-2 variant of concern (VoC) puzzles efficiently and accurately. The overarching goal of my group is to build reliable computational methods and pipelines to help prevent the next pandemic. Remaining challenges include detecting yet-unseen pathogens, reducing the time needed to detect known pathogens and pursuing computational approaches that can assemble individual puzzles representing pathogens of interest within multitudes of environmental samples. This may seem to resemble that impossible puzzle, but in the words of Nelson Mandela, “It always seems impossible until it’s done.”
Engineers in their own words: Rice computer scientist Todd Treangen