Genomics Institute tool becomes primary method to identify lineages of COVID-19 worldwide

ecerf@ucsc.edu (Emily Cerf)

As COVID-19 continues to mutate, software developed and maintained at the University of California, Santa Cruz’s Genomics Institute will now be at the core of the primary tool health officials worldwide use to track the spread of variants in their communities. It is now the default software behind the widely used tool Pangolin, replacing earlier software in order to assign genomic samples of COVID-19 more accurately to a known branch on the virus’s family tree.

The software tool, called Ultrafast Sample Placement on Existing tRees (UShER), is used to maintain and update a phylogenetic tree, a diagram of the virus’s evolution, covering the more than 1,500 lineages that have been identified as descending from an early genomic sequence collected in Wuhan, China in late 2019. Most public health organizations, including the Centers for Disease Control and Prevention (CDC) and the California Department of Public Health, will now use this tool on a daily basis. It is a key element in tracking Omicron’s BA.2 sublineage, which is now the dominant strain in the U.S.

“This lineage system is the primary way we track the virus right now, it’s the number one thing we’re looking at to see where it’s spreading, where it’s coming from,” said Russ Corbett-Detig, an assistant professor of biomolecular engineering at the Baskin School of Engineering and co-creator of the tool. “When new important variants show up, such as Omicron, this Pango-lineage assignment is the way you [track spread] – so that's the central component of this process.”

Using UShER

When a public health official runs COVID-19 genomic samples through UShER, the tool reports where on the broader phylogenetic tree each sample fits, which in turn tells the official which lineages of the virus are present in their community. UShER then generates subtrees showing the most closely related genomic samples, which is useful for understanding the virus in a specific population.
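The idea of placing a sample on an existing tree can be illustrated with a toy sketch. This is not UShER's actual algorithm or data format: the node names, mutation sets, and lineage labels below are small hypothetical examples, and the sketch simply picks the tree node whose mutations differ least from the sample's.

```python
# Toy illustration only: not UShER's actual algorithm or data format.
# Each hypothetical tree node is described by the set of mutations it carries
# relative to the reference genome, plus an assumed lineage label.
TREE = {
    "root": {"mutations": set(), "lineage": "B"},
    "node_a": {"mutations": {"C241T", "A23403G"}, "lineage": "B.1"},
    "node_b": {"mutations": {"C241T", "A23403G", "G28881A"}, "lineage": "B.1.1"},
    "node_c": {"mutations": {"C8782T", "T28144C"}, "lineage": "A"},
}


def place_sample(sample_mutations, tree):
    """Return (node name, lineage, mismatch count) for the closest node."""

    def mismatches(node):
        # Mutations found in only one of the two sets (symmetric difference).
        return len(sample_mutations ^ tree[node]["mutations"])

    # Sort node names so that ties are broken deterministically.
    best = min(sorted(tree), key=mismatches)
    return best, tree[best]["lineage"], mismatches(best)


# A hypothetical sample carrying one mutation not yet seen on the tree.
sample = {"C241T", "A23403G", "G28881A", "C10029T"}
node, lineage, unexplained = place_sample(sample, TREE)
print(f"placed at {node} (lineage {lineage}), {unexplained} unexplained mutation(s)")
```

In this toy case the sample lands on the B.1.1 node, and its one extra mutation would become a new branch below that node.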

The tree currently represents about 8.8 million genomic samples of COVID-19, a scale never before seen in phylogenetics. UCSC researchers worked to enable UShER to handle this unprecedented amount of data computationally, and to do so in real time.

“All of our work is built on UShER’s ability to rapidly add a new genome sequence to a very large tree of genome sequences,” said UCSC Bioinformatics Programmer Angie Hinrichs. “So at UCSC, we've been maintaining a gigantic tree of all of the SARS-CoV-2 genomes, and it's up to 8.8 million genomes and growing – it's just unprecedented.”

Hinrichs updates the tree daily to include all new genomes that researchers worldwide share through data repositories such as GISAID and the International Nucleotide Sequence Database Collaboration (INSDC), adding new branches for previously unseen mutations of the virus along with many sequences identical to existing branches.

Some branches show both new mutations and epidemiological significance, such as appearing in a new geographic location or growing faster than other branches. In these instances, the Pango lineage curation team may give the branch a name (such as B.1.1.7, which the World Health Organization later designated the Alpha variant), and public health agencies like the CDC or the WHO may later identify it as a variant of concern.

UShER played a crucial role in this way in identifying the contagious Omicron variant responsible for the recent global surge in cases: a researcher uploaded Omicron sequences to the UShER tree and noticed that they formed a radically diverged new branch.

Public health officials can also use information from UShER to trace chains of transmission in superspreader events by comparing the genomic sequences of samples. If the sequences are similar, one person is likely spreading the virus to many; if the sequences are very different, there are probably multiple unrelated infections.
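The comparison described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the method health agencies or UShER actually use: the sequences are short made-up strings that are already aligned, and the cutoff `max_differences=2` is a hypothetical threshold chosen only for the example.

```python
from itertools import combinations


def count_differences(seq_a, seq_b):
    """Count positions where two aligned sequences differ, skipping unknown bases ('N')."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b and "N" not in (a, b))


def likely_single_source(samples, max_differences=2):
    """True if every pair of samples differs by at most max_differences.

    max_differences is a hypothetical cutoff for illustration only.
    """
    return all(
        count_differences(a, b) <= max_differences
        for a, b in combinations(samples.values(), 2)
    )


# Made-up, pre-aligned toy sequences from two imagined events.
event_one = {"p1": "ACGTACGTAC", "p2": "ACGTACGTAC", "p3": "ACGTACGTAT"}
event_two = {"p1": "ACGTACGTAC", "p2": "TCGAACCTAG"}

print(likely_single_source(event_one))  # near-identical samples
print(likely_single_source(event_two))  # clearly divergent samples
```

Real genomes are roughly 30,000 bases long and must be aligned first, so an actual analysis would rely on the tree placement rather than a raw pairwise count like this one.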

“Health officials, and by effect, the public at large, will greatly benefit from this revolutionary method of tracing COVID-19’s mutations,” said Alexander Wolf, dean of the Baskin School of Engineering. “The Genomics Institute continues to be at the forefront of building software essential for the understanding of disease.”

Sizing up the software

UShER was developed in the early days of the pandemic when Corbett-Detig and UC San Diego assistant professor and former Genomics Institute postdoctoral scholar Yatish Turakhia realized the world would need an efficient, scalable way to track COVID-19’s family history. Along with their collaborators, they created the centralized resource so that anyone around the world can upload data and use the tool for free. 

“This only works because it is a massively distributed effort where lots and lots of people are sharing genome sequences and contributing to the development of bioinformatics tools to analyze them,” Corbett-Detig said. “It's an absolutely amazing thing and I think it’s exciting to contribute.” 

Now, UShER is the default software behind the widely used tool Pangolin, which was developed at the University of Edinburgh to determine which lineage of the virus a given genome sequence belongs to. Previous versions of Pangolin defaulted to different software called PangoLEARN, which used machine learning, a type of artificial intelligence, to do the same work.

A preprint comparing UShER and PangoLEARN posted on Virological, a discussion forum for the analysis of virus genomes, showed that the UCSC software was much more consistent and accurate at placing genomic samples on the COVID-19 family tree. The results of this comparison prompted the Pangolin team to offer to make UShER the default software in the next major update to the tool. 

“That’s been my dream for over a year,” Hinrichs said. “I think phylogenetic trees are the way this should be done, and I’m really grateful that the Pangolin team has been willing to work with us on that.”

The UShER phylogenetic tree is also easier to interpret, both visually and conceptually, than PangoLEARN’s machine learning model, which is an extremely complicated algorithm that works as a black box.

Crucially, UShER’s lineage assignments are more stable: it is less likely to retroactively switch the lineage assigned to a sample when the tool is updated, something that happened frequently with PangoLEARN.

“For [public health officials] obviously it's a big pain point, because if the lineage designation of their samples keep switching then they just don't know how to interpret it,” Turakhia said. “That could end up wasting a lot of their valuable hours. So they need to be able to trust the tool, and I definitely feel that [UShER] having stability and being fairly accurate, that should make them more comfortable.”
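One way to picture this kind of instability is to compare the lineage labels the same samples receive before and after a tool update. The sketch below is a hypothetical illustration, not part of Pangolin or UShER: the sample IDs and assignments are made up, and the function only counts how many labels changed between two runs.

```python
def switched_samples(before, after):
    """Return {sample: (old lineage, new lineage)} for samples whose label changed.

    Only samples present in both runs are compared.
    """
    shared = sorted(before.keys() & after.keys())
    return {s: (before[s], after[s]) for s in shared if before[s] != after[s]}


# Made-up assignments from two hypothetical runs of a lineage tool.
run_old = {"s1": "BA.2", "s2": "BA.1", "s3": "B.1.1.529"}
run_new = {"s1": "BA.2", "s2": "BA.1.1", "s3": "B.1.1.529", "s4": "BA.2"}

changed = switched_samples(run_old, run_new)
shared_count = len(run_old.keys() & run_new.keys())
print(f"{len(changed)} of {shared_count} shared samples switched lineage: {changed}")
```

A lower switch count across updates means less rework for the officials interpreting the results.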

While the PangoLEARN machine learning model assigns samples to lineages faster, it is more likely to make mistakes. The extra time is well worth the much more precise lineage assignment that UShER offers, said Corbett-Detig, given that the difference is negligible in the overall process of gathering COVID-19 genomic samples.

UShER’s freely available tree is also used to power other COVID-19 bioinformatics tools, such as Taxonium and CoV-Spectrum, that allow researchers and public health officials to visualize and understand the virus’s family tree.

In the future, the researchers hope to continue optimizing the software to increase the accuracy of lineage assignment and make it faster. Turakhia also hopes to see the approach generalized to other infectious diseases.