In March 2020, when the WHO declared a pandemic, the public sequence database GISAID held 524 covid sequences. Over the next month scientists uploaded 6,000 more. By the end of May, the total was over 35,000. (In contrast, scientists worldwide added 40,000 flu sequences to GISAID in all of 2019.)
“Without a name, forget about it—we cannot understand what other people are saying,” says Anderson Brito, a postdoc in genomic epidemiology at the Yale School of Public Health, who contributes to the Pango effort.
As the number of covid sequences spiraled, researchers trying to study them were forced to create entirely new infrastructure and standards on the fly. A universal naming system has been one of the most important elements of this effort: without it, scientists would struggle to talk to each other about how the virus’s descendants are traveling and changing—either to flag up a question or, even more critically, to sound the alarm.
Where Pango came from
In April 2020, a handful of prominent virologists in the UK and Australia proposed a system of letters and numbers for naming lineages, or new branches, of the covid family. It had a logic, and a hierarchy, even though the names it generated—like B.1.1.7—were a bit of a mouthful.
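To make the hierarchy concrete, here is a minimal sketch of how a dotted name like B.1.1.7 can be read as a path through the family tree, with each dot marking one step down from a parent lineage. The function names `ancestors` and `is_descendant` are hypothetical, chosen for illustration. The sketch also ignores parts of the real scheme, such as shorthand aliases for long names.

```python
from typing import List


def ancestors(lineage: str) -> List[str]:
    """Return the parent lineages of a dotted name, nearest parent first.

    Illustrative only: treats the name as a plain dotted path and
    does not handle aliases or other special cases.
    """
    parts = lineage.split(".")
    # Drop one trailing component at a time: B.1.1.7 -> B.1.1 -> B.1 -> B
    return [".".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]


def is_descendant(child: str, parent: str) -> bool:
    """True if `child` sits somewhere below `parent` in the dotted hierarchy."""
    # The trailing dot prevents B.1.17 from matching as a child of B.1.1.
    return child.startswith(parent + ".")


lineage_path = ancestors("B.1.1.7")
```

Read this way, B.1.1.7 is a branch of B.1.1, which is a branch of B.1, which descends from B.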
One of the authors on the paper was Áine O’Toole, a PhD candidate at the University of Edinburgh. Soon she’d become the primary person actually sorting and classifying sequences into lineages, eventually combing through hundreds of thousands of sequences by hand.
She says: “Very early on, it was just who was available to curate the sequences. That ended up being my job for a good bit. I guess I never understood quite the scale we were going to get to.”
She quickly set about building software to assign new genomes to the right lineages. Not long after that, another researcher, postdoc Emily Scher, built a machine-learning algorithm to speed things up even more.
They named the software Pangolin, a tongue-in-cheek reference to a debate about the animal origin of covid. (The whole system is now simply known as Pango.)
The naming system, along with the software to implement it, quickly became a global essential. Although the WHO has recently started using Greek letters for variants that seem especially concerning, like delta, those nicknames are for the public and the media. Delta actually refers to a growing family of variants, which scientists call by their more precise Pango names: B.1.617.2, AY.1, AY.2, and AY.3.
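The relationship between a public nickname and the precise names behind it can be pictured as a simple lookup table. The sketch below uses only the four delta names listed above. The table `WHO_LABELS` and the function `who_label` are hypothetical names for illustration, not part of any official tool, and a real table would need updating as lineages are added.

```python
from typing import Optional

# Nickname -> Pango names, using only the delta family named in the text.
WHO_LABELS = {
    "delta": {"B.1.617.2", "AY.1", "AY.2", "AY.3"},
}


def who_label(pango_name: str) -> Optional[str]:
    """Return the public nickname for a Pango name, or None if it has none here."""
    for label, lineages in WHO_LABELS.items():
        if pango_name in lineages:
            return label
    return None


label = who_label("AY.2")
```

Most Pango names have no nickname at all, so for them the lookup comes back empty.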
“When alpha emerged in the UK, Pango made it very easy for us to look for those mutations in our genomes to see if we had that lineage in our country too,” says Jolly. “Ever since then, Pango has been used as the baseline for reporting and surveillance of variants in India.”
Because Pango offers a rational, orderly approach to what would otherwise be chaos, it may forever change the way scientists name viral strains—allowing experts from all over the world to work together with a shared vocabulary. Brito says: “Most likely, this will be a format we’ll use for tracking any other new virus.”
Many of the foundational tools for tracking covid genomes have been developed and maintained by early-career scientists like O’Toole and Scher over the last year and a half. As the need for worldwide covid collaboration exploded, scientists rushed to support it with ad hoc infrastructure like Pango. Much of that work fell to tech-savvy young researchers in their 20s and 30s. They used informal networks and tools that were open source—meaning they were free to use, and anyone could volunteer to add tweaks and improvements.
“The people on the cutting edge of new technologies tend to be grad students and postdocs,” says Angie Hinrichs, a bioinformatician at UC Santa Cruz who joined the Pangolin project earlier this year. For example, O’Toole and Scher work in the lab of Andrew Rambaut, a genomic epidemiologist who posted the first public covid sequences online after receiving them from Chinese scientists. “They just happened to be perfectly placed to provide these tools that became absolutely critical,” Hinrichs says.
It hasn’t been easy. For most of 2020, O’Toole took on the bulk of the responsibility for identifying and naming new lineages by herself. The university was shuttered, but she and another of Rambaut’s PhD students, Verity Hill, got permission to come into the office. Her commute, walking 40 minutes to school from the apartment where she lived alone, gave her some sense of normalcy.
Every few weeks, O’Toole would download the entire covid repository from the GISAID database, which had grown exponentially each time. Then she would hunt around for groups of genomes with mutations that looked similar, or things that looked odd and might have been mislabeled.
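As a rough picture of that kind of hunt, the sketch below buckets genomes by the exact set of mutations they carry, so that identical profiles land together and one-off profiles stand out. It is an assumption-laden toy, not a description of how O’Toole actually worked: the sample names and mutation lists are made up, and the function `group_by_mutations` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Set


def group_by_mutations(genomes: Dict[str, Set[str]]) -> Dict[FrozenSet[str], List[str]]:
    """Group sample IDs that share exactly the same set of mutations."""
    groups = defaultdict(list)
    for sample_id, mutations in genomes.items():
        # frozenset makes the mutation profile usable as a dictionary key.
        groups[frozenset(mutations)].append(sample_id)
    return dict(groups)


# Made-up samples for illustration only.
genomes = {
    "sample_1": {"D614G", "N501Y"},
    "sample_2": {"D614G", "N501Y"},
    "sample_3": {"D614G"},
}
groups = group_by_mutations(genomes)
```

A group with many members hints at a shared branch, while a lone sample may be either something new or something mislabeled, which is where human judgment comes in.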
When she got particularly stuck, Hill, Rambaut, and other members of the lab would pitch in to discuss the designations. But the grunt work fell on her.
Deciding when descendants of the virus deserve a new family name can be as much art as science. It was a painstaking process, sifting through an unheard-of number of genomes and asking time and again: Is this a new variant of covid or not?
“It was pretty tedious,” she says. “But it was always really humbling. Imagine going through 20,000 sequences from 100 different places in the world. I saw sequences from places I’d never even heard of.”
As time went on, O’Toole struggled to keep up with the volume of new genomes to sort and name.
In June 2020, there were over 57,000 sequences stored in the GISAID database, and O’Toole had sorted them into 39 variants. By November 2020, a month after she was supposed to turn in her thesis, O’Toole took her last solo run through the data. It took her 10 days to go through all the sequences, which by then numbered 200,000. (Although covid has overshadowed her research on other viruses, she’s putting a chapter on Pango in her thesis.)
Fortunately, the Pango software is built to be collaborative, and others have stepped up. An online community (the one that Jolly turned to when she noticed the variant sweeping across India) sprouted and grew. This year, O’Toole’s work has been much more hands-off. New lineages are now designated mostly when epidemiologists around the world contact O’Toole and the rest of the team through Twitter, email, or GitHub, her preferred method.
“Now it’s more reactionary,” says O’Toole. “If a group of researchers somewhere in the world is working on some data and they believe they’ve identified a new lineage, they can put in a request.”
The deluge of data has continued. This past spring, the team held a “pangothon,” a sort of hackathon in which they sorted 800,000 sequences into around 1,200 lineages.
“We gave ourselves three solid days,” says O’Toole. “It took two weeks.”
Since then, the Pango team has recruited a few more volunteers, like UCSC researcher Hinrichs and Yale researcher Brito, who both got involved initially by adding their two cents on Twitter and the GitHub page. A postdoc at the University of Cambridge, Chris Ruis, has turned his attention to helping O’Toole clear out the backlog of GitHub requests.
O’Toole recently asked them to formally join the organization as part of the newly created Pango Network Lineage Designation Committee, which discusses and makes decisions about variant names. Another committee, which includes lab leader Rambaut, makes higher-level decisions.
“We’ve got a website, and an email that’s not just my email,” O’Toole says. “It’s become a lot more formalized, and I think that will really help it scale.”
A few cracks around the edges have started to show as the data has grown. As of today, there are nearly 2.5 million covid sequences in GISAID, which the Pango team has split into 1,300 branches. Each branch corresponds to a variant. Of those, eight are ones to watch, according to the WHO.
With so much to process, the software is starting to buckle. Things are getting mislabeled. Many strains look similar, because the virus evolves the most advantageous mutations over and over again.
As a stopgap measure, the team has built new software that uses a different sorting method and can catch things that Pango may miss.