The United States keeps a list of international travelers considered a grave threat to the health and security of the country. But this list isn’t about terrorism (at least not explicitly); it’s about microbiology. The pathogen priority list, maintained by the National Institute of Allergy and Infectious Diseases, names the viruses, bacteria and other microorganisms of greatest concern, in order to focus research and public health efforts on the most dangerous suspects. To aid this effort, Computation Institute scientists have worked for a decade on an open database of all available information about the bacteria of interest on the list, creating a massive online resource for the world’s most wanted bugs.
When the Pathosystems Resource Integration Center, aka PATRIC, launched in 2004, it presented a mere eight genomes in what a new paper for Nucleic Acids Research describes as “limited integration.” Today, four years after merging with the CI’s National Microbial Pathogen Data Resource, it contains over 10,000 annotated genomes for bacterial pathogens and their less harmful relatives, as well as “omics” data and a full computational toolbox for researchers to compare species and genes. Organized into a rich user interface (developed by a team at Virginia Tech) to help navigate through the deep pool of data, the resource supports efforts to develop new drugs, fight drug resistance and quickly respond to outbreaks of disease.
For a test run, type the disease of your choice into the search box. Search for “leprosy,” and you’ll find three genomes from bacteria associated with the disease, including one from a Mycobacterium leprae TN isolate passaged through an armadillo in Tamil Nadu, India. Click on any genome to access a wealth of data about it, including the species’ place in the phylogenetic tree, a genome browser, information about protein pathways and protein-protein interactions, transcriptomics data, and recently published literature about the bacterium.
It would be impossible to manually enter all of this data for each new genome, especially as ever-faster sequencing produces more and more published genomes. So behind the scenes, the PATRIC team built an intricate, automated pipeline that pulls in genomic data as it is published to the GenBank and RefSeq repositories, then annotates each genome using RAST and SEED servers at Argonne National Laboratory. The process not only saves time, it ensures that all genomes in the PATRIC database are annotated via the same process, allowing for more valid comparisons. Genomes are also re-annotated roughly every two months, to incorporate newly discovered roles for previously uncharacterized genes.
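The loop described above can be sketched in a few lines: every genome passes through one shared annotation step, and stale records are refreshed on a fixed schedule. This is a minimal, purely illustrative sketch; the repository records, annotation function, and genome IDs are toy stand-ins, not PATRIC’s actual services or data model.

```python
# Toy sketch of an ingest-and-refresh annotation pipeline. The annotate()
# function stands in for a RAST/SEED-style service call; real identifiers
# and role vocabularies are far richer than the placeholders here.
from datetime import datetime, timedelta, timezone

REANNOTATION_INTERVAL = timedelta(days=60)  # "roughly every two months"

def annotate(record, now):
    # Stand-in annotation: tag every gene with a role from one shared vocabulary,
    # so all genomes in the database are comparable like-for-like.
    record["annotations"] = {gene: f"role:{gene}" for gene in record["genes"]}
    record["annotated_at"] = now

def pipeline_pass(database, incoming, now):
    # 1. Ingest and annotate newly published genomes.
    for record in incoming:
        annotate(record, now)
        database[record["id"]] = record
    # 2. Re-annotate anything older than the refresh interval, so newly
    #    discovered gene roles propagate to every genome already stored.
    for record in database.values():
        if now - record["annotated_at"] > REANNOTATION_INTERVAL:
            annotate(record, now)

# Toy run: one genome ingested, then refreshed on a later pass 90 days on.
db = {}
t0 = datetime(2014, 1, 1, tzinfo=timezone.utc)
pipeline_pass(db, [{"id": "NC_002677", "genes": ["gyrA", "rpoB"]}], t0)
pipeline_pass(db, [], t0 + timedelta(days=90))
```

Because the refresh uses the same `annotate` step as ingestion, every record in the database always reflects the current version of the controlled vocabulary, which is the consistency property the article highlights.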
“One of the big advantages of this environment is that all genomes have annotations from a consistent controlled vocabulary,” said Rick Stevens, CI Senior Fellow, associate laboratory director for computing, environment and life sciences at Argonne National Laboratory, and co-principal investigator on PATRIC with Bruno Sobral of Virginia Bioinformatics Institute.
Currently, researchers can set up a private workspace on PATRIC to run their own analyses and record the results in a “digital laboratory notebook.” Soon, the team will allow users to B.Y.O.G. -- Bring Your Own Genome -- in order to use the PATRIC annotation and comparison tools on data that isn’t yet in the public system.
PATRIC researchers are also working to implement rapid response pipelines that will be useful during urgent cases of disease outbreaks. When a new bug appears, such as the 2011 E. coli outbreak that killed more than 50 people in Germany, researchers quickly sequence it and look for clues about its virulence. While PATRIC can already be used to compare the genome of the new pathogen to previously known samples in order to find important functional differences or weaknesses, the analysis is manual and slow. Stevens said that a new “Rapid Response” function will automate those comparisons whenever a new genome is uploaded to the resource, quickly providing those crucial reports to public health specialists.
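At its core, the kind of comparison such a report automates can be expressed as a set difference over annotated gene functions: which functions does the new isolate carry that no known relative does? The sketch below is a hedged illustration of that idea only; the genome names and gene sets are invented, and PATRIC’s real analysis is far more involved.

```python
# Illustrative outbreak triage: list annotated functions present in a new
# isolate but absent from all previously known reference genomes.
# All strain names and gene sets below are hypothetical examples.

def novel_functions(new_genome, references):
    """Return functions found in the new genome but in no reference genome."""
    seen = set().union(*references.values()) if references else set()
    return sorted(set(new_genome) - seen)

references = {
    "E. coli K-12":    {"lacZ", "gyrA", "rpoB"},
    "E. coli O157:H7": {"gyrA", "rpoB", "stx2"},
}
outbreak_strain = {"gyrA", "rpoB", "stx2", "aggR"}  # hypothetical new isolate

print(novel_functions(outbreak_strain, references))  # → ['aggR']
```

A real pipeline would compare sequences and pathways, not just annotation labels, but the output has the same flavor as the “automatic summary” Stevens describes: a short list of what makes the new strain different.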
[A map of the 2011 Escherichia coli O104:H4 bacterial outbreak, from Wikimedia Commons]
“You get this automatic summary, almost like an automatic Nature paper, that says, new strain A, this is what makes it different,” Stevens said. “That's what a lot of the public health people want; something that can quickly sort out how bad is this thing we're dealing with.”
But the utility of PATRIC goes beyond fighting disease. Many of the bacteria included in the database are not dangerous to humans under normal conditions, but are useful comparisons to their pathogenic counterparts. Pulling data from these more docile species, some scientists use the database for research unrelated to infectious disease, Stevens said, drawing upon the data and computational tools to explore ocean life, biofuel production, and other areas of microbiological interest.
“Because the tool is very general, and we don’t place any restrictions on what genomes are used in any given analysis, it can be used to explore many scientific areas,” Stevens said. “So that's kind of cool.”
While the resource’s expansion in its first decade from a handful of genomes to over 10,000 was dramatic, Stevens expects cheap sequencing to boost the rate even faster over the next few years. One project alone, looking at different strains of tuberculosis to study its genetic variation, is expected to publish 3,000 genomes in the next few months that will be sucked into the PATRIC database. As the numbers get higher and higher, and the Most Wanted Bacteria list gets longer and longer, new challenges in computation and user interface will appear.
“At some point with 100,000 genomes or a million genomes, we might end up being more selective, just because you can't navigate easily on a website with 10,000 or 100,000 items on a list,” Stevens said. “It might be possible to compute it, but it would be impossible to look at -- it's got more rows and columns than you have pixels on the screen. So we're going to need clever ways of choosing representatives or visualization or new ways to navigate through the data as it becomes larger. That's going to be really interesting.”