In the big data race, biology and medicine are usually thought to lag far behind physics and astronomy. Particle accelerators and telescopes can generate petabytes of data each year, and no equivalent large instruments yet exist for the biomedical sciences. But as CI Faculty and Senior Fellow Robert Grossman pointed out in introducing “How Big Data Supports Biomedical Discovery,” his session at the 2014 AAAS Annual Meeting, biology and medical researchers are quickly closing the data gap with a profusion of smaller instruments. The combined output of gene sequencers, advanced imaging, electronic medical records, self-tracking devices, and other technologies could soon produce a data stream even larger than those physicists and astronomers deal with today.
Storing and analyzing this flood of biomedical data will require new large-scale computational infrastructures that enable collaborative research while protecting patient privacy. At the session, Grossman and two other speakers presented overlapping projects that seek to channel these data sources into the discovery of new drugs and a more complete understanding of cancer and other diseases.
The tranSMART system, presented by Gilbert Omenn of the University of Michigan, originated in Europe in 2007, created by pharmaceutical researchers at Johnson & Johnson. The idea was to build a central repository for preclinical data from industry and academia -- information that might otherwise have ended up “in a drawer,” and that could save researchers from pursuing blind alleys that have already failed. Eventually, tranSMART opened up its data, went open source, and spread to the United States, where organizations such as the Food & Drug Administration adopted the system.
“The goal here is to make a difference in discovery and drug development and in understanding and better patient care,” Omenn said. “It is an overall concept that matches NIH objectives: from big data to knowledge to action.”
An equally ambitious effort was presented by Lincoln Stein of the Ontario Institute for Cancer Research, who talked about creating a data coordinating center for the International Cancer Genome Consortium (ICGC). The ICGC seeks to sequence the genomes of 25,000 cancer patients and their tumors, looking for mutations and other differences between each patient’s “normal” genome and that of the tumor. So far, the collaboration has collected samples from 10,000 donors, and when the collection is complete, Stein estimates there will be as much as 15 petabytes of data.
(Omenn, Stein, and Grossman take questions at the AAAS 2014 Meeting. Photo by Kevin Jiang.)
Storing and analyzing a dataset of that size is not currently feasible for any single computing center, so the ICGC hopes to use a distributed system, similar to the worldwide analysis network deployed by CERN’s Large Hadron Collider. In a pilot project, called the Whole Genome Pan-Cancer Analysis Project (PCAP), a smaller -- but still impressive -- dataset of half a petabyte will be split and simultaneously analyzed by six cloud computing centers, including the University of Chicago’s Bionimbus. Eventually, Stein hopes the data will reside permanently in a Cancer Genome Collaboratory, available for future research as analysis methods improve and knowledge about cancer evolves.
If 50,000 genomes -- a tumor and a “normal” genome for each of the ICGC’s 25,000 donors -- sounds like a formidable task, Grossman argued that scientists need to think even bigger: the One Million Genome Challenge. To achieve the statistical power to unlock key discoveries about cancer and other diseases, such gaudy numbers may be necessary. But one million genomes equals one exabyte (1,000 petabytes, or one billion gigabytes) of data, a scale that scientists don’t yet know how to analyze, much less store and move. Yet it’s a worthwhile pursuit, Grossman said, in order to maximize the power of data-driven diagnosis and personalized treatments in medicine.
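The exabyte figure is a back-of-envelope estimate that works out if each genome carries roughly one terabyte of raw sequence data -- an assumed per-genome size used here only for illustration, not a number from the session:

```python
# Back-of-envelope storage estimate for the One Million Genome Challenge.
# The 1 TB-per-genome figure is an assumption for illustration; actual raw
# whole-genome data volumes vary with sequencing depth and file format.
TB_PER_GENOME = 1              # assumed terabytes of raw data per genome
genomes = 1_000_000            # the One Million Genome Challenge

total_tb = genomes * TB_PER_GENOME
total_pb = total_tb // 1_000   # 1 petabyte = 1,000 terabytes
total_eb = total_pb // 1_000   # 1 exabyte  = 1,000 petabytes
total_gb = total_tb * 1_000    # 1 terabyte = 1,000 gigabytes

print(f"{total_pb:,} PB = {total_eb:,} EB = {total_gb:,} GB")
# With these assumptions: 1,000 PB = 1 EB = 1,000,000,000 GB
```

Under that assumption the article’s arithmetic checks out: one million genomes is one exabyte, a thousand times the ICGC’s projected 15-petabyte collection.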
“If you have computing at this scale, you can ask questions that you might not ask otherwise,” Grossman said. “What if we can analyze the data each night, so when we wake up in the morning, we have the best available knowledge to make decisions that day?”
Grossman is helping build the infrastructure underlying this vision with the Bionimbus Protected Data Cloud, which last year became the first cloud-based system approved by the NIH for use with data from the Cancer Genome Atlas. Bionimbus allows approved researchers to quickly access and analyze data, circumventing the considerable bureaucracy and computational expense traditionally needed to work with these large datasets. It may also serve as an important step towards an eventual Biomedical Commons Cloud -- a sort of data “village green” where researchers and organizations around the world can share data on neutral ground, Grossman said. As in physics and astronomy, the most important data-driven discoveries will be made much faster if everyone -- enabled by computation -- works together, not separately.