When Charles Darwin took his historic voyage aboard the HMS Beagle from 1831 to 1836, "big data" was measured in pages. On his travels, the young naturalist produced at least 20 field notebooks, zoological and geological diaries, a catalogue of the thousands of specimens he brought back and a personal journal that would later be turned into The Voyage of the Beagle. But it took more than two decades for Darwin to process all of that information into his theory of natural selection and publish On the Origin of Species.
While biological data may have since transitioned from analog pages to digital bits, extracting knowledge from data has only become more difficult as datasets have grown larger and larger. To wedge open this bottleneck, the University of Chicago Biological Sciences Division and the Computation Institute launched their very own Beagle -- a 150-teraflop Cray XE6 supercomputer that ranks among the most powerful machines dedicated to biomedical research. Since the Beagle's debut in 2010, over 300 researchers from across the University have run more than 80 projects on the system, yielding over 30 publications.
"We haven't had to beat the bushes for users; we went up to 100 percent usage on day one, and have held pretty steady since that time," said CI director Ian Foster in his opening remarks. "Supercomputers have a reputation as being hard to use, but because of the Beagle team's efforts, because the machine is well engineered, and because the community was ready for it, we've really seen rapid uptake of the computer."
A sampler of those projects was on display last week as part of the first Day of the Beagle symposium, an exploration of scientific discovery on the supercomputer. The projects ranged from the very big -- networks of genes, regulators and diseases built by UIC's Yves Lussier -- to the very small -- atomic models of molecular motion in immunological factors, cell structures and cancer drugs. Beagle's flexibility in handling projects from across the landscape of biology and medicine ably demonstrated how computation has solidified into a key branch of research in these disciplines alongside traditional theory and experimentation.
In the day's first research talk, Kazutaka Takahashi of the Department of Organismal Biology showed how science can move fluidly between these realms. Takahashi studies how the neurons of the brain's motor cortex behave during eating, using a very tiny electrode array that can record dozens of neurons simultaneously. But recording is just the beginning -- the analysis required to untangle how these neurons connect and influence each other during a meal is far more than your everyday computer can handle. Takahashi said the software was originally set up to analyze no more than five neurons at a time, and trying to do 70 at once would take months on a desktop PC. Moving their analysis to Beagle freed up the researchers to more rapidly tease out the neural network and relate it to different stages of chewing and swallowing, experiments that may someday help stroke sufferers regain normal eating ability.
Other Beagle projects are dedicated to finding new medical treatments in data that's already been collected. Lussier, a former CI fellow, is constructing enormous networks of disease using data from the Human Genome Project, the ENCODE study of gene regulatory elements and clinical research to find new genetic and pathway targets in different types of cancer. Only a supercomputer can sort through the thousands of different possible combinations and hypotheses -- one recent analysis required around 2 million core hours to complete, a task that took Beagle only 20 days as compared to approximately 14 years on a desktop. Lussier hopes to use similar approaches on complex diseases such as diabetes, funneling the rising tide of genomic data into "the medicine of tomorrow."
At the other extreme, CI Senior Fellow Benoit Roux is running molecular dynamics simulations on Beagle to study the activity of an already-proven treatment: Gleevec. One of the first successful targeted cancer therapies, Gleevec was developed in the 1990s to treat certain types of leukemia by switching off an overactive protein. Roux's models simulate the motion of individual atoms to examine exactly how the drug binds its target and investigate Gleevec's chemical relatives to see why they differ in their effectiveness. The investigations could lead to the design of better drugs, or help physicians circumvent the drug resistance that develops after prolonged use.
Other researchers at Day of the Beagle described the molecular worlds that they are simulating on the supercomputer. Edwin Munro's research on the elaborate choreography of the cytoskeletal elements actin and myosin revealed the strategies used by cells to re-shape, squeeze and split themselves. Greg Tietjen used computer models and X-ray scattering experiments to study how the Tim protein family recognizes different combinations of lipid membrane proteins and makes decisions as part of the immune response. Molecular dynamics models can even be a teaching tool, as Esmael Jafari Haddadian explained in his talk on using Beagle in an undergraduate course on quantitative biology. For their final project, students picked a protein of interest -- such as yellow mealworm beetle antifreeze protein -- and modeled its behavior in a simulated solution.
But perhaps the biggest computational challenge facing modern biologists is how to manage and make sense of the flow of genomic data as genetic sequencing becomes cheaper and more routine. Jason Pitt, a student in the laboratory of CI Senior Fellow Kevin White, described the SwiftSeq workflow for the parallel processing of terabytes of raw data from the Cancer Genome Atlas on Beagle, which reduced compute time by as much as 75 percent. Another data pipeline, MegaSeq, was the subject of Megan Puckelwartz's talk, which focused on whole-genome sequencing for studying rare genetic variants associated with the heart disease cardiomyopathy.
"Beagle is ideal for whole genome analysis," Puckelwartz said. "Other people are done after their first run, but Beagle allows us to extract the most data and continuously mine that data as new methods of analysis come available."
For more information on Beagle and the science it enables, visit the Beagle website.