A vast amount of scientific knowledge is inaccessible to the scientific community because small laboratories lack the computational resources or tools to share or analyze their experimental results. With a new grant from the National Science Foundation, the Computation Institute will collaborate with leading institutions to look for ways that software can bring this data out of hiding, revealing untapped value in the "long tail" of scientific research.
The one-year, $500,000 planning grant enables investigators at the Computation Institute; the University of California, Los Angeles; the University of Arizona; the University of Washington; and the University of Southern California to lay the groundwork for a proposed Institute for Empowering Long Tail Research as part of the NSF's Scientific Software Innovation Institutes program. Researchers will engage with scientists from fields such as biodiversity, economics and metagenomics to determine the best solutions for the increasingly challenging data and computational demands on smaller laboratories.
"With limited resources and expertise, even simple data discovery, collection, analysis, management, and sharing tasks are difficult for small teams," said Ian Foster, director of the Computation Institute. "This project represents the first significant effort to understand and articulate these researchers' needs and translate them into a coherent roadmap for future research."
Mathematically, the long tail is the multitude of low-probability events distant from the mean in a statistical distribution. Businesses such as Amazon and Netflix have found value in the long tail by offering huge inventories of products that may each be purchased by only a few customers, but that create large profits in aggregate.
In scientific terms, the long tail is made up of the funded projects handled by individual laboratories or small groups of researchers, as opposed to large, expensive collaborations. An analysis of National Science Foundation funding in 2007 found that grants smaller than $1 million represented 80 percent of total dollars awarded and 98 percent of all awards.
But while these smaller projects make up the vast majority of American science, they are handicapped by a lack of resources dedicated to the storage, analysis and sharing of data. Large collaborations, such as the ENCODE project studying functional genomic elements, have dedicated funding, staff and technology to handle these cyberinfrastructure demands. Smaller budgets, however, leave neither room nor time to address these logistics.
"For these small teams, the growing importance of cyberinfrastructure and its applications in discovery and innovation is as much problem as opportunity," said Bill Howe, Affiliate Assistant Professor in the Department of Computer Science and Engineering at the University of Washington. "An unfortunate consequence is that in this 'long tail' of science, modern computational methods often are not exploited, much valuable data goes unshared, and too much time is consumed by routine tasks."
One example of untapped long tail potential is the phenomenon of "dark data," experimental results that exist only within the computers – or even the desk drawers – of individual researchers. Only a fraction of the data collected by a laboratory will eventually be published and accessible to the wider scientific community. But unpublished raw data, reflecting both positive and negative results, may be useful to other scientists studying similar topics to inform their own studies or prevent redundant experiments.
"There may only be a few scientists worldwide that would want to see a particular boutique data set, but there are many thousands of these data sets," said Bryan Heidorn, Director of the School of Information Resources and Library Science at the University of Arizona. "Access to these data sets can have a very substantial impact on science. In fact, it seems likely that transformative science is more likely to come from the tail than the head."
To find the best solutions for the long tail, the collaboration will engage with researchers studying biodiversity, deep sea biospheres, economics, metagenomics and other subjects. Investigators will work with specific research groups in these fields, survey researchers in workshops at international conferences, and test available software options in a select group of laboratories. These efforts will determine the most pressing cyberinfrastructure needs across scientific fields, as well as how to drive acceptance and sustained use of software proposed to meet those needs.
Once the cross-cutting needs of researchers have been identified, the collaboration will apply for a subsequent grant to establish an NSF-funded Institute for Empowering Long Tail Research that can refer scientists in various fields to the most effective and sustainable solutions. One promising approach is the use of "software-as-a-service" (SaaS) tools, which outsource complex IT tasks to third-party providers. Existing SaaS tools designed for scientific use, such as Globus Online and SciFlex, may meet the demands of these researchers, or may be customized to suit a particular field's needs.
"We believe that a Scientific Software Innovation Institute focused on long tail research can have a transformative impact on US science," Foster said. "By accelerating discovery and innovation in those small laboratories where most research occurs, we can increase total research output, strengthen the powerhouses of US research, and motivate and prepare students to participate more effectively in research careers."
In addition to Foster, Howe and Heidorn, principal investigators include Christine Borgman of the University of California, Los Angeles, and Carl Kesselman of the University of Southern California. The work is funded by the National Science Foundation, grant #1216872.