9 October 2012 afternoon
Roger Barga, PhD
Microsoft Research (MSR)
A defining characteristic of modern life in enterprise, scientific, and technical computing is the incredible proliferation of digital information. According to the Harvard Business Review, more data was generated last year than in all of previous human history. A recent article in The Economist estimates that the amount of information created is growing at 60% annually. Companies such as Microsoft, Facebook, and Google regularly run distributed applications on compute clusters to deliver insights and results from some of the largest data sets ever generated. There has been an explosion in available data-parallel runtimes, such as Hadoop and Dryad, and in scalable data management systems designed to handle big data and support deep analytics. The research disciplines on the front lines of this data deluge include astronomy, bioscience, geoscience, and the social sciences. Science communities will learn how to manage this morass of data by leveraging techniques and best practices pioneered by industry and, more importantly, by inventing new techniques that meet the specific demands of their scientific disciplines.
This tutorial aims to clarify some of the critical concepts in the design space of big data and scalable data analytics, and to share lessons from industry that readily apply to science applications. We will provide a detailed tour of the primary technologies and systems powering the analytic and scalable data management landscape in industry, identify appropriate systems for a specific set of application requirements in science and technical computing, and illustrate how they can be used to develop big data applications using practical, real-world examples such as weblog processing, distributed machine learning over large data sets, and sensor data processing.
The intended audience of the tutorial includes project leads, researchers, application developers, and practitioners with a keen interest in system support for large-scale, data-intensive applications. The expected main learning outcomes are: (a) an increased understanding of available data-parallel runtimes such as Hadoop, and of the situations and domains in which they are suitable for large-scale data analysis; (b) in-depth coverage of advanced scalable storage and processing techniques (several of which have not yet been transferred into commercial systems and are available as open source) and of promising avenues for academic and industrial research; and (c) a survey of the state of the art in industry for performing data analysis over extreme-scale data sets, along with promising science applications enabled by those systems. We organize the tutorial into three main technical parts (in bold) as outlined below.