9 October 2012 afternoon
Roger Barga, PhD
Microsoft Research (MSR)
A defining characteristic of modern life in enterprise, scientific and technical computing is the incredible proliferation of digital information. According to the Harvard Business Review, more data was generated last year than in all of previous human history. A recent article in the Economist estimates that the amount of information created is growing at 60% annually. Companies such as Microsoft, Facebook and Google regularly run distributed applications on compute clusters to deliver insights and results from some of the largest data sets ever generated. There has been an explosion in available data parallel runtimes such as Hadoop and Dryad, and in scalable data management systems designed to handle big data and support deep analytics. Research disciplines on the front lines of this data deluge include astronomy, bioscience, geoscience, and the social sciences. Science communities will learn to manage this morass of data by leveraging techniques and best practices pioneered by industry and, more importantly, by inventing new techniques that meet the specific demands of their scientific disciplines.
This tutorial aims to clarify critical concepts in the design space of big data and scalable data analytics and to share lessons from industry that readily apply to science applications. We will provide a detailed tour of the primary technologies and systems powering the analytic and scalable data management landscape in industry, identify appropriate systems for specific application requirements in science and technical computing, and illustrate how they can be used to develop big data applications using practical, real-world examples such as weblog processing, distributed machine learning over large data sets, and sensor data processing.
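To make the weblog-processing example concrete, the sketch below illustrates the map/reduce pattern that runtimes like Hadoop apply at cluster scale, here run on a single machine. The sample log lines and function names are illustrative assumptions, not material from the tutorial itself; the mapper emits a (status code, 1) pair per request and the reducer sums the counts per key.

```python
from collections import defaultdict

# Hypothetical sample of web server log lines (Common Log Format style);
# a real job would stream these from a distributed file system.
LOG_LINES = [
    '203.0.113.5 - - [09/Oct/2012:13:55:36] "GET /index.html HTTP/1.1" 200 2326',
    '203.0.113.9 - - [09/Oct/2012:13:55:40] "GET /missing.html HTTP/1.1" 404 321',
    '203.0.113.5 - - [09/Oct/2012:13:55:41] "GET /about.html HTTP/1.1" 200 1104',
]

def map_phase(line):
    """Emit a (status_code, 1) pair for one log line, as a mapper would."""
    status = line.split('"')[2].split()[0]  # field after the quoted request
    return (status, 1)

def reduce_phase(pairs):
    """Sum the counts per status code, as a reducer would after the shuffle."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

status_counts = reduce_phase(map_phase(line) for line in LOG_LINES)
print(status_counts)  # e.g. {'200': 2, '404': 1}
```

In a cluster runtime the mapper runs in parallel over partitions of the log, and the framework groups pairs by key before the reducers run; the per-record logic above is all the application author writes.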