8th IEEE International Conference on eScience 2012

Chicago, USA, 8-12 October 2012

Tutorial

Big Data Processing: Lessons from Industry and Applications in Science

9 October 2012 afternoon
Roger Barga, PhD
Microsoft Research (MSR)

A defining characteristic of modern life in enterprise, scientific and technical computing is the incredible proliferation of digital information. According to the Harvard Business Review, more data was generated last year than in all of previous human history. A recent article in The Economist estimates that the amount of information created is growing by 60% annually. Companies such as Microsoft, Facebook and Google regularly run distributed applications on compute clusters to deliver insights and results from some of the largest data sets ever generated. There has been an explosion in available data-parallel runtimes such as Hadoop and Dryad, and in scalable data management systems designed to handle big data and support deep analytics. Research disciplines on the front lines of this data deluge include astronomy, bioscience, geoscience, and the social sciences. Science communities will learn how to manage this morass of data by leveraging techniques and best practices pioneered by industry and, more importantly, by inventing new techniques that meet the specific demands of their scientific disciplines.

This tutorial aims to clarify some of the critical concepts in the design space of big data and scalable data analytics and share lessons from industry that readily apply to science applications. We will provide a detailed tour of the primary technology and systems powering the analytic and scalable data management landscape in industry, identify appropriate systems for a specific set of application requirements in science and technical computing, and illustrate how they can be used to develop big data applications using practical, real-world examples such as weblog processing, distributed machine learning over large data sets, and sensor data processing.
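As a rough illustration of the weblog processing example mentioned above, the following is a minimal single-process sketch of the MapReduce pattern, not material from the tutorial itself. The sample log lines, the simplified log layout (status code as the second-to-last field), and the names map_line, shuffle, reduce_counts, and run_job are all assumptions made for illustration. A runtime such as Hadoop would distribute the same three phases across a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical sample lines in a simplified Common Log Format (assumption).
SAMPLE_LOG = [
    '10.0.0.1 - - [09/Oct/2012:13:55:36 -0500] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.2 - - [09/Oct/2012:13:55:40 -0500] "GET /missing HTTP/1.1" 404 512',
    '10.0.0.1 - - [09/Oct/2012:13:56:01 -0500] "GET /data.csv HTTP/1.1" 200 10240',
]


def map_line(line: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit (status_code, 1) for one log line."""
    fields = line.split()
    # Assumes the status code is the second-to-last field; skip short lines.
    if len(fields) >= 2:
        yield fields[-2], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle phase: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce phase: sum the counts collected for one key."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence within a single process."""
    pairs = (pair for line in lines for pair in map_line(line))
    grouped = shuffle(pairs)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(SAMPLE_LOG))
```

Because the map and reduce functions only see one line or one key at a time, the same logic can in principle be partitioned across many machines, which is the property the data-parallel runtimes covered in the tutorial rely on.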

The intended audience of the tutorial includes project leads, researchers, application developers, and practitioners with a keen interest in system support for large-scale, data-intensive applications. The expected main learning outcomes are: (a) an increased understanding of available data-parallel runtimes such as Hadoop and of the situations and domains in which they are suitable for large-scale data analysis, (b) in-depth coverage of advanced scalable storage and processing techniques (several of which have not yet been transferred into commercial systems and are available as open source) and of promising avenues for academic and industrial research, and (c) a survey of the state of the art in industry for performing data analysis over extreme-scale data sets, along with promising science applications enabled by those systems. We organize the tutorial into three main technical parts (in bold) as outlined below.

Tutorial Outline
  • Opening
    • Rise of Big Data and Scalable Analytics
    • Distributed Data-Parallel Computing
    • Examples of Data Intensive Computing in Industry
    • Common Technologies, Tools and Practices
  • Data Parallel Runtimes
    • MapReduce for Big Data
    • A Closer Look at Hadoop
    • Hybrid Approaches (e.g., Twister, Sector/Sphere)
    • MapReduce vs. SQL
    • Performance Considerations
    • Large-Scale Data Analytics and Machine Learning in MapReduce
  • Data Platforms for Large Scale Applications
    • Key-Value Stores
    • Column Stores
    • Other NoSQL Stores
    • MapReduce in SQL (HadoopDB)
    • Advanced Solutions in Industry
  • Examples of Data Intensive Computing in Science
    • Astronomy
    • Bioinformatics
    • Earth Science
    • Best Practices and Lessons Learned
  • Closing Remarks and Q&A

Important Dates

  • Workshop proposal deadline: 23 January 2012
  • Workshop notification: 6 February 2012
  • Abstract submissions: 4 July 2012 (no longer required)
  • Paper submissions: 18 July 2012 (firm)
  • Paper notification: 22 August 2012
  • Early Results and Works-in-Progress Poster submission: 24 August 2012
  • Early Results and Works-in-Progress Poster author notification: 4 September 2012
  • Camera ready papers due: 10 September 2012
  • Hotel room block registration deadline: 19 September 2012
  • Advance registration deadline: 28 September 2012
  • Microsoft eScience Workshop: 8-9 October 2012
  • Workshops: 8-9 October 2012
  • Conference Sessions: 10-12 October 2012

 

Support provided by:
GOLD LEVEL
Microsoft Research
SILVER LEVEL
CSIRO Australia
BRONZE LEVEL
Argonne National Labs, Indiana University, EMC2
ALSO
University of Chicago, Cray, Computation Institute
MEDIA
HPCwire, iSGTW (International Science Grid This Week)
SPONSORS
IEEE, IEEE Computer Society