When something goes wrong while you're running a program on your personal computer, the worst outcome is typically a reboot and the loss of any unsaved work. But when an application crashes on a supercomputer, the consequences can be much more dramatic. The failure of just one CPU core among hundreds of thousands can interrupt a calculation that takes weeks to complete, wasting valuable time and resources, or corrupt the data output of a simulation, sometimes in hard-to-detect fashion. As these machines have grown larger and more powerful, computer scientists have developed "fault tolerance" strategies to avoid these scenarios, but as the high-performance computing field approaches its next era, the exascale, new methods for resilience will be necessary.
That preemptive troubleshooting was the focus of Argonne computer scientist Franck Cappello's talk at the Computation Institute on February 12th. Even though the first exascale computers -- capable of a quintillion, or a billion billion, calculations per second -- are not expected to come online until at least 2020, Cappello's group at Argonne National Laboratory, the University of Illinois at Urbana-Champaign, and Inria has been working on new resilience approaches since 2009, anticipating and solving problems that the next generation of high-performance computers will present.
Surprisingly, many of the fault tolerance strategies used by supercomputers -- such as rollback recovery, which reverts all processes back to the last saved state -- are the same as those used in everyday smartphones and servers. But with exascale machines expected to contain as many as 10 million cores, these methods will no longer be practical -- imagine having to roll back 9,999,999 healthy cores because of a glitch in one.
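The cost of that global rollback can be seen in a toy sketch. The class and method names below are invented for illustration -- this is the general checkpoint/rollback idea, not any specific supercomputing library:

```python
import copy

class CheckpointedRun:
    """Toy rollback recovery: periodically save all process states,
    and on a fault revert EVERY process to the last checkpoint."""

    def __init__(self, num_procs):
        # Each "process" just accumulates a step counter.
        self.state = [0] * num_procs
        self.checkpoint = copy.deepcopy(self.state)

    def step(self):
        self.state = [s + 1 for s in self.state]

    def save_checkpoint(self):
        self.checkpoint = copy.deepcopy(self.state)

    def fault(self, proc):
        # A failure on ONE process forces all processes back to the
        # last saved state -- the price rollback recovery pays.
        self.state = copy.deepcopy(self.checkpoint)

run = CheckpointedRun(num_procs=4)
for _ in range(3):
    run.step()
run.save_checkpoint()   # state saved as [3, 3, 3, 3]
for _ in range(2):
    run.step()          # state now [5, 5, 5, 5]
run.fault(proc=2)       # a glitch in one core rolls everyone back
print(run.state)        # -> [3, 3, 3, 3]
```

With four processes the wasted recomputation is trivial; scaled to 10 million cores and hours of work between checkpoints, the same scheme becomes prohibitively expensive.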
Developing methods that make efficient use of computing resources, memory, and energy will require solutions tailored to HPC applications, Cappello said.
"If we want to address fault tolerance and resilience at the exascale, we need to be more specific," he said. "We need to look at the applications, we need to look at the system, extract some properties, and then define new mechanisms."
To learn more about the strategies developed by Cappello and his collaborators, such as multi-level checkpointing and detection of "silent" data corruption through prediction, watch the full talk.
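To give a flavor of the prediction-based approach, here is a minimal sketch of detecting silent corruption in a smooth simulation trace: predict each value from its predecessors and flag large deviations. The predictor and threshold below are illustrative assumptions, not the detector described in the talk:

```python
def detect_sdc(series, threshold=0.5):
    """Flag indices where a value deviates from a simple linear
    extrapolation by more than `threshold` -- a toy stand-in for
    prediction-based silent-data-corruption detection."""
    flagged = []
    for i in range(2, len(series)):
        # Predict the next point from the two previous ones.
        predicted = 2 * series[i - 1] - series[i - 2]
        if abs(series[i] - predicted) > threshold:
            flagged.append(i)
    return flagged

# A smoothly varying trace with one silently corrupted value at index 4:
trace = [0.0, 0.1, 0.2, 0.3, 7.3, 0.5, 0.6]
print(detect_sdc(trace))  # -> [4, 5, 6]
```

Note that the corrupted point also poisons the predictions for its neighbors (hence indices 5 and 6 are flagged too); a practical detector would localize the error more carefully, which is part of what makes the research problem interesting.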