Session V: A Zero Tolerance Policy for Failure Acceptance

Component counts and failure rates in extreme-scale systems have not reached the levels anticipated a decade ago. Complacency is setting in, and both interest in and research funding for fault tolerance are waning.

Researchers record and count failures, analyze their distribution, and create doomsday scenarios, but have not been able to convince a general audience of the severity of the problem. One reason may be that, at current levels, these failures are not perceived to have much impact or cause much pain.

Automatic checkpoint and restart systems hide the problem. A user who has to wait an extra day for a 10-day job to complete is busy doing other work and does not notice the time lost. Yet that extra day means 10% more energy used, 10% more money spent on a larger machine than necessary, a 10% opportunity cost in time that could have gone to another job or a bigger problem size, a technician working overtime, and more spare parts bought.

The cost of failures is not always easily quantified and is often diffuse. Research into the true cost and overhead of failures, both today and in the future, is needed to drive home the point that the time for action is now.

Location: Grand Ballroom C
Date: March 28, 2019
Time: 10:50 am - 11:10 am
Rolf Riesen, Intel