Session V: The Role of Network Level Fault Tolerance

The network, as the backbone of parallel systems, must handle errors as part of its regular operation. Such errors may originate in the physical layer or, in the case of Smart Networks, may be algorithmic errors. Shutting down the network in response to such failures would make the system unusable. This presentation gives a brief overview of the approaches used to handle errors and offers a view into possible future enhancements.

Location: Grand Ballroom C
Date: March 28, 2019
Time: 11:50 am - 12:10 pm
Rich Graham, Mellanox