[Beowulf] Redundant Array of Independent Memory - fork(Re: Checkpointing using flash)

Reuti reuti at staff.uni-marburg.de
Tue Sep 25 15:07:31 PDT 2012


On 25.09.2012 at 12:19, Andrew Holway wrote:

> <snip>
> I'm pretty sure faulty hardware is the root cause of our fault
> tolerance problems :). In any case, the main issue seems to be the loss
> of a chunk of your application memory when the node fails, not so much
> the retransmission of messages. MPI has some kind of functionality
> inside to address fault tolerance anyway.

If you are interested: there was a lot of discussion about fault tolerance (FT) in MPI-3. There is a mailing list:

http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft

-- Reuti

