[Beowulf] Redundant Array of Independent Memory - fork(Re: Checkpointing using flash)
Reuti
reuti at staff.uni-marburg.de
Tue Sep 25 15:07:31 PDT 2012
On 25.09.2012 at 12:19, Andrew Holway wrote:
> <snip>
> I'm pretty sure faulty hardware is the root cause of our fault
> tolerance problems :). In any case, the main issue seems to be the loss
> of a chunk of your application memory when the node fails, not so much
> the retransmission of messages. MPI has some kind of functionality
> inside to address fault tolerance anyway.
If you are interested: there was a lot of discussion about fault tolerance (FT) in MPI-3. There is a mailing list for it:
http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
-- Reuti