[Beowulf] [External] Checkpointing MPI applications

Prentice Bisbal pbisbal at pppl.gov
Mon Feb 20 15:46:10 UTC 2023


Chris,

Is anyone working on DMTCP or MANA going to start monitoring the 
dmtcp-forum mailing list? If you remember, I reached out here about 
having trouble getting DMTCP to work, and I posted a call for help on 
that list. I'm still subscribed, and I've seen several other people 
post there, but no one from DMTCP or MANA ever replies.

The DMTCP web page (https://dmtcp.sourceforge.io/contactUs.html) says 
that list is the way to contact the DMTCP developers, but it seems that 
no one monitors it.

Prentice

On 2/18/23 3:42 PM, Christopher Samuel wrote:
> Hi all,
>
> The list has been very quiet recently, so as I just posted something 
> to the Slurm list in reply to the topic of checkpointing MPI 
> applications I thought it might interest a few of you here (apologies 
> if you've already seen it there).
>
> If you're looking to try checkpointing MPI applications you may want 
> to experiment with the MANA ("MPI-Agnostic, Network-Agnostic MPI") 
> plugin for the DMTCP C/R effort here:
>
> https://github.com/mpickpt/mana
>
> We (NERSC) are collaborating with the developers and it is installed 
> on Cori (our older Cray system) for people to experiment with. The 
> documentation for it may be useful to others who'd like to try it out 
> - it's got a nice description of how it works too which even I, as a 
> non-programmer, can understand.
>
> https://docs.nersc.gov/development/checkpoint-restart/mana/
>
> Pay special attention to the caveats in our docs though!
>
> I've not used it myself, though I'm peripherally involved to give 
> advice on system-related issues.
>
> I'm curious if there are other methods that people are using out there 
> for transparent checkpointing of MPI applications?
>
> All the best,
> Chris

