[Beowulf] Kill zombies after a parallel run
Chris Samuel
csamuel at vpac.org
Tue May 2 17:28:20 PDT 2006
On Tuesday 02 May 2006 17:49, mg wrote:
> I use MPICH-1.2.5.2 to generate and run an FEM parallel application.
>
> During a parallel run, one process can crash, leaving the other
> processes running, and OS commands have to be used to kill these zombies.
> So, does someone have a solution to avoid zombies after a failed
> parallel run: can the crashed process kill the other processes?
Wild guess time - is this being launched with PBS/Torque, and is your mpirun
using SSH to launch the jobs?
If that's the case it's not unusual (to quote Tom Jones), and we've seen the
same here at VPAC. What we do is encourage all users to use Pete Wyckoff's
excellent "mpiexec" program (now at version 0.81) at:
http://www.osc.edu/~pw/mpiexec/index.php
This talks directly to PBS using the TM interface - it retrieves the list of
nodes allocated to the job itself (so it does not need to be told how many
processes to start or where) and uses TM to get the moms to launch each
process directly, so they have direct oversight of them.
When one process dies the moms notice and mpiexec gets told, so it can reap
the rest of them.
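For anyone curious about the general idea (one parent launches every process
itself, watches them all, and kills the survivors as soon as one fails), here
is a rough Python sketch. It only illustrates the supervise-and-reap pattern on
a single machine; it is not how mpiexec or the moms are actually implemented,
and run_and_reap and the dummy commands are names made up for this example.

```python
import subprocess
import sys
import time


def run_and_reap(commands, poll_interval=0.1):
    """Start every command; if any exits non-zero, kill the rest.

    Hypothetical helper for illustration only.
    Returns the exit codes in the same order as `commands`.
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]
    while True:
        # poll() returns None while a process is still running.
        codes = [p.poll() for p in procs]
        if any(c not in (None, 0) for c in codes):
            # One process failed: kill everything still alive.
            for p in procs:
                if p.poll() is None:
                    p.kill()
            break
        if all(c is not None for c in codes):
            # Everything finished cleanly.
            break
        time.sleep(poll_interval)
    # wait() collects the exit status so no defunct entries are left behind.
    return [p.wait() for p in procs]


if __name__ == "__main__":
    # Dummy "ranks": one sleeps for a long time, one crashes straight away.
    sleeper = [sys.executable, "-c", "import time; time.sleep(60)"]
    crasher = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print(run_and_reap([sleeper, crasher]))
```

The important part is that a single parent owns all the children, which is
what you lose when the processes are started through SSH and gain back when
the moms start them via TM.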
Best of luck!
Chris
--
Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia