[Beowulf] MPICH-1.2.5 hangs on 16 node cluster
Greg Lindahl
lindahl at pathscale.com
Sun Nov 21 10:10:43 PST 2004
On Fri, Nov 19, 2004 at 02:37:18PM +0530, Sreenivasulu Pulichintala wrote:
> I see some strange behavior of the MPICH stack when running on a 16 node
> cluster.
Is this stock MPICH? If not, you haven't included very much info about
what you're actually running. In any case:
> On node 2
> --------
> #0 0x0000000041efb877 in poll_rdma_buffer ()
> #1 0x0000000041efd2cb in viutil_spinandwaitcq ()
> #2 0x0000000041efba1e in MPID_DeviceCheck ()
> #3 0x0000000041f0a36b in MPID_RecvComplete ()
> #4 0x0000000041f09ead in MPID_RecvDatatype ()
> #5 0x0000000041f03569 in MPI_Recv ()
> #6 0x0000000041eef42d in mpi_recv_ ()
> #7 0x0000000041c0b153 in remdupslave_ ()
> #8 0x000000000000cf6b in ?? ()
> #9 0x000000000000c087 in ?? ()
> #10 0x000000000002f4b4 in ?? ()
> #11 0x000000000000c503 in ?? ()
> #12 0x000000000000c575 in ?? ()
> #13 0x000000000000040c in ?? ()
> #14 0x00000000401ae313 in dynai_ ()
> #15 0x0000000040006d08 in frame_dummy ()
This process seems to be in a Fortran mpi_recv() call and NOT in
MPI_Allreduce. That could be a bug in your program, although it isn't
clear whether this stack trace is itself corrupt.
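For reference, here's a minimal sketch (in C, invented names, not your
code) of the kind of mismatch I mean: one rank sitting in a point-to-point
receive while the rest are in a collective hangs everybody, and a
backtrace on the odd rank out looks a lot like the one above.

    /* deadlock_sketch.c -- hypothetical example, NOT the poster's code.
     * Rank 0 blocks in MPI_Recv waiting for a message nobody sends,
     * while every other rank blocks in MPI_Allreduce waiting for rank 0
     * to join the collective.  All ranks hang. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, in = 1, out = 0, dummy;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Blocks forever: no matching MPI_Send exists anywhere. */
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, &status);
        } else {
            /* Blocks forever: the collective can't complete without rank 0. */
            MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        }

        printf("rank %d finished\n", rank);   /* never reached */
        MPI_Finalize();
        return 0;
    }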
-- greg
p.s. It would be better if you posted to mailing lists in plain
text instead of mixed text and HTML.