[Beowulf] MPICH-1.2.5 hangs on 16 node cluster
Greg Lindahl
lindahl at pathscale.com
Sun Nov 21 10:10:43 PST 2004
On Fri, Nov 19, 2004 at 02:37:18PM +0530, Sreenivasulu Pulichintala wrote:
> I see some strange behavior of the MPICH stack when running on a 16 node
> cluster.
Is this stock MPICH? If not, you haven't included very much info about
what you're actually running. In any case:
> On node 2
> --------
> #0 0x0000000041efb877 in poll_rdma_buffer ()
> #1 0x0000000041efd2cb in viutil_spinandwaitcq ()
> #2 0x0000000041efba1e in MPID_DeviceCheck ()
> #3 0x0000000041f0a36b in MPID_RecvComplete ()
> #4 0x0000000041f09ead in MPID_RecvDatatype ()
> #5 0x0000000041f03569 in MPI_Recv ()
> #6 0x0000000041eef42d in mpi_recv_ ()
> #7 0x0000000041c0b153 in remdupslave_ ()
> #8 0x000000000000cf6b in ?? ()
> #9 0x000000000000c087 in ?? ()
> #10 0x000000000002f4b4 in ?? ()
> #11 0x000000000000c503 in ?? ()
> #12 0x000000000000c575 in ?? ()
> #13 0x000000000000040c in ?? ()
> #14 0x00000000401ae313 in dynai_ ()
> #15 0x0000000040006d08 in frame_dummy ()
This process seems to be in a Fortran mpi_recv() call and NOT in
MPI_Allreduce. That could be a bug in your program, although it isn't
clear whether this stack trace is itself corrupt.
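For reference, here's a minimal sketch (in C, invented names, not your
code) of the kind of mismatch I mean: one rank sitting in a point-to-point
receive while the rest are in a collective hangs everybody, and a
backtrace on the odd rank out looks a lot like the one above.

    /* deadlock_sketch.c -- hypothetical example, NOT the poster's code.
     * Rank 0 blocks in MPI_Recv waiting for a message nobody sends,
     * while every other rank blocks in MPI_Allreduce waiting for rank 0
     * to join the collective.  All ranks hang. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, in = 1, out = 0, dummy;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Blocks forever: no matching MPI_Send exists anywhere. */
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, &status);
        } else {
            /* Blocks forever: the collective can't complete without rank 0. */
            MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        }

        printf("rank %d finished\n", rank);   /* never reached */
        MPI_Finalize();
        return 0;
    }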
-- greg
p.s. It would be better if you posted to mailing lists in plain
text instead of mixed text and HTML.