[Beowulf] mpi alltoall help
Michael Di Domenico
mdidomenico4 at gmail.com
Tue Oct 10 08:58:59 PDT 2017
i posted a copy of this to the openmpi mailing list, but i'm curious if
anyone here can offer suggestions on troubleshooting
---
i'm getting stuck trying to run some fairly large IMB-MPI alltoall
tests under openmpi 2.0.2 on rhel 7.4
i have two different clusters, one running mellanox fdr10 and one
running qlogic qdr
if i issue
mpirun -n 1024 ./IMB-MPI1 -npmin 1024 -iter 1 -mem 2.001 alltoallv
the job just stalls after IMB-MPI prints the
"List of Benchmarks to run: Alltoallv" line
if i switch it to alltoall the test does progress
often when running alltoalls of various sizes i'll get
"too many retries sending message to <>:<>, giving up"
i'm able to use infiniband just fine (our lustre filesystem mounts
over it) and i have other mpi programs running
the problem only seems to show up when i run alltoall-type primitives
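for a rough sense of why alltoall-type primitives load the fabric more
than other patterns, here is a minimal python sketch. it is not part of
IMB or openmpi: the function name and the message sizes are made up for
illustration, and it assumes every rank sends one equal-sized block
directly to every other rank, which a real implementation may not do.

```python
def alltoall_footprint(ranks: int, msg_bytes: int) -> dict:
    """Estimate per-rank buffer size and job-wide message count for one alltoall.

    Assumption: a direct pairwise exchange of equal-sized blocks.
    """
    # each rank holds one send block and one receive block per rank (itself included)
    send_buf = ranks * msg_bytes
    recv_buf = ranks * msg_bytes
    # every rank sends one block to each of the other ranks
    network_messages = ranks * (ranks - 1)
    return {
        "per_rank_bytes": send_buf + recv_buf,
        "network_messages": network_messages,
    }


ranks = 1024  # matches "mpirun -n 1024" in the command above
for msg_bytes in (1, 1024, 1024 * 1024):  # illustrative sizes, not IMB's actual steps
    fp = alltoall_footprint(ranks, msg_bytes)
    print(msg_bytes, fp["per_rank_bytes"], fp["network_messages"])
```

under that assumption a single alltoall at 1024 ranks puts over a million
messages in flight, regardless of the block size.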
any thoughts on debugging where the failures are? i might just need to
turn up the debugging, but i'm not sure where