[Beowulf] IB problem with openmpi 1.2.8
Bill Wichser
bill at Princeton.EDU
Mon Jul 12 13:28:29 PDT 2010
The machine is an older Intel Woodcrest cluster with a two-tiered IB
infrastructure built on Topspin/Cisco 7000 switches. The core switch is
an SFS-7008P with a single management module, which runs the subnet
manager (SM). The cluster runs RHEL4 and was upgraded last week to kernel
2.6.9-89.0.26.ELsmp. The openib-1.4 stack remained the same. Pretty much
stock.
After rebooting, the IB cards in the nodes remained in the INIT state.
I rebooted the chassis IB switch as it appeared that no SM was running.
No help. I manually started an opensm on a compute node, telling it to
ignore other masters, as initially it would only come up in STANDBY.
This turned all the nodes' IB ports to ACTIVE and I thought that I was done.
ibdiagnet complained that there were two masters. So I killed the
opensm and now it was happy. osmtest -f c/osmtest -f a comes back with
OSMTEST: TEST "All Validations" PASS.
ibdiagnet -ls 2.5 -lw 4x finds all my switches and nodes with
everything coming up roses.
The problem is that openmpi 1.2.8 with Intel 11.1.074 fails when the
node count goes over 32 (or maybe 40). This worked fine in the past,
before the reboot. User apps are failing as well as IMB v3.2. I've
increased the timeout using "mpiexec -mca btl_openib_ib_timeout 20",
which helped for 48 nodes, but at 64 and 128 nodes it didn't help at
all. Typical error messages follow.
Right now I am stuck. I'm not sure what or where the problem might be.
Nor where to go next. If anyone has a clue, I'd appreciate hearing it!
Thanks,
Bill
Typical error messages:
[0,1,33][btl_openib_component.c:1371:btl_openib_component_progress] from
woodhen-050 to: woodhen-036 error polling HP CQ with status RETRY
EXCEEDED ERROR status number 12 for wr_id 182937592248 opcode 0
[0,1,36][btl_openib_component.c:1371:btl_openib_component_progress] from
woodhen-084 to: woodhen-085 error polling HP CQ with status RETRY
EXCEEDED ERROR status number 12 for wr_id 5840952 opcode 0
[0,1,40][btl_openib_component.c:1371:btl_openib_component_progress] from
woodhen-098 to: woodhen-096 error polling LP CQ with status RETRY
EXCEEDED ERROR status number 12 for wr_id 182947573944 opcode 0
--------------------------------------------------------------------------
The InfiniBand retry count between two MPI processes has been
exceeded. "Retry count" is defined in the InfiniBand spec 1.2
(section 12.7.38):
The total number of times that the sender wishes the receiver to
retry timeout, packet sequence, etc. errors before posting a
completion error.
This error typically means that there is something awry within the
InfiniBand fabric itself. You should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.
Two MCA parameters can be used to control Open MPI's behavior with
respect to the retry count:
* btl_openib_ib_retry_count - The number of times the sender will
attempt to retry (defaulted to 7, the maximum value).
* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
to 10). The actual timeout value used is calculated as:
4.096 microseconds * (2^btl_openib_ib_timeout)
See the InfiniBand spec 1.2 (section 12.7.34) for more details.
--------------------------------------------------------------------------
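As a worked example of the timeout formula quoted in the help text above,
the short Python sketch below evaluates it for the default value (10) and
for the value tried in this post (20). The function name
ack_timeout_seconds is made up for illustration; only the formula itself
comes from the Open MPI message.

```python
# Local ACK timeout, using the formula quoted in the Open MPI help text:
#   4.096 microseconds * (2 ^ btl_openib_ib_timeout)
# ack_timeout_seconds is a hypothetical helper name, not an Open MPI API.

def ack_timeout_seconds(ib_timeout: int) -> float:
    """Return the local ACK timeout in seconds for a given MCA value."""
    return 4.096e-6 * (2 ** ib_timeout)


if __name__ == "__main__":
    # 10 is the default from the help text; 20 is the value tried above.
    for value in (10, 20):
        seconds = ack_timeout_seconds(value)
        print(f"btl_openib_ib_timeout={value}: {seconds:.6f} s")
```

By this formula the default of 10 gives roughly 4.2 milliseconds and a
setting of 20 gives roughly 4.3 seconds, so the change from 10 to 20
lengthens the timeout by a factor of 1024.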
--------------------------------------------------------------------------
DIFFERENT RUN:
[0,1,92][btl_openib_component.c:1371:btl_openib_component_progress] from
woodhen-157 to: woodhen-081 error polling HP CQ with status RETRY
EXCEEDED ERROR status number 12 for wr_id 183541169080 opcode 0
...
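The help text suggests noting the hosts on which the error occurred. A
minimal Python sketch of that tally follows, assuming the messages keep
the "from <host> to: <host> error polling" shape seen above. The function
name count_hosts and the regular expression are illustrative and are not
part of Open MPI; the sample text is the first run's messages copied from
this post.

```python
import re
from collections import Counter

# Matches the "from <host> to: <host> error polling" part of each message.
# The pattern is an assumption based on the messages pasted in this post.
PAIR_RE = re.compile(r"from\s+(\S+)\s+to:\s+(\S+)\s+error polling")


def count_hosts(log_text: str) -> Counter:
    """Count how often each host appears as sender or receiver."""
    # The messages are wrapped across lines, so collapse whitespace first.
    flat = " ".join(log_text.split())
    counts = Counter()
    for src, dst in PAIR_RE.findall(flat):
        counts[src] += 1
        counts[dst] += 1
    return counts


sample = """
[0,1,33][btl_openib_component.c:1371:btl_openib_component_progress] from
woodhen-050 to: woodhen-036 error polling HP CQ with status RETRY
EXCEEDED ERROR status number 12 for wr_id 182937592248 opcode 0
[0,1,36][btl_openib_component.c:1371:btl_openib_component_progress] from
woodhen-084 to: woodhen-085 error polling HP CQ with status RETRY
EXCEEDED ERROR status number 12 for wr_id 5840952 opcode 0
[0,1,40][btl_openib_component.c:1371:btl_openib_component_progress] from
woodhen-098 to: woodhen-096 error polling LP CQ with status RETRY
EXCEEDED ERROR status number 12 for wr_id 182947573944 opcode 0
"""

if __name__ == "__main__":
    # Hosts that show up most often across many runs are the first suspects.
    for host, n in count_hosts(sample).most_common():
        print(host, n)
```

Fed the combined output of several failing runs, a host (or a group of
hosts behind one leaf switch) that dominates the tally would be a
reasonable place to start looking.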