[Beowulf] IB problem with openmpi 1.2.8

Mon Jul 12 13:28:29 PDT 2010

The machine is an older Intel Woodcrest cluster with a two-tiered IB 
infrastructure built on Topspin/Cisco 7000 switches.  The core switch is an 
SFS-7008P with a single management module, which runs the subnet manager (SM).  
The cluster runs RHEL4 and was upgraded last week to kernel 
2.6.9-89.0.26.ELsmp.  The openib-1.4 stack remained the same.  Pretty much stock.

After rebooting, the IB cards in the nodes remained in the INIT state.  
I rebooted the chassis IB switch, as it appeared that no SM was running.  
No help.  I manually started an opensm on a compute node, telling it to 
ignore other masters, since initially it would only come up in STANDBY.  
This turned all the nodes' IB ports to active and I thought that I was done.

ibdiagnet complained that there were two masters.  So I killed the 
opensm and then ibdiagnet was happy.  osmtest -f c followed by 
osmtest -f a comes back with 
OSMTEST: TEST "All Validations" PASS. 

ibdiagnet -ls 2.5 -lw 4x   finds all my switches and nodes with 
everything coming up roses.

The problem is that openmpi 1.2.8 with Intel 11.1.074 fails when the 
node count goes over 32 (or maybe 40).  This worked fine in the past, 
before the reboot.  User apps are failing, as is IMB v3.2.  I've 
increased the timeout using "mpiexec -mca btl_openib_ib_timeout 20", 
which helped for 48 nodes, but at 64 and 128 nodes it didn't 
help at all.  Typical error messages follow.

Right now I am stuck.  I'm not sure what or where the problem might be.  
Nor where to go next.  If anyone has a clue, I'd appreciate hearing it!

Thanks,
Bill

Typical error messages:

[0,1,33][btl_openib_component.c:1371:btl_openib_component_progress] from 
woodhen-050 to: woodhen-036 error polling HP CQ with status RETRY 
EXCEEDED ERROR status number 12 for wr_id 182937592248 opcode 0
[0,1,36][btl_openib_component.c:1371:btl_openib_component_progress] from 
woodhen-084 to: woodhen-085 error polling HP CQ with status RETRY 
EXCEEDED ERROR status number 12 for wr_id 5840952 opcode 0
[0,1,40][btl_openib_component.c:1371:btl_openib_component_progress] from 
woodhen-098 to: woodhen-096 error polling LP CQ with status RETRY 
EXCEEDED ERROR status number 12 for wr_id 182947573944 opcode 0
--------------------------------------------------------------------------
The InfiniBand retry count between two MPI processes has been
exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
(section 12.7.38):

    The total number of times that the sender wishes the receiver to
    retry timeout, packet sequence, etc. errors before posting a
    completion error.

This error typically means that there is something awry within the
InfiniBand fabric itself.  You should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.

Two MCA parameters can be used to control Open MPI's behavior with
respect to the retry count:

* btl_openib_ib_retry_count - The number of times the sender will
  attempt to retry (defaulted to 7, the maximum value).

* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
  to 10).  The actual timeout value used is calculated as:

     4.096 microseconds * (2^btl_openib_ib_timeout)

  See the InfiniBand spec 1.2 (section 12.7.34) for more details.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
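
For a sense of scale, the timeout formula quoted in the help text above can be evaluated directly.  This is only a sketch of the arithmetic: the function name is made up for illustration, and it computes the nominal value from the quoted formula, not whatever the HCA actually applies.

```python
# Nominal local ACK timeout from the formula in the Open MPI help text:
#     4.096 microseconds * (2^btl_openib_ib_timeout)
# ack_timeout_seconds is a hypothetical helper name, not an Open MPI API.

def ack_timeout_seconds(exponent):
    """Return the nominal ACK timeout in seconds for a given exponent."""
    return 4.096e-6 * (2 ** exponent)


if __name__ == "__main__":
    # 10 is the default named in the help text; 20 is the value tried above.
    for exponent in (10, 20):
        millis = ack_timeout_seconds(exponent) * 1e3
        print(f"btl_openib_ib_timeout={exponent}: {millis:.3f} ms")
```

With the default of 10 the nominal timeout is about 4.2 ms; with 20 it is about 4.3 seconds per attempt, so retries still running out at that setting suggests something other than a merely slow fabric.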

DIFFERENT RUN:

[0,1,92][btl_openib_component.c:1371:btl_openib_component_progress] from 
woodhen-157 to: woodhen-081 error polling HP CQ with status RETRY 
EXCEEDED ERROR status number 12 for wr_id 183541169080 opcode 0
...
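
Since the help text suggests noting which hosts show up in these errors, one way to do that across many runs is a small tally script.  The following is a minimal sketch, assuming the message format is exactly what is quoted above (it is inferred from these few lines only) and that the output has been saved as text; the function and variable names are illustrative.

```python
import re
from collections import Counter

# Pattern inferred from the RETRY EXCEEDED messages quoted in this post;
# other Open MPI versions may word the message differently.
RETRY_RE = re.compile(
    r"from (?P<src>\S+) to: (?P<dst>\S+) error polling (?:HP|LP) CQ "
    r"with status RETRY EXCEEDED ERROR"
)


def tally_hosts(log_text):
    """Count how often each host appears as either end of a failed link."""
    # Collapse all whitespace first so messages wrapped across lines
    # (as in this email) still match.
    flat = " ".join(log_text.split())
    counts = Counter()
    for match in RETRY_RE.finditer(flat):
        counts[match.group("src")] += 1
        counts[match.group("dst")] += 1
    return counts


# The three messages from the first run above, wrapped as they appear here.
SAMPLE = """
[0,1,33][btl_openib_component.c:1371:btl_openib_component_progress] from
woodhen-050 to: woodhen-036 error polling HP CQ with status RETRY
EXCEEDED ERROR status number 12 for wr_id 182937592248 opcode 0
[0,1,36][btl_openib_component.c:1371:btl_openib_component_progress] from
woodhen-084 to: woodhen-085 error polling HP CQ with status RETRY
EXCEEDED ERROR status number 12 for wr_id 5840952 opcode 0
[0,1,40][btl_openib_component.c:1371:btl_openib_component_progress] from
woodhen-098 to: woodhen-096 error polling LP CQ with status RETRY
EXCEEDED ERROR status number 12 for wr_id 182947573944 opcode 0
"""

if __name__ == "__main__":
    for host, count in tally_hosts(SAMPLE).most_common():
        print(host, count)
```

With only the messages shown here every host appears once, so nothing stands out; over a larger set of failed runs, a host or group of hosts that keeps recurring would point at a particular HCA, cable, or leaf switch.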