[Beowulf] MPI + IB question
Christopher Samuel
samuel at unimelb.edu.au
Sun Nov 18 18:59:59 PST 2012
On 15/11/12 22:02, Bogdan Costescu wrote:
> This is not really a crash... it actually tells you politely that
> it couldn't reach other ranks and terminates. The following lines:
>
> Process 1 ([[5187,1],1]) is on host: node24
> Process 2 ([[5187,1],0]) is on host: node32
> BTLs attempted: self sm
>
> mean that the only BTLs that qualified to continue were self and sm,
> neither of which allows inter-node communication. Very likely tcp
> (which you disabled) was the only inter-node BTL available. So now
> it's up to you to find out why openib BTL could not be selected...
As Bogdan says, you really need to investigate the IB on those two
nodes to see whether it is working on each of them or not.
Running ibstatus is probably a good start, to check that the card is
happily talking to the fabric, e.g.:
[root@merri001 ~]# ibstatus
Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c903:0007:3d51
        base lid:        0x5c
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)
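
If you need to run the same check on more than a couple of nodes, a rough Python sketch along these lines could parse that output and flag any port that is not ACTIVE / LinkUp. The names parse_ibstatus and unhealthy_ports are made up for illustration, and the parser assumes the output layout matches the sample above (one header line per port, then "field: value" lines); other versions of the tool may print things differently.

```python
import re

# Sample text copied from the ibstatus output above; used only for illustration.
SAMPLE = """\
Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c903:0007:3d51
        base lid:        0x5c
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)
"""

# Assumed header layout: Infiniband device '<name>' port <n> status:
HEADER = re.compile(r"^Infiniband device '([^']+)' port (\d+) status:")


def parse_ibstatus(text):
    """Return {(device, port): {field: value}} from ibstatus-style text."""
    ports = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            current = (match.group(1), int(match.group(2)))
            ports[current] = {}
        elif current is not None and ":" in line:
            # Split on the first colon only, so "4: ACTIVE" stays intact.
            key, _, value = line.partition(":")
            ports[current][key.strip()] = value.strip()
    return ports


def unhealthy_ports(ports):
    """List the (device, port) pairs that are not both ACTIVE and LinkUp."""
    bad = []
    for port, fields in ports.items():
        state_ok = fields.get("state", "").endswith("ACTIVE")
        phys_ok = fields.get("phys state", "").endswith("LinkUp")
        if not (state_ok and phys_ok):
            bad.append(port)
    return bad
```

On a real node you would feed it the live output instead of SAMPLE, for example the stdout of subprocess.run(["ibstatus"], capture_output=True, text=True), and treat an empty list from unhealthy_ports as "the card looks OK from this side".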
There's also ibstat, which gives you somewhat more verbose info.
cheers,
Chris
--
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci