MPICH Problem on a Channel Bonded Mini-Cluster
Gordon Gere
gagere at uwo.ca
Tue Jun 12 10:42:24 PDT 2001
Hi,
I am running a small test cluster of four 1 GHz Athlon computers,
channel bonded using Intel EtherExpress Pro+ (82559) network cards. I
will quickly sum up my problem and post all of the related information at
the bottom of this message.
When channel bonded, the computers perform fine, and netperf shows
roughly a 95% improvement in network bandwidth. As a performance test
I run a plane-wave geometry optimization on 32 water molecules using
CPMD. This test makes extensive use of MPI's all-to-all communication
(hence our hope that channel bonding would improve performance).
When this test is run using the MPICH implementation of MPI the program
will run for approximately 5-8 minutes and then stall. By stall I
mean the program remains active and continues to use 95% of the CPU
time, but it stops producing output and generates no network traffic.
Eventually the program will report an error regarding communication (see
below). I have tested the exact same input and program using the LAM
implementation of MPI, and it runs fine, finishing slightly faster
than the non-channel-bonded time. With single-channel networking we
have found that LAM is much faster on 2-6 nodes but starts losing
performance above 8 (see the performance numbers at the end), while
MPICH, although slower than LAM at 2-6 nodes, continues to scale well
up to 14 nodes. We would like to use MPICH if possible. I hope
someone can shed some light on why our test finishes under LAM but
not under MPICH.
Error Message:
net_recv failed for fd = 7
p2_730: p4_error: net_recv read, errno = : 110
rm_l_2_731: p4_error: interrupt SIGINT: 2
p1_739: (rm_l_1_740: p4_error: net_recv read: probable EOF on socket: 1
bm_list_19021: p4_error: net_recv read: probable EOF on socket: 1
System info:
Running Red Hat 6.2, with the kernel upgraded to 2.2.16-22.
Network info (ifconfig output):
bond0 Link encap:Ethernet HWaddr 00:02:B3:1C:D5:32
inet addr:192.168.0.1 Bcast:192.168.0.255 Mask:255.255.255.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:78 errors:0 dropped:0 overruns:0 frame:0
TX packets:33 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
eth0 Link encap:Ethernet HWaddr 00:02:B3:1C:D5:32
inet addr:192.168.0.1 Bcast:192.168.0.255 Mask:255.255.255.0
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:39 errors:0 dropped:0 overruns:0 frame:0
TX packets:17 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
Interrupt:5 Base address:0xa000
eth1 Link encap:Ethernet HWaddr 00:02:B3:1C:D5:32
inet addr:192.168.0.1 Bcast:192.168.0.255 Mask:255.255.255.0
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:39 errors:0 dropped:0 overruns:0 frame:0
TX packets:16 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
Interrupt:10 Base address:0xc000
I set up bonding using the latest driver (v1.12) from ftp.scyld.com,
with the following commands:
modprobe bonding
ifconfig bond0 192.168.0.1 netmask 255.255.255.0 up
ifenslave bond0 eth0
ifenslave bond0 eth1
(and made changes to /etc/sysconfig/network-scripts/ifcfg-*)
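For completeness, the ifcfg changes just move the usual static-address
settings onto bond0; a sketch of what they look like (addresses as
above, everything else illustrative):

# /etc/sysconfig/network-scripts/ifcfg-bond0 (illustrative)
DEVICE=bond0
IPADDR=192.168.0.1
NETMASK=255.255.255.0
NETWORK=192.168.0.0
BROADCAST=192.168.0.255
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-eth0 (ifcfg-eth1 is analogous)
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=none
# eth0/eth1 are enslaved to bond0 by the ifenslave commands above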
MPICH/LAM installation: MPICH and LAM were both compiled and
installed using the Portland Group compilers.
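For MPICH the build was along these lines (flag spellings from memory,
so treat this as a sketch rather than the exact command line used):

# MPICH 1.2.x with the ch_p4 device and the Portland Group compilers
# (install prefix and exact option names are illustrative)
./configure --with-device=ch_p4 -prefix=/usr/local/mpich \
    -cc=pgcc -fc=pgf77
make
make install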
Results:
NetPerf results on eepro100 NICs:
single NIC      94.02 Mbits/s
bonded NICs    187.93 Mbits/s
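(For reference, these are plain netperf TCP stream numbers, i.e. a
test roughly of this form:)

# TCP stream test from this node to its neighbour over the bonded
# interface (the remote address 192.168.0.2 is illustrative)
netperf -H 192.168.0.2 -t TCP_STREAM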
These results are per step in seconds using CPMD with 4 nodes:
                 LAM     MPICH
single NIC       90s     119s
bonded NICs      80s     error (88s)*
* Although MPICH with channel bonding was able to complete 1-2 steps,
it was not able to finish; I leave the number in for reference.
Results for single-channel CPMD runs on the same input for LAM and
MPICH, per number of nodes:
                 LAM     MPICH
 1 cpu          217s     217s
 2 cpu          117s     175s
 4 cpu           84s     117s
 6 cpu           75s      98s
 8 cpu          114s      60s
12 cpu          120s      48s
14 cpu           n/a      43s
Since the error occurs only when I use channel bonding, it seems
clear that something related to the network configuration is the
problem. I have thought of a couple of possible causes and fixes and
would appreciate any advice on them.
First, I have noticed that some people have reported problems with
the bonding.o module from the 2.2.16 kernel and suggested using the
bonding.o from 2.2.17; however, I have not been able to find such a
module. I have also thought about upgrading (again) to the 2.2.19
kernel available from ftp.redhat.com, but I haven't heard anything
about it yet.
Since the error generated by the program is rather oblique, the
problem is hard to diagnose. One related symptom is transmit and
receive overruns (reported by ifconfig) on the interfaces used to
communicate with the cluster. This points me towards some sort of
network problem; however, I get the same receive and transmit
overruns when running LAM, which still finishes. This leads me to
think either that LAM has been implemented to be more robust to this,
or that MPICH runs into some sort of problem handling the overruns.
Perhaps the bonding.o module is at fault, or the network cards/drivers
fail silently under bonding.
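(For reference, the same drop/overrun counters can be watched during a
run by polling /proc/net/dev with a simple loop like the one below;
this is only an illustration, ifconfig reports the same numbers.)

# Poll the bonded and slave interface counters every 5 seconds.
# /proc/net/dev lists RX/TX packets, errors, drops and overruns
# per interface, the same values ifconfig prints.
while true; do
    grep -E 'bond0|eth0|eth1' /proc/net/dev
    sleep 5
done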
Thanks in advance.
Gordon Gere
University of Western Ontario
(519)661-2111 ext. 86353
London, Ontario