[Beowulf] substantial RX packet drops during Pallas over e1000	(Rocks 4.1)
    Jeff Johnson 
    jeff.johnson at wsm.com
       
    Tue May 16 23:44:33 PDT 2006
    
    
  
Greetings,
    I am running Rocks 4.1 on a 30-node system and seeing serious RX packet
loss, with drops and overruns, while running heavy MPI I/O over e1000. I
have replaced cabling and switches, updated the e1000 driver, and tried
multiple kernels. None of these changes affects the issue. I am pursuing
a hardware resolution with Intel and Supermicro, but I am posting here in
case someone has seen similar behavior.
    System details:
       30 nodes - Intel Pentium-D 840, 4GB RAM, 80GB SATA
             Supermicro PDSMI motherboard
             Intel 82573E and 82573L gigabit ethernet controllers
             (only one network connected)
             2.6.9-34.ELsmp and 2.6.16.11 kernels
             e1000-7.0.38-1 driver
    Run details:
       mpirun -nolocal -np 18 -machinefile /home/test/machines.20-29 \
           /home/test/IMB-MPI1 Alltoall -npmin 18 -msglen /home/test/Lengths
       (msglen values of 32, 256, 512 and 1024 have each been run on their
       own, and every one of them results in packet drops)
    Packet drop example (other nodes show similar numbers):
           RX packets:1843133 errors:0 dropped:1245 overruns:0 frame:0
           TX packets:1764828 errors:0 dropped:0 overruns:0 carrier:0
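
To put the counters above in proportion, here is a minimal Python sketch that parses the RX line and computes the drop fraction. It assumes the old ifconfig "key:value" layout shown above, and it assumes that dividing "dropped" by "packets" is a fair measure (whether dropped frames are already included in the packet count is not something this post establishes). The helper name parse_counters is made up for illustration.

```python
import re

# RX line copied from the ifconfig output quoted above.
RX_LINE = "RX packets:1843133 errors:0 dropped:1245 overruns:0 frame:0"


def parse_counters(line):
    """Turn 'key:value' pairs from an ifconfig counter line into a dict of ints."""
    return {key: int(value) for key, value in re.findall(r"(\w+):(\d+)", line)}


rx = parse_counters(RX_LINE)

# Assumption: drop fraction is taken as dropped / packets.
drop_fraction = rx["dropped"] / rx["packets"]

print(f"dropped {rx['dropped']} of {rx['packets']} packets "
      f"({100 * drop_fraction:.3f}%)")
```

For the numbers above this works out to roughly 0.07% of received packets.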
    I have tried increasing the e1000 RxDescriptors value to the maximum
of 4096, thinking that the Alltoall test might be exhausting receive
buffer resources, but the drops still occur.
    At Intel's advice I enabled ARP filtering
(/proc/sys/net/ipv4/conf/all/arp_filter), but it did nothing to change
the behavior of the problem.
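
For anyone who wants to compare a run before and after a tuning change such as the ones above, one possible approach is to snapshot /proc/net/dev on each node and diff the RX counters. The sketch below is only an illustration under assumptions: the function names are hypothetical, the SAMPLE text is hand-written (the byte counts in it are invented, only the packet and drop figures come from this post), and it relies on the standard Linux /proc/net/dev column order (bytes, packets, errs, drop, fifo, ...).

```python
# Hand-written sample in /proc/net/dev layout. Byte counts are invented;
# packet and drop counts are the ones quoted in this post.
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth0:123456789 1843133    0 1245    0     0          0         0 987654321 1764828    0    0    0     0       0          0
"""


def parse_proc_net_dev(text):
    """Return {iface: {'rx_packets', 'rx_drop', 'rx_fifo'}} from /proc/net/dev text."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        # Split on the colon first: the name can be glued to the first number.
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:
            continue
        stats[name.strip()] = {
            "rx_packets": int(fields[1]),
            "rx_drop": int(fields[3]),   # shown as "dropped" by ifconfig
            "rx_fifo": int(fields[4]),   # shown as "overruns" by ifconfig
        }
    return stats


def counter_delta(before, after, iface):
    """Difference of each RX counter for one interface between two snapshots."""
    return {key: after[iface][key] - before[iface][key] for key in after[iface]}
```

On a live node the text would come from reading /proc/net/dev once before the mpirun and once after, then passing both parsed snapshots to counter_delta for the interface carrying the MPI traffic.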
Any ideas?
--Jeff
    
    