[eepro100] Transmit errors with i8255X cards using eepro100 driver on compaq alpha
Jim Matthews
jmatthew@tabdemo.larc.nasa.gov
Fri, 20 Apr 2001 11:36:59 -0400 (EDT)
I am in the process of setting up an Alpha Linux (Beowulf) cluster but
am running into network problems with two i8255X cards. Originally
I had set up these cards with channel bonding, but I have disabled it so
I can debug the problems more easily. Right now each machine in the
cluster has 3 cards: one is a DEC 21143 (using the de4x5 driver) and the
other 2 are i8255X cards (using the EtherExpress eepro100 driver), also
from Compaq. Each card is on a different subnet.
To test throughput robustness I have sent large amounts of data (5
cats of /dev/zero) to other cluster nodes. I have found that if I
saturate only one of the i8255X cards, data transfers without error. The
moment I start to send data (e.g. initiate a cat of /dev/zero) over one
of the other interfaces, either the 21143 or the other i8255X, I begin
to get transmit errors on one of the i8255X cards. If I send data over
both i8255X cards I get transmit errors on both i8255X interfaces,
but I never see transmit errors on the 21143 interface. Transmit errors
are reported on all nodes receiving the data.
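For anyone trying to reproduce the load, here is a minimal Python sketch of a zero-byte stream, roughly equivalent to cat'ing /dev/zero into a network connection. The function name send_zeros, the chunk size, and the byte count are hypothetical and chosen only for illustration. The transport used in the original test is not specified above, so a plain connected TCP socket is assumed.

```python
import socket


def send_zeros(sock, total_bytes, chunk_size=65536):
    """Send total_bytes zero bytes over a connected socket.

    Returns the number of bytes handed to the socket.
    """
    chunk = bytes(chunk_size)  # a buffer of zeros, like reading /dev/zero
    sent = 0
    while sent < total_bytes:
        n = min(chunk_size, total_bytes - sent)
        sock.sendall(chunk[:n])  # blocks until the whole slice is queued
        sent += n
    return sent
```

To load one interface, connect to a listener on the address of the peer's matching subnet, for example with socket.create_connection((host, port)), where host and port are placeholders for your own setup, and run one sender per interface in parallel.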
The three interface cards are connected to 2 Cisco 3500 switches. One
of the switches is segmented into 2 VLANs to isolate traffic between
interfaces. When I observe a transmit error in Linux, I also see the
switch try to renegotiate the link for the interface that reported the
error. The renegotiation made me wonder about a hardware problem or the
switch configuration, but since the 21143 interface works without error
when connected to either switch, I am assuming it is a driver issue.
Another message I am seeing in the i8255X debug output (included at the
end) is the "Tx ring dump". I notice that TX_RING_SIZE is set in the
eepro100.c source code. Could the setting for this value affect the
problem?
The Alphas have the following configuration:
Compaq Alpha XP1000, 21264 at 667 MHz
1.2 GB RAM
1 21143 card
2 i8255X cards
Red Hat Linux 7.0
Kernel 2.4.2 (includes latest eepro100 driver, v1.36)
Do you have any idea what would be causing this problem?
Help is greatly appreciated.
Thanks,
--Jim Matthews
--System Administrator
--Raytheon Information Services
--NASA Langley Research Center
Additional info follows:
The following is the detection of the two i82555 cards by the 2.4.2
kernel's eepro100 driver:
Apr 19 15:56:50 cfdalc2n1 kernel: eepro100.c:v1.09j-t 9/29/99 Donald Becker http://cesdis.gsfc.nasa.gov/linux/drivers/eepro100.html
Apr 19 15:56:50 cfdalc2n1 kernel: eepro100.c: $Revision: 1.36 $ 2000/11/17 Modified by Andrey V. Savochkin <saw@saw.sw.com.sg> and others
Apr 19 15:56:50 cfdalc2n1 kernel: eth1: OEM i82557/i82558 10/100 Ethernet, 00:50:8B:B4:A2:5C, IRQ 36.
Apr 19 15:56:50 cfdalc2n1 kernel: Board assembly 726837-017, Physical connectors present: RJ45
Apr 19 15:56:50 cfdalc2n1 kernel: Primary interface chip i82555 PHY #1.
Apr 19 15:56:50 cfdalc2n1 kernel: General self-test: passed.
Apr 19 15:56:50 cfdalc2n1 kernel: Serial sub-system self-test: passed.
Apr 19 15:56:50 cfdalc2n1 kernel: Internal registers self-test: passed.
Apr 19 15:56:50 cfdalc2n1 kernel: ROM checksum self-test: passed (0x04f4518b).
Apr 19 15:56:50 cfdalc2n1 kernel: eth2: OEM i82557/i82558 10/100 Ethernet, 00:50:8B:B4:48:DA, IRQ 32.
Apr 19 15:56:50 cfdalc2n1 kernel: Board assembly 726837-017, Physical connectors present: RJ45
Apr 19 15:56:50 cfdalc2n1 kernel: Primary interface chip i82555 PHY #1.
Apr 19 15:56:50 cfdalc2n1 kernel: General self-test: passed.
Apr 19 15:56:50 cfdalc2n1 kernel: Serial sub-system self-test: passed.
Apr 19 15:56:50 cfdalc2n1 kernel: Internal registers self-test: passed.
Apr 19 15:56:50 cfdalc2n1 kernel: ROM checksum self-test: passed (0x04f4518b).
These are the syslog messages I am seeing for the transmit timeouts:
Apr 19 16:55:25 cfdalc2n1 kernel: NETDEV WATCHDOG: eth2: transmit timed out
Apr 19 16:55:25 cfdalc2n1 kernel: eth2: Transmit timed out: status 0050 0c00 at 10218925/10218953 command 000c0000.
Apr 19 16:56:27 cfdalc2n1 kernel: NETDEV WATCHDOG: eth3: transmit timed out
Apr 19 16:56:27 cfdalc2n1 kernel: eth3: Transmit timed out: status 0050 0c00 at 9000445/9000473 command 000c0000.
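To compare timeouts across nodes, the "Transmit timed out" lines can be pulled apart with a small script. The following is only a sketch: the names TIMEOUT_RE and parse_timeout are made up for illustration, and reading the "N/M" pair as two Tx queue counters whose difference is the number of outstanding descriptors is an assumption on my part, not something taken from the driver source.

```python
import re

# Pattern for the driver's timeout message. \s+ is used between fields so
# that lines re-wrapped by a mail client still match once joined.
TIMEOUT_RE = re.compile(
    r"(?P<iface>eth\d+): Transmit timed out: status\s+"
    r"(?P<status>[0-9a-f]{4})\s+(?P<cmd>[0-9a-f]{4})\s+at\s+"
    r"(?P<first>\d+)/(?P<second>\d+)\s+command\s+(?P<entry>[0-9a-f]{8})"
)


def parse_timeout(text):
    """Return the fields of one timeout message, or None if it does not match."""
    m = TIMEOUT_RE.search(" ".join(text.split()))  # collapse wrapped whitespace
    if m is None:
        return None
    first, second = int(m["first"]), int(m["second"])
    return {
        "iface": m["iface"],
        "status": m["status"],
        "cmd": m["cmd"],
        "first": first,
        "second": second,
        # Assumption: the gap is the number of queued, unreclaimed packets.
        "outstanding": second - first,
        "entry": m["entry"],
    }
```

Applied to the two messages above, the gap between the counters is 28 on both eth2 and eth3.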
This is a longer "debug" version of the above transmit errors for one
card:
NETDEV WATCHDOG: eth2: transmit timed out
eth2: Transmit timed out: status 0050 0c00 at 10201183/10201211 command 000c0000.
eth2: Tx ring dump, Tx queue 10201211 / 10201183:
eth2: 0 200ca000.
eth2: 1 000ca000.
eth2: 2 000ca000.
eth2: 3 000ca000.
eth2: 4 000ca000.
eth2: 5 000ca000.
eth2: 6 000ca000.
eth2: 7 000ca000.
eth2: 8 200ca000.
eth2: 9 000ca000.
eth2: 10 000ca000.
eth2: 11 000ca000.
eth2: 12 000ca000.
eth2: 13 000ca000.
eth2: 14 000ca000.
eth2: 15 000ca000.
eth2: 16 200ca000.
eth2: 17 000ca000.
eth2: 18 000ca000.
eth2: 19 000ca000.
eth2: 20 000ca000.
eth2: 21 000ca000.
eth2: 22 000ca000.
eth2: 23 000ca000.
eth2: 24 200ca000.
eth2: 25 000ca000.
eth2: 26 400ca000.
eth2: =27 000ca000.
eth2: 28 000ca000.
eth2: 29 000ca000.
eth2: 30 000ca000.
eth2: * 31 000c0000.
eth2: Printing Rx ring (next to receive into 5354158, dirty index 5354158).
eth2: 0 00000001.
eth2: 1 00000001.
eth2: 2 00000001.
eth2: 3 00000001.
eth2: 4 00000001.
eth2: 5 00000001.
eth2: 6 00000001.
eth2: 7 00000001.
eth2: 8 00000001.
eth2: 9 00000001.
eth2: 10 00000001.
eth2: 11 00000001.
eth2: 12 00000001.
eth2: l 13 c0000001.
eth2: *=14 00000001.
eth2: 15 00000001.
eth2: 16 00000001.
eth2: 17 00000001.
eth2: 18 00000001.
eth2: 19 00000001.
eth2: 20 00000001.
eth2: 21 00000001.
eth2: 22 00000001.
eth2: 23 00000001.
eth2: 24 00000001.
eth2: 25 00000001.
eth2: 26 00000001.
eth2: 27 00000001.
eth2: 28 00000001.
eth2: 29 00000001.
eth2: 30 00000001.
eth2: 31 00000001.
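A note on reading the dumps: the counters in the messages appear to map onto ring slots by taking them modulo the ring size. The short Python sketch below checks that arithmetic. The ring size of 32 is inferred from the dump listing entries 0 through 31, not read from TX_RING_SIZE in the source, and treating the "*" and "=" markers as the positions of the two counters is an assumption. The helper name ring_slot is hypothetical.

```python
def ring_slot(counter, ring_size=32):
    """Map a free-running queue counter onto a slot index in the ring."""
    return counter % ring_size


# Counters from "Tx ring dump, Tx queue 10201211 / 10201183"
tx_a, tx_b = 10201211, 10201183
print(ring_slot(tx_a))  # 27, the entry marked "=" in the Tx dump
print(ring_slot(tx_b))  # 31, the entry marked "*" in the Tx dump
print(tx_a - tx_b)      # 28 descriptors between the two counters

# Counter from "next to receive into 5354158, dirty index 5354158"
print(ring_slot(5354158))  # 14, the entry marked "*=" in the Rx dump
```

If that reading is right, 28 of the 32 Tx slots were in use when the watchdog fired, so the ring was nearly full but not completely full.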