[eepro100] eepro100 failures
Donald Becker
becker@scyld.com
Fri, 24 Aug 2001 00:55:04 -0400 (EDT)
On Thu, 23 Aug 2001, Steinar Hauan wrote:
> I have a small cluster of dual-cpu P3 machines on RedHat 7.1++
> with network trouble using Intel Pro/100 adapters. Specifically,
> a diff on tcpdump for the Tx and Rx ends -- reproducible by both nfs
> and ftp -- shows that 1 bit is being flipped. The error is quite rare;
> only a few specific bit sequences located at specific offsets produce
> the error. A typical bit pattern is 5-12 bytes and cause errors if
> found at offset j*4+1 in a packet with j=1, 2, ... , N.
This is a very unusual error.
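To pin down a report like this, the comparison of the Tx and Rx
captures can be reduced to a list of flipped bit positions. The
following is only a sketch: the function name flipped_bits is made up
for illustration, and it assumes you have already extracted the same
packet as raw bytes from both tcpdump captures.

```python
def flipped_bits(sent: bytes, received: bytes):
    """Return a list of (byte_offset, bit_index) for every differing bit.

    bit_index 0 is the least significant bit of the byte.
    Hypothetical helper: both arguments are the same packet, captured
    at the transmitting and the receiving end.
    """
    if len(sent) != len(received):
        raise ValueError("captures differ in length; compare equal-sized packets")
    flips = []
    for offset, (a, b) in enumerate(zip(sent, received)):
        diff = a ^ b                      # set bits mark the differences
        for bit in range(8):
            if diff & (1 << bit):
                flips.append((offset, bit))
    return flips
```

A single entry in the result confirms a one-bit flip, and the byte
offset can then be checked against the j*4+1 pattern from the report.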
I'll first rule out what could be causing the problem.
Bit flips on the wire
These would be caught by the Ethernet CRC. And besides, with
100baseTx bits are corrupted in groups of four, not one at a time.
Ethernet errors are reported in /proc/net/dev and you'll likely see
zillions before an undetected bit slips through. (The probability
depends heavily on the noise type.)
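A quick way to check those counters on each node is to parse
/proc/net/dev. This is a minimal sketch, assuming the usual layout of
two header lines followed by one "name: 16 counters" line per
interface. The function name and the dictionary keys are my own
choices, not anything the kernel defines.

```python
def parse_net_dev(text: str) -> dict:
    """Extract packet and error counters from /proc/net/dev contents.

    Assumed column order per interface: 8 receive counters
    (bytes packets errs drop fifo frame compressed multicast)
    followed by 8 transmit counters
    (bytes packets errs drop fifo colls carrier compressed).
    """
    stats = {}
    for line in text.splitlines()[2:]:        # skip the two header lines
        if ":" not in line:
            continue
        name, data = line.split(":", 1)       # copes with "eth0:1234" too
        fields = [int(x) for x in data.split()]
        if len(fields) < 16:
            continue                          # unexpected layout, skip it
        stats[name.strip()] = {
            "rx_packets": fields[1],
            "rx_errs": fields[2],
            "rx_frame": fields[5],
            "tx_packets": fields[9],
            "tx_errs": fields[10],
        }
    return stats
```

Feed it the result of open("/proc/net/dev").read(). If rx_errs stays at
or near zero while corrupted files keep showing up, wire-level bit
flips are an unlikely explanation.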
Bit flips inside the chip
Bit flips on the bus
These would both be caught by the TCP/IP checksum. No single bit
error will slip through. Additionally the PCI bus has parity check
which will catch single bit errors.
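The claim that no single bit error gets past the TCP/IP checksum can be
checked by brute force. The sketch below implements the 16-bit ones'
complement sum in the style of RFC 1071 and flips every bit of a buffer
in turn. It is an illustration, not the kernel's code, and the function
names are made up.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum over data (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"                    # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                     # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def single_bit_flips_detected(data: bytes) -> bool:
    """Return True if every possible one-bit flip changes the checksum."""
    original = internet_checksum(data)
    buf = bytearray(data)
    for i in range(len(buf)):
        for bit in range(8):
            buf[i] ^= 1 << bit             # corrupt one bit
            if internet_checksum(bytes(buf)) == original:
                return False
            buf[i] ^= 1 << bit             # restore it
    return True
```

A single flip changes one 16-bit word by a power of two, which can never
be a multiple of 65535, so the sum always changes. Two compensating
flips in the same bit position of different words can cancel, which is
why the guarantee only covers single bit errors.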
Note that my drivers do not use the Rx checksum support in the
eepro100 chips. Recent drivers do show how to retrieve the partial
checksum, but this is a good example of why it's questionable to use
the feature.
So that leaves us with memory, kernel or processor errors after the TCP/IP
checksum is computed.
With only a single bit flipped, it's unlikely to be a wild write to
memory from some other part of the kernel.
> Now here is the main cause of concern. Yesterday, I went out to my
> local computer store and bought 4 new ethernet cards.
>
> 1x 3Com PCI 3c590 Vortex 10Mbps (10Tx-HD)
> 1x 3Com 3c905B Cyclone 100baseTx (100Tx-FD)
> 1x Intellinet 10/100 PCI network card
> 1x SMC 1244TX Rev B (100TX-FD)
>
> the last two card use the RealTek RTL8139 chip (100Tx-FD).
>
> Whenever i boot my machines with one of the above cards along
> with the "noapic" kernel option, the errors go away.
My guess: the bit flips occur as memory corruption during bus master
writes from the PCI bus. The eepro100 chip is likely a '559 which is
PCI v2.1 and can generate very long PCI bursts. The 3c590, 905B and
rtl8139 will not generate packet-sized bursts. Or perhaps the eepro100
has just the wrong PCI timing that triggers memory corruption.
[[ BTW, where did you find an _ancient_ 3c590?! Under a pyramid? ]]
> Could this be a
> driver error? What else could it be? Why does "noapic" make a difference
> with IRQs locked to specific pci slots? (and nothing to share with)
Different topic: "noapic" disables the extra APIC features and thus avoids
a long-standing Linux kernel bug.
Donald Becker becker@scyld.com
Scyld Computing Corporation http://www.scyld.com
410 Severn Ave. Suite 210 Second Generation Beowulf Clusters
Annapolis MD 21403 410-990-9993