[tulip-bug] tulip.c:v0.92 Rx suspended problem
Josip Loncaric
josip@icase.edu
Mon, 21 Aug 2000 10:26:24 -0400
On a few of our (supposedly identical) systems we tend to lose network
connectivity. I wrote a 'heartbeat' script to detect this, run
'tulip-diag', and then restart the interface (including removing and
reloading the tulip driver). Here is a fragment from the script log:
Sat Aug 19 07:15:17 EDT 2000 : heartbeat : n015 lost connectivity to fs1
tulip-diag.c:v2.00 4/19/2000 Donald Becker (becker@scyld.com)
http://www.scyld.com/diag/index.html
Index #1: Found a Lite-On 82c168 PNIC adapter at 0xd000.
Port selection is MII, full-duplex.
Transmit started, Receive started, full-duplex.
The Rx process state is 'Suspended -- no Rx buffers'.
The Tx process state is 'Idle'.
The transmit threshold is 128.
Use '-a' or '-aa' to show device registers,
'-e' to show EEPROM contents, -ee for parsed contents,
or '-m' or '-mm' to show MII management registers.
Sat Aug 19 07:16:26 EDT 2000 : heartbeat : n015 connectivity restored to fs1
The above may be related to the condition reported in /var/log/messages
a couple of minutes earlier:
Aug 19 07:13:22 n015 kernel: eth0: Restarted Rx at 2632898 / 2632898.
FYI, the above was observed on a system running Red Hat 6.2 with kernel
2.2.16-3, updated to tulip.c:v0.92 4/17/2000. The hardware includes an Asus
P2B motherboard (440BX chipset, single PII/400) and a NetGear FA310TX
network card with a Lite-On chipset. We have another 31 identically
configured systems which do *not* have the above problem, so the cause
is probably some intricate hardware interaction. Perhaps changing the
network card would help -- but we are tired of experimenting with
network cards, so I now use my 'heartbeat' script to reset the tulip
driver. It would be much nicer if the tulip driver could detect the
above problem and recover automatically -- or better yet, avoid the
problem entirely.
Sincerely,
Josip
P.S. 'heartbeat' concludes that the interface is dead when pinging two
different servers a minute apart fails in both cases. Recovery does
ifdown/rmmod/modprobe/ifup to reload the tulip driver, then pings a
server again. If the server responds, we're back in business; otherwise
the script pauses for 10 minutes then tries again. Most of our nodes do
not have a problem, but this one has a problem about every 5 days.
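For readers who want to try something similar, the logic described in this P.S. can be sketched roughly as follows. This is not the original 'heartbeat' script, only a minimal Python illustration under stated assumptions: the function names are hypothetical, the interface is taken to be eth0 and the module to be named tulip (as the log lines above suggest), the system 'ping' is assumed to accept '-c' and '-W' as the Linux iputils version does, and "two different servers a minute apart" is read as: ping one server, wait a minute, ping the other.

```python
import subprocess
import time


def ping(host, timeout_s=5):
    """Return True if a single echo request to host is answered.

    Assumes a Linux iputils-style ping that accepts -c and -W.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def interface_dead(servers, ping_fn=ping, sleep_fn=time.sleep, gap_s=60):
    """Declare the interface dead only if two different servers,
    pinged gap_s seconds apart, both fail to answer."""
    first, second = servers
    if ping_fn(first):
        return False
    sleep_fn(gap_s)
    return not ping_fn(second)


# Assumed interface and module names; adjust for the local system.
RELOAD_COMMANDS = [
    ["ifdown", "eth0"],
    ["rmmod", "tulip"],
    ["modprobe", "tulip"],
    ["ifup", "eth0"],
]


def reload_driver(run_fn=subprocess.call):
    """Run the ifdown/rmmod/modprobe/ifup sequence in order."""
    for cmd in RELOAD_COMMANDS:
        run_fn(cmd)


def recover(server, ping_fn=ping, reload_fn=reload_driver,
            sleep_fn=time.sleep, retry_pause_s=600, max_attempts=None):
    """Reload the driver, then ping server; on failure pause and retry.

    Returns the number of reload attempts needed, or None if
    max_attempts was reached without restoring connectivity.
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        reload_fn()
        if ping_fn(server):
            return attempts
        sleep_fn(retry_pause_s)
    return None
```

The ping, sleep, and reload steps are passed in as parameters so the decision logic can be exercised without root privileges or a real network; a periodic job would call interface_dead() and, when it returns True, recover().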
--
Dr. Josip Loncaric, Senior Staff Scientist mailto:josip@icase.edu
ICASE, Mail Stop 132C PGP key at http://www.icase.edu./~josip/
NASA Langley Research Center mailto:j.loncaric@larc.nasa.gov
Hampton, VA 23681-2199, USA Tel. +1 757 864-2192 Fax +1 757 864-6134