tulip bug, alternative fix, #3 (tested)

Edward Schlunder kernel@venus.ajusd.org
Sun Nov 28 02:54:28 1999


On Sun, 28 Nov 1999, Wolfgang Walter wrote:

> How long transmitting does not work very well after hang situation is
> over (rx-dropped does not increase any more?). Does it help if you
> just kill some applications whioch allocated a lot of memory?

	It stayed in the bad transmitting state up until I reset it
("ifdown eth0; ifup eth0") at least 30 minutes after it went phooey. I
didn't try killing off any applications, but I doubt memory should be a
problem. It dies in the middle of the night (around 1AM) when nobody is
using the system and the system has 1GB of memory to begin with.

> It is interesting that you run out of skbs always at a certain time.
> What runs then (do you start special cron-jobs?). How much memory do
> you have and how big is your swap?

	Yes, it is rather interesting that it always konks out at around
1AM. Let me explain my set up... I have 8 servers: a main one (the one
that's konking out) with 1GB RAM and quad PPro 200s, and 7 site ones with
512MB RAM and dual PPro 200s. The site server each have their time
syncronized via xntpd to the main server. At 1:01AM, each server is set up
to run a cronjob that basically does "mt lock" to lock a tape in the SCSI
tape drive (of which there hasn't been one lately), fails, and then sends
off an email complaining to the main server. The main server also runs
nntpsend every hour at 1 minute past the hour (ie, *:01).  

	The tape drive cronjob seems to be part of the cause, as it
consistently crashes the network every night around 1AM -- except for the
two nights (Sun and Mon) that the tape drive cronjobs were turned off last
week. But -- it's not just the tape drive cornjob because one day I tried
setting all the servers to do the tape drive job hourly (ie, *:01) and it
didn't crash during the day (I only tried a few times and then set it
back).

> How long does your skript wait before it resets eth0? Because with my
> script Rx should remain in supsend mode until skbs are again
> available. Could you send me the output of ifconfig eth0? If your
> script detects the situation:

	The network reset script was set to run every 5 minutes and reset
immediately if it finds the net broken. I have now set it to run at *:05
hourly and to show "ifconfig eth0 ; sleep 2" ten times before resetting
the network card.

> this patch works for me. Would be nice if you could confirm it :-). Change
> compared with my previous one I sent you: TimerInt handling changed.

	Will do.. Since I've never had the system crash before on Sun or
Mon night, we will not really know if your patch is working or if it it
just never happens on Sun/Mon if it doesn't crash. ;-) But after Tuesday
night we will know for sure.

--
Ed Schlunder <zilym@SPAMNOT.asu.edu>