bug in tulip_rx? corrected patch 2

Wolfgang Walter wolfgang.walter@stusta.mhn.de
Fri Nov 26 19:31:17 1999


On Fri, Nov 26, 1999 at 05:51:46PM +0100, Wolfgang Walter wrote:
> On Fri, Nov 26, 1999 at 01:48:43AM -0700, Edward Schlunder wrote:
> > On Thu, 25 Nov 1999, Wolfgang Walter wrote:
> > 
> > > another version of my patch. My previous one generally works but has a
> > > problem when we get a lot of packets while we drop packets. This is
> > > because I did not return a work_done>0 in this case.
> > 
> > Okay, one of my systems consistently runs into the tulip_rx bug every
> > night (other than Sun/Mon) between 1:00AM and 1:05AM.  Your first and
> > second patches didn't seem to make any differences. Last night, on the
> 
> To begin: thank you very much for your mail. This helps me a lot in finding a
> good way to fix the bug.
> 
> The first one was buggy, the second one has problems if you constantly receive
> a lot of packets - one must call it buggy, too :-).
> 
> > dot, my system got stuck in the "Suspended -- no Rx buffers" state with
> > your second patch:
> > 
> > Date: Thu, 25 Nov 1999 01:05:00 -0700
> > From: Cron Daemon <root@ajusd.org>
> > To: edward@ajusd.org
> > Subject: Cron <root@imars> /root/bin/reset-eth0.pl
> > 
> > tulip-diag.c:v1.19 10/2/99 Donald Becker (becker@cesdis.gsfc.nasa.gov)
> > Index #1: Found a Lite-On 82c168 PNIC adapter at 0xe400.
> > Lite-On 82c168 PNIC chip registers at 0xe400:
> >   00008000 01ff01ff 42422600 00f03810 00f03a10 026800d7 816ce202 0000eb28
> >   000006aa 00000000 00f03b00 05454844 00000020 00000000 00000000 10000001
> >   00000000 00000000 f0041385 000000bf 609641e1 00f03870 393e8810 0001e978
> >   00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> >  Port selection is MII, full-duplex.
> >  Transmit started, Receive started, full-duplex.
> >   The Rx process state is 'Suspended -- no Rx buffers'.
> >   The Tx process state is 'Idle'.
> >   The transmit unit is set to store-and-forward.
> >  A simplifed EEPROM data table was found.
> >  The EEPROM does not contain transceiver control information.
> >  MII PHY found at address 1, status 0x782d.
> >  MII PHY #1 transceiver registers:
> >    3000 782d 0040 6212 01e1 41e1 0003 0000
> >    0000 0000 0000 0000 0000 0000 0000 0000
> >    5000 0300 0000 0000 0000 0f74 0700 0000
> >    003f 853e 0f00 ff00 002f 4000 80a0 000b.
> > Couldn't ping 206.207.140.1. Assuming network offline, restarting eth0
> > 
> > This third patch of yours I tried and tonight got a different result:
> > 
> > tulip-diag.c:v1.19 10/2/99 Donald Becker (becker@cesdis.gsfc.nasa.gov)
> > Index #1: Found a Lite-On 82c168 PNIC adapter at 0xe400.
> > Lite-On 82c168 PNIC chip registers at 0xe400:
> >   00008000 01ff0000 aaaa6100 00f03010 00f03210 02670055 810ca202 0000fbaf
> >   00000000 00000000 00f032b0 30a4cc60 00000020 00000000 00000000 10000001
> >   00000000 00000000 f0041385 000000bf 60fe000b 00f03060 393e8010 0001e978
> >   00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> >  Port selection is MII, full-duplex.
> >  Transmit started, Receive started, full-duplex.
> >   The Rx process state is 'Waiting for packets'.
> >   The Tx process state is 'Idle'.
> >   The transmit threshold is 512.
> >  Interrupt sources are pending!  CSR5 is 02670055.
> >    Tx done indication.
> >    Tx out of buffers indication.
> >    Link passed indication.
> >    Rx Done indication.
> >  A simplifed EEPROM data table was found.
> >  The EEPROM does not contain transceiver control information.
> >  MII PHY found at address 1, status 0x782d.
> >  MII PHY #1 transceiver registers:
> >    3000 782d 0040 6212 01e1 41e1 0003 0000
> >    0000 0000 0000 0000 0000 0000 0000 0000
> >    5000 0300 0000 0000 0000 00af 0100 0000
> >    003f 853e 0f00 ff00 002f 4000 80a0 000b.
> > 
> > The machine seems to be able to receive packets fine, but now transmitting
> > doesn't work very well. Pinging the machine from my home computer over
> > cable modem normally looks something like:
> > 
> > 64 bytes from 206.207.140.3: icmp_seq=0 ttl=242 time=69.1 ms
> > 64 bytes from 206.207.140.3: icmp_seq=1 ttl=242 time=77.1 ms
> > 64 bytes from 206.207.140.3: icmp_seq=2 ttl=242 time=57.5 ms
> > 
> > But right now it looks like:
> > 
> > 64 bytes from 206.207.140.5: icmp_seq=1290 ttl=242 time=22351.6 ms
> > 64 bytes from 206.207.140.5: icmp_seq=1293 ttl=242 time=25455.2 ms
> > 64 bytes from 206.207.140.5: icmp_seq=1291 ttl=242 time=27456.5 ms
> > 
> > tcpdump -i eth0 shows the machine receiving packets seemingly fine.
> > ifconfig eth0:
> > eth0      Link encap:Ethernet  HWaddr 00:A0:CC:55:8D:99  
> >           inet addr:206.207.140.5  Bcast:206.207.140.255 Mask:255.255.255.0
> >           IPX/Ethernet 802.2 addr:000143AA:00A0CC558D99
> >           IPX/Ethernet 802.3 addr:00000002:00A0CC558D99
> >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> >           RX packets:267115 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:214608 errors:121 dropped:0 overruns:2 carrier:3
> >           collisions:0 txqueuelen:100 
> >           Interrupt:20 Base address:0xe400 
> > 
> 
> Yes, this is what I expect. A short explanation:
> 
> In the interrupt handler we have a loop which finishes when no further event
> from the card has arrived or when we have processed too many of them. If there
> are too many, the driver sets a timer on the card and disables all interrupts
> but the timer interrupt. This way it makes sure that we are not overstressed.
> 
> In the critical situation you have only one skb left. This means we call
> tulip_rx for every single packet. Therefore we really would be overstressed.
> And so sending is temporarily stopped, too.
> 
> What we would need is to temporarily suspend receiving.

I have to correct myself: the code does not disable all interrupts from the
card, only the last one processed. But these may be send interrupts, too.
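
As a rough illustration of the loop described above, here is a minimal sketch.
It is not code from tulip.c: struct fake_card, MAX_WORK and budget_interrupt
are made-up names, and the budget of 4 is an arbitrary assumption. It models
only the event loop, the work limit and the card timer, not which interrupt
sources get masked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the card; these names are not from tulip.c. */
struct fake_card {
    int  pending_events;  /* events the card still wants serviced */
    bool timer_armed;     /* card timer set so we get called again later */
};

enum { MAX_WORK = 4 };    /* assumed per-interrupt event budget */

/* Service events until the card is quiet or the budget is used up.
 * Returns the number of events handled in this call. */
static int budget_interrupt(struct fake_card *card)
{
    int work_done = 0;

    assert(card != NULL);
    card->timer_armed = false;

    while (card->pending_events > 0) {
        if (work_done == MAX_WORK) {
            /* Too much work: stop here and let the timer bring us back. */
            card->timer_armed = true;
            break;
        }
        card->pending_events--;
        work_done++;
    }
    return work_done;
}
```

With 10 pending events the first call handles 4 and arms the timer; the timer
is only left unarmed once a call finds the card quiet within its budget.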

> 
> I think I will have a patch tomorrow with the following strategy:
> 
> 	accept rx suspend mode (and therefore no skb can be allowed again),
> 	but if we are in rx suspend mode when leaving the interrupt handler
> 	then set the card timer
> 
> 	if the interrupt handling is called and we are already in rx suspend
> 	mode: call tulip_rx (which tries to allocate new skbs).
> 
> This should work much better than the approach I use now.

I have now implemented that and will do a short test (to check that it runs
under normal conditions); then I'll post it.
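
The two rules of that strategy can be sketched roughly as follows. This is
only a toy model and not the patch itself: struct fake_nic, RING_SIZE,
fake_rx and suspend_aware_interrupt are hypothetical names, and "rx suspended"
is simplified here to "the rx ring holds no buffer at all".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { RING_SIZE = 4 };   /* assumed rx ring size for this toy model */

/* Hypothetical model of the driver state; not the real tulip structures. */
struct fake_nic {
    int  pool;            /* buffers the allocator can still provide */
    int  ring_filled;     /* rx ring slots that currently hold a buffer */
    bool rx_suspended;    /* card has no rx buffer to receive into */
    bool timer_armed;     /* card timer set so we get called again later */
};

/* Stand-in for the rx path: refill empty ring slots while buffers exist. */
static void fake_rx(struct fake_nic *nic)
{
    while (nic->ring_filled < RING_SIZE && nic->pool > 0) {
        nic->pool--;
        nic->ring_filled++;
    }
    nic->rx_suspended = (nic->ring_filled == 0);
}

static void suspend_aware_interrupt(struct fake_nic *nic, bool rx_event)
{
    assert(nic != NULL);
    nic->timer_armed = false;

    /* Rule 2: if we are already suspended, run the rx path even without
     * an rx event, so it gets a chance to allocate new buffers. */
    if (rx_event || nic->rx_suspended)
        fake_rx(nic);

    /* Rule 1: still suspended when leaving the handler, so set the timer
     * and try again later instead of spinning. */
    if (nic->rx_suspended)
        nic->timer_armed = true;
}
```

While the pool is empty every call leaves the timer armed; as soon as buffers
become available again, the next call refills the ring and leaves suspend mode.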

> 
> Well, I had no experience with network-drivers at all. But now I get a feeling
> for it :-).
> 
> I would be very glad if you would try out this alternative approach (I'll send
> you the patch), as I sometimes have to wait 2 or 3 days to get into this
> skb shortage.

I now had this situation again and my patch really worked fine: the interface
continued to work fine after 55 packets got dropped.

Greetings,

Wolfgang Walter