bug in tulip_rx? corrected patch 2
Wolfgang Walter
wolfgang.walter@stusta.mhn.de
Fri Nov 26 11:51:51 1999
On Fri, Nov 26, 1999 at 01:48:43AM -0700, Edward Schlunder wrote:
> On Thu, 25 Nov 1999, Wolfgang Walter wrote:
>
> > another version of my patch. My previous one generally works but has a
> > problem when we get a lot of packets while we drop packets. This is
> > because I did not return a work_done>0 in this case.
>
> Okay, one of my systems consistently runs into the tulip_rx bug every
> night (other than Sun/Mon) between 1:00AM and 1:05AM. Your first and
> second patches didn't seem to make any differences. Last night, on the
To begin: thank you very much for you mail. This helps me a lot finding a
good way for fixing the bug.
The first one was buggy, the second one has problems if you constantly receive
a lot of packets - one must call it buggy, too :-).
> dot, my system go stuck in the "Suspended -- no Rx buffers" state with
> your second patch:
>
> Date: Thu, 25 Nov 1999 01:05:00 -0700
> From: Cron Daemon <root@ajusd.org>
> To: edward@ajusd.org
> Subject: Cron <root@imars> /root/bin/reset-eth0.pl
>
> tulip-diag.c:v1.19 10/2/99 Donald Becker (becker@cesdis.gsfc.nasa.gov)
> Index #1: Found a Lite-On 82c168 PNIC adapter at 0xe400.
> Lite-On 82c168 PNIC chip registers at 0xe400:
> 00008000 01ff01ff 42422600 00f03810 00f03a10 026800d7 816ce202 0000eb28
> 000006aa 00000000 00f03b00 05454844 00000020 00000000 00000000 10000001
> 00000000 00000000 f0041385 000000bf 609641e1 00f03870 393e8810 0001e978
> 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> Port selection is MII, full-duplex.
> Transmit started, Receive started, full-duplex.
> The Rx process state is 'Suspended -- no Rx buffers'.
> The Tx process state is 'Idle'.
> The transmit unit is set to store-and-forward.
> A simplifed EEPROM data table was found.
> The EEPROM does not contain transceiver control information.
> MII PHY found at address 1, status 0x782d.
> MII PHY #1 transceiver registers:
> 3000 782d 0040 6212 01e1 41e1 0003 0000
> 0000 0000 0000 0000 0000 0000 0000 0000
> 5000 0300 0000 0000 0000 0f74 0700 0000
> 003f 853e 0f00 ff00 002f 4000 80a0 000b.
> Couldn't ping 206.207.140.1. Assuming network offline, restarting eth0
>
> This third patch of yours I tried and tonight got a different result:
>
> tulip-diag.c:v1.19 10/2/99 Donald Becker (becker@cesdis.gsfc.nasa.gov)
> Index #1: Found a Lite-On 82c168 PNIC adapter at 0xe400.
> Lite-On 82c168 PNIC chip registers at 0xe400:
> 00008000 01ff0000 aaaa6100 00f03010 00f03210 02670055 810ca202 0000fbaf
> 00000000 00000000 00f032b0 30a4cc60 00000020 00000000 00000000 10000001
> 00000000 00000000 f0041385 000000bf 60fe000b 00f03060 393e8010 0001e978
> 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> Port selection is MII, full-duplex.
> Transmit started, Receive started, full-duplex.
> The Rx process state is 'Waiting for packets'.
> The Tx process state is 'Idle'.
> The transmit threshold is 512.
> Interrupt sources are pending! CSR5 is 02670055.
> Tx done indication.
> Tx out of buffers indication.
> Link passed indication.
> Rx Done indication.
> A simplifed EEPROM data table was found.
> The EEPROM does not contain transceiver control information.
> MII PHY found at address 1, status 0x782d.
> MII PHY #1 transceiver registers:
> 3000 782d 0040 6212 01e1 41e1 0003 0000
> 0000 0000 0000 0000 0000 0000 0000 0000
> 5000 0300 0000 0000 0000 00af 0100 0000
> 003f 853e 0f00 ff00 002f 4000 80a0 000b.
>
> The machine seems to be able to recieve packets fine, but now transmitting
> doesn't work very well. Pinging the machine from my home computer over
> cable modem normally looks something like:
>
> 64 bytes from 206.207.140.3: icmp_seq=0 ttl=242 time=69.1 ms
> 64 bytes from 206.207.140.3: icmp_seq=1 ttl=242 time=77.1 ms
> 64 bytes from 206.207.140.3: icmp_seq=2 ttl=242 time=57.5 ms
>
> But right now it looks like:
>
> 64 bytes from 206.207.140.5: icmp_seq=1290 ttl=242 time=22351.6 ms
> 64 bytes from 206.207.140.5: icmp_seq=1293 ttl=242 time=25455.2 ms
> 64 bytes from 206.207.140.5: icmp_seq=1291 ttl=242 time=27456.5 ms
>
> tcpdump -i eth0 shows the machine recieving packets seemingly fine.
> ifconfig eth0:
> eth0 Link encap:Ethernet HWaddr 00:A0:CC:55:8D:99
> inet addr:206.207.140.5 Bcast:206.207.140.255 Mask:255.255.255.0
> IPX/Ethernet 802.2 addr:000143AA:00A0CC558D99
> IPX/Ethernet 802.3 addr:00000002:00A0CC558D99
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:267115 errors:0 dropped:0 overruns:0 frame:0
> TX packets:214608 errors:121 dropped:0 overruns:2 carrier:3
> collisions:0 txqueuelen:100
> Interrupt:20 Base address:0xe400
>
Yes, this is what I expect. Short explanation.
In the interrupt handler we have a loop which finish when no further event
of the card arrived or if we processed to many of them. If there are to
many, the driver sets a timer on the card and disables all interrupts but
the timer interrupt. This way it controls that we are not overstressed.
In the criticial situation you have only one skb left. This means we call
tulip_rx for every single packet. Therefor we would be really overstressed.
And so sending is temporarly stopped, too.
What we would need is to temporarly suspend receiving.
I think I will have a patch tomorrow with the following strategy:
accept rx suspend mode (and therefor no skb can be allowed again),
but if we are in rx suspend mode when leaving the interrupt handler
then set the card timer
if the interrupt handling is called and we are already in rx suspend
mode: call tulip_rx (which tries to allocate new skbs).
This should work much better then the approach I use now.
Well, I had no experience with network-drivers at all. But now I get a feeling
for it :-).
I would be very glad if you would try out this alternativ approach (I'll send
you the patch). As I sometimes have to wait 2 or 3 days to get into this
skb-shortage it.
Greetings,
Wolfgang Walter
> --
> Ed Schlunder <zilym@SPAMNOT.asu.edu>
>