"Transmit timed out" with EtherExpress Pro100B
    Donald Becker 
    becker@cesdis1.gsfc.nasa.gov
    Wed Oct  7 09:13:15 1998
    
    
  
On 7 Oct 1998, Osma Ahvenlampi wrote:
> > If you have any ideas I'd really appreciate them. The error I get is
> > "status 0050 command 0000" generated by the function
> > "speedo_tx_timeout". I believe the transmitter _has_ hung, as the system
> > locks totally when I remove the transmitter restart, and I suspect it is
> > down to a bug in the chip I outlined in my previous e-mail.
> 
> okay.. My problem (same error message) went away with this patch:
>  /* Maximum number of multicast addresses to filter (vs. rx-all-multicast) */
> -static int multicast_filter_limit = 64;
> +static int multicast_filter_limit = 0;
You can change this value when loading the module:
   options eepro100 multicast_filter_limit=0
> Try it, and see what happens. I think multicast_filter_limit might
> be safely set to 3 before problems start appearing, but I'm not
> certain about it. I think I'll try it out now. Will find out if it
> works in an hour at max..
That would be useful data.
The multicast filter on the EEPro100 is loaded by a pseudo-transmit command.
Three multicast addresses is the breakpoint between putting the multicast
address filter list in a single transmit descriptor vs. allocating a longer
special descriptor just for loading the multicast list.  (line 1487)
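
For illustration only, the choice described above can be sketched roughly as follows. This is not the driver's code: the enum, the function name and the exact comparisons are assumptions. Only the breakpoint of three addresses and the fallback to rx-all-multicast above multicast_filter_limit come from this thread.

```c
#include <assert.h>

/* Hypothetical names, not taken from eepro100.c. */
enum mc_setup_kind {
    MC_SETUP_INLINE,      /* list fits in a normal transmit descriptor */
    MC_SETUP_LONG,        /* needs the separately allocated long command */
    MC_RX_ALL_MULTICAST   /* too many addresses: stop filtering */
};

/* Breakpoint quoted in the text above. */
#define INLINE_MC_MAX 3

static enum mc_setup_kind choose_mc_setup(int mc_count, int filter_limit)
{
    assert(mc_count >= 0 && filter_limit >= 0);

    /* Assumed: more addresses than the limit means receive all multicast. */
    if (mc_count > filter_limit)
        return MC_RX_ALL_MULTICAST;

    /* Up to three addresses go into a single transmit descriptor. */
    if (mc_count <= INLINE_MC_MAX)
        return MC_SETUP_INLINE;

    /* Otherwise the longer special descriptor is required. */
    return MC_SETUP_LONG;
}
```

Under these assumptions, a limit of 0 or 3 never reaches the MC_SETUP_LONG case, which would be consistent with the timeouts disappearing at those settings.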
> Anyway, if you look into the code, you'll see the variable is used in
> set_rx_mode, and there really shouldn't be any direct relation between
> the two functions (set_rx_mode and speedo_start_xmit). These are both
> entry points in the device structure, so they're functions called from
> other parts of the kernel. Apparently the 100 lines beginning from
> line 1487 is the cause of this problem. What it exactly does I really
> don't have a clue of - I've never been good with hardware programming,
> and without even any documentation, it might as well be a binary dump
> for all I can understand of it. In any case, the commands are sent to
> the device by appending them into the Tx queue, so it really isn't
> inconceivable that the queue is getting corrupted. That would be
> consistent with the physical evidence (massive packet flooding
> monitored on other hosts on the network).
That was the earlier problem.  Prior to v1.03 the driver would occasionally
corrupt the transmit list when adding a long SetMulticastFilter command.
The driver keeps a long SetMulticastFilter command in sp->mc_setup_frm, with
the current length in sp->mc_setup_frm_len.  If the multicast list grows
so that it won't fit in the current command, the driver allocates a longer
command with some slack (line 1531).  This new allocation causes a bunch of
problems.
[[The whole concept of changing the receive state by queueing a
command on a potentially-long transmit is broken.  Intel chips do this for
obscure historical reasons.  But it causes unpredictable delays when
changing the Rx mode and might lead to multiple commands on the Tx
queue.  Worse, the Tx queue command processing might be stopped by a
hardware flow control signal from the other end.]]
!!!Hmmm, writing up this description points out a potential race condition:
if the multicast list is rapidly extended, the list might grow again before
the first command is processed.  Here is a patch for line 1530 that wastes a
little memory, but avoids complicated code to fix the problem:
	/* Allocate a new frame, 10bytes + addrs, with a few
	   extra entries for growth. */
	if (sp->mc_setup_frm)
		kfree(sp->mc_setup_frm);
-	sp->mc_setup_frm_len = 10 + dev->mc_count*6 + 24;
+	/* Avoid growth allocation race by allocating a max-sized entry. */
+	sp->mc_setup_frm_len = 10 + multicast_filter_limit*6 + 6;
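
To make the arithmetic in the patch concrete, here is a small stand-alone sketch of the two sizing rules. The helper names are made up for illustration. The constants (a 10-byte header, 6 bytes per address, 24 or 6 bytes of slack) are the ones in the patch above.

```c
#include <assert.h>
#include <stddef.h>

#define MC_SETUP_HEADER_BYTES 10  /* "10bytes + addrs" from the patch comment */
#define ETH_ADDR_BYTES         6  /* one Ethernet address */

/* Old rule: size for the current list plus 24 bytes (four addresses) of slack. */
static size_t mc_frame_len_growing(int mc_count)
{
    assert(mc_count >= 0);
    return MC_SETUP_HEADER_BYTES + (size_t)mc_count * ETH_ADDR_BYTES + 24;
}

/* Patched rule: size once for the largest list the filter will ever hold. */
static size_t mc_frame_len_max(int filter_limit)
{
    assert(filter_limit >= 0);
    return MC_SETUP_HEADER_BYTES + (size_t)filter_limit * ETH_ADDR_BYTES + 6;
}

/* Nonzero if a frame of frame_len bytes can hold mc_count addresses. */
static int mc_frame_fits(size_t frame_len, int mc_count)
{
    assert(mc_count >= 0);
    return frame_len >= MC_SETUP_HEADER_BYTES + (size_t)mc_count * ETH_ADDR_BYTES;
}
```

With the default multicast_filter_limit of 64 the patched frame is 10 + 64*6 + 6 = 400 bytes. That holds any list the filter accepts, so the frame never has to be reallocated while an earlier command is still queued.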
Hmmm, let me read the code another dozen times.  I suspect that there is
another race condition that might occur...  I don't know if the best
solution is to allocate a new command each time, which could result in high
kmalloc()/kfree() cost, or to avoid queueing more than one SetMulticastFilter
command at a time, which increases the latency for multicast filter commands
to take effect.
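
As a rough sketch of the second option (never more than one SetMulticastFilter command queued at a time), the bookkeeping might look something like this. Everything here is hypothetical: the struct, the function names and the counter are invented for illustration. It also ignores locking against the interrupt handler, which a real driver would need.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state, not fields of the real driver's private struct. */
struct mc_setup_state {
    bool in_flight;  /* a SetMulticastFilter command is on the Tx queue */
    bool stale;      /* the list changed while that command was queued */
    int  issued;     /* commands handed to the queue (for illustration) */
};

/* Called when the multicast list changes (e.g. from set_rx_mode). */
static void mc_list_changed(struct mc_setup_state *s)
{
    if (s->in_flight) {
        /* Do not queue a second command; remember to redo it later. */
        s->stale = true;
        return;
    }
    s->in_flight = true;
    s->issued++;
}

/* Called when the chip has finished processing the queued command. */
static void mc_command_done(struct mc_setup_state *s)
{
    assert(s->in_flight);
    s->in_flight = false;
    if (s->stale) {
        /* The list moved on in the meantime: issue one fresh command. */
        s->stale = false;
        s->in_flight = true;
        s->issued++;
    }
}
```

Several rapid list changes collapse into at most one follow-up command. The cost is the extra latency mentioned above: the newest list only takes effect after the earlier command completes.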
Donald Becker					  becker@cesdis.gsfc.nasa.gov
USRA-CESDIS, Center of Excellence in Space Data and Information Sciences.
Code 930.5, Goddard Space Flight Center,  Greenbelt, MD.  20771
301-286-0882	     http://cesdis.gsfc.nasa.gov/people/becker/whoiam.html