[eepro100] Command unit failed to mark command 00000000 as complete
-- what does it mean?
Robert C. Paulsen, Jr.
paulsen@texas.net
Tue, 18 Jul 2000 19:50:18 -0500
Donald,
Thanks for the reply.
The version of the driver (from the source) is: eepro100.c:v1.09r2 10/15/99.
This is from a SuSE 6.4 distribution. The card itself has the following
markings on the chip:
582557
L7233192
SL24Z
(c) 1989 1995
I have swapped out the eepro100 for a RealTek RTL8139 and am now using your
driver: rtl8139.c:v1.08 6/25/99. So far, so good! (And your reputation is
just fine with me!)
Donald Becker wrote:
>
> On Mon, 17 Jul 2000, Robert C. Paulsen, Jr. wrote:
>
> > Subject: [eepro100] Command unit failed to mark command 00000000 as complete
> > -- what does it mean?
> >
> > My /var/log/messages file has a few hundred of the following messages.
> > This started about 3 days ago.
>
> What driver version are you using?
>
> > Jul 17 14:46:21 home kernel: eth0: Command unit failed to mark command 00000000 as complete at 78644.
>
> This message indicates that the eepro100 you are using has a bug where it
> skipped marking a command as complete.
>
> When this occurs it means that the chip has corrupted its internal state.
> The driver can reset the chip, but the same problem will recur almost
> immediately. The driver recovers from this problem, but the recovery is
> slower than normal operation. The only full recovery seems to be a hard
> reset or powering off the system.
>
> This bug appears on no errata list that I have seen. It seems to affect
> only a few chip versions, and be triggered by only some motherboards.
>
> This bug was a nasty problem, and it gave me a bad reputation. It's the
> kind of bug where it would happen to someone, they would make a random
> change to the driver, and their updated driver would run reliably for a
> week. They would submit the change as a "bug fix". When I stated that
> their change didn't fix any obvious bug, they would stomp off and call me
> names. After all, they had seen my driver stop repeatedly in the span of a
> few minutes, and their driver just ran for a whole week without a problem.
> This very situation happened to Linus, and he never admitted that his
> changes to eepro100 didn't fix the problem. He just believed that I had
> some other hidden flaw in the driver.
>
> In v1.09s I added an explicit check for this case. Here is that change
> log entry -- look at entry #7. At this point I still wasn't certain that
> descriptor skipping was A Bug:
>
> ________________
> date: 1999/09/30 00:55:38; author: becker; state: Exp; lines: +283 -222
> eepro100.c v1.09s 9/29/99
> Updated to track the "kern-2.3" version.
>
> Added TX_QUEUE_UNFULL, the queue length where we once again accept Tx packets.
>
> Shuffled the kernel version compatibility code around and added local version
> of the pci-scan routines.
>
> Added a new PCI device ID 0x1029, reported by Russ Nelson.
>
> Changed clear_suspend() to use a byte write rather than an atomic bit op.
>
> Changed the Tx-timeout check to avoid false triggers. This included adding
> a last_cmd_time variable.
>
> Changed to struct net_device from struct device.
>
> Always write SCBCmd as byte-wide rather than word-wide.
>
> Added explicit descriptor-skipped check when scavenging the command list.
>
> Reset the chip when shutting down the interface, rather than just stopping it,
> to disable flow control packets that might be sent.
>
> Changed the ordering of command queue operations to eliminate the window
> where sp->cur_tx points to a not-yet-valid command. We should no longer need
> a lock in the interrupt routine, and the locked regions when adding a command
> are shorter. (Note: the locks have not been moved to take advantage of this.)
> ----------------------------
>
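For anyone reading this thread later, here is a rough sketch of what a
"descriptor-skipped check when scavenging the command list" might look like.
This is not the eepro100.c code. The struct layout, RING_SIZE, the
STATUS_COMPLETE value and the scavenge() function are all assumptions made up
for illustration; only the cur_tx name comes from the change log above, and
dirty_tx is a hypothetical companion index.

```c
#include <stdint.h>

#define RING_SIZE       8       /* hypothetical ring length */
#define STATUS_COMPLETE 0x8000  /* assumed "command complete" status bit */

struct cmd_desc {
    uint16_t status;            /* the chip is assumed to set STATUS_COMPLETE here */
};

struct cmd_ring {
    struct cmd_desc ring[RING_SIZE];
    unsigned int cur_tx;        /* next slot the driver will fill */
    unsigned int dirty_tx;      /* oldest slot not yet reclaimed */
};

/*
 * Reclaim finished commands between dirty_tx and cur_tx.
 * A descriptor that is NOT marked complete while a later one IS marked
 * complete is treated as skipped by the chip: it is counted in *skipped
 * and reclaimed anyway so the queue does not stall.
 * Returns the number of descriptors reclaimed.
 */
static unsigned int scavenge(struct cmd_ring *r, unsigned int *skipped)
{
    unsigned int reclaimed = 0;

    *skipped = 0;
    while (r->dirty_tx != r->cur_tx) {
        unsigned int entry = r->dirty_tx % RING_SIZE;

        if (!(r->ring[entry].status & STATUS_COMPLETE)) {
            int later_done = 0;
            unsigned int i;

            /* Look ahead: has the chip completed any later command? */
            for (i = r->dirty_tx + 1; i != r->cur_tx; i++) {
                if (r->ring[i % RING_SIZE].status & STATUS_COMPLETE) {
                    later_done = 1;
                    break;
                }
            }
            if (!later_done)
                break;          /* the chip simply has not got this far yet */
            (*skipped)++;       /* a real driver would log a warning here */
        }
        r->ring[entry].status = 0;
        r->dirty_tx++;
        reclaimed++;
    }
    return reclaimed;
}
```

With four commands queued and only the first and third marked complete, this
sketch reclaims three descriptors and reports one as skipped. I assume the
real driver prints the "failed to mark command ... as complete" message at
roughly the point where the sketch increments *skipped, but I have not checked
that against the source.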
--
____________________________________________________________________
Robert Paulsen paulsen@texas.net