[eepro100] Re: Eepro100 1.36 on Alpha Linux 2.4.2 transmit timeout
Andrey Savochkin
saw@saw.sw.com.sg
Wed, 25 Apr 2001 06:42:18 -0700
Hello,
On Mon, Apr 23, 2001 at 02:42:26PM +0200, Cabaniols, Sebastien wrote:
> I am working on a cluster of ES40 Alpha servers (4 cpus)
> with DE600 boards under Alpha Linux 2.4.2smp. The machines
> have 8 Gigabytes of RAM (this has been an issue with
> the myrinet boards)
[snip]
>
> I get
>
> NETDEV WATCHDOG | eth0 : transmit time out.
> status 0090 0c00 at XXXXX/YYYYY command 000ca000
> wait_for_cmd_done timeout.
On Wed, Apr 25, 2001 at 02:38:58PM +0200, Cabaniols, Sebastien wrote:
> My hardware configuration is:
>
> AlphaServer ES40, 4 cpus, 8 Gigas of RAM
[snip]
> As long as I do not stress the network too much, everything is fine. I can
> transfer small files, but when I do big transfers I see in /var/log/messages:
>
> NETDEV WATCHDOG | eth0: transmit timeout
> status 0090 0c00 at xxxxxx/xxxxxx command 000ca000
> wait_cmd_done timeout.
>
>
> If I insist and launch another transfer, the system freezes, I lose the
> console and the network, and I must do a hard reboot.
The timeouts are most likely caused by a race condition in the status word
update.
Try the patch quoted below, with the proposed change of using just
#if defined(__alpha__)
As for the complete freeze of the system, I have no idea why it happens.
Andrey
Date: Tue, 20 Feb 2001 17:26:37 -0500
From: Jay Estabrook <Jay.Estabrook@compaq.com>
To: Matt Wilson <msw@redhat.com>
Cc: Andrey Savochkin <saw@saw.sw.com.sg>, Richard Henderson <rth@redhat.com>,
Alan Cox <alan@redhat.com>, "Goshdigian, John" <John.Goshdigian@compaq.com>,
Pat Rago <prago@redhat.com>, George France <budan@excite.com>,
George France <george.france2@compaq.com>, Preston Brown <pbrown@redhat.com>
Subject: Re: PATCH: eepro100 hangs on Alpha - atomic bit ops
Message-ID: <20010220172637.B2182@linux04.mro.cpqcorp.net>
References: <20010219152247.A22256@saw.sw.com.sg> <20010219194603.A31644@devserv.devel.redhat.com> <20010219164949.A26051@redhat.com> <20010219171117.A23867@saw.sw.com.sg> <20010219171550.A26061@redhat.com> <20010219172406.A23932@saw.sw.com.sg> <20010219173437.A26085@redhat.com> <20010219174114.A24055@saw.sw.com.sg> <20010219174419.B26085@redhat.com> <20010220130313.X9499@devserv.devel.redhat.com>
On Tue, Feb 20, 2001 at 01:03:14PM -0500, Matt Wilson wrote:
>
> OK, new version of the patch attached.
> --- linux/drivers/net/eepro100.c.alpha Tue Feb 20 12:54:35 2001
> +++ linux/drivers/net/eepro100.c Tue Feb 20 12:57:33 2001
> @@ -341,14 +341,17 @@
> /* Clear CmdSuspend (1<<30) avoiding interference with the card access to the
> status bits. Previous driver versions used separate 16 bit fields for
> commands and statuses. --SAW
> - FIXME: it may not work on non-IA32 architectures.
> */
> -#if defined(__LITTLE_ENDIAN)
> -#define clear_suspend(cmd) ((__u16 *)&(cmd)->cmd_status)[1] &= ~0x4000
> -#elif defined(__BIG_ENDIAN)
> -#define clear_suspend(cmd) ((__u16 *)&(cmd)->cmd_status)[1] &= ~0x0040
> +#if defined(__alpha__) && !defined (__alpha_bwx__)
> +# define clear_suspend(cmd) clear_bit(30, &(cmd)->cmd_status);
> #else
> -#error Unsupported byteorder
> +# if defined(__LITTLE_ENDIAN)
> +# define clear_suspend(cmd) ((__u16 *)&(cmd)->cmd_status)[1] &= ~0x4000
> +# elif defined(__BIG_ENDIAN)
> +# define clear_suspend(cmd) ((__u16 *)&(cmd)->cmd_status)[1] &= ~0x0040
> +# else
> +# error Unsupported byteorder
> +# endif
> #endif
I do NOT believe the above will completely solve the problem.
First, I assume that the cmd->cmd_status[] array is in HOST memory, i.e.
not PCI memory on the ethernet card. If it *is* PCI memory, AFAIK
there's no way to do an atomic update on Alpha. End of discussion.
Second, BWX instructions won't buy you atomicity WRT the above operation.
On *all* Alphas, you MUST use the clear_bit() code.
Third, you MUST guarantee that the clear_bit() operand is aligned
correctly for the operation (I believe it must be a 32-bit quantity,
and thus on a 32-bit, i.e. 4-byte, boundary). If it's not, the
load-locked and store-conditional instructions that are part of the
clear_bit() code will NOT operate correctly.
Bottom line: this
> +#if defined(__alpha__) && !defined (__alpha_bwx__)
should be just
> +#if defined(__alpha__)
--Jay++
-----------------------------------------------------------------------------
Jay A Estabrook Alpha Engineering - LINUX Project
Compaq Computer Corp. - MRO1-2/K20 (508) 467-2080
200 Forest Street, Marlboro MA 01752 Jay.Estabrook@compaq.com
-----------------------------------------------------------------------------