AW: Some comments...
Stauffer, Walter (Exchange)
w.stauffer@galenica.ch
Thu Jan 13 05:13:56 2000
Am I the only person on earth having the following problem with
i82558B NICs?
- put a box on the net which waits for requests (i.e. a server)
- send a request (connect to FTP, WWW, even ping)
-> 50% of the NICs stop responding after some time!
They can be recovered by having the server create some network
traffic by itself (ping something).
I have observed this with on-board NICs and also with PCI NICs
from IBM, under NT, Win95/98, Linux, and even DOS ... so don't
tell me it's a driver issue.
Regards,
Walter
>I have been in contact with these NICs in various forms under NT for
>quite some time and I have never seen any errors or problems there. I
>also have never seen any problems under Linux, but my server is a
>Quake2/Quake3 server with some ftp/http. I'm on a half-duplex 10BaseT
>link as well. I am assuming my bandwidth hasn't reached a critical level
>or it is full duplex that gives the NIC fits. My machine is a dual
>PII400 (Gigabyte MB) running Red Hat 6.0, kernel 2.2.13 and ver 1.06 of
>the NIC driver.
>My guess is that Intel has 1) worked around issues with the chipset in
>their Windows drivers to hide design problems and/or 2) not released
>complete specs on the board, and this is causing problems.
>If I knew what I was doing when it came to C or networking drivers I'd
>create a driver that followed Intel's specs 100% and then work off of
>that (not that Donald has not done this). That way you eliminate any
>deviations from the specs as the culprit. Just my $0.02.
>If anyone has a way for me to test to see if I can crap out my NIC I'd
>be willing to do that and feed the results back to the list.
>
>
>Scott
>
>
>----- Original Message -----
>From: "yhersch" <yhersch@allot.com>
>To: <linux-eepro100@beowulf.gsfc.nasa.gov>
>Sent: Wednesday, September 08, 1999 7:09 AM
>Subject: Some comments...
>
>
>> Hi,
>>
>> I've been following the various discussions concerning the operation (or
>> inoperation?) of the eepro100. Until now I haven't had much to contribute.
>> However, things got hairy and I had no choice but to figure out what's
>> going on. Some observations...
>>
>> 1) My feeling (OK, this isn't an observation) all along has been that the
>> Intel chip itself has some basic flaw. It seems to get confused and there
>> is no way to recover gracefully. I have no proof, but look at the topics
>> discussed in this mailing list (receive hangs, transmit timeouts, etc). On
>> second thought, maybe this IS an observation.
>>
>> 2) We (Allot Communications) started experiencing crashes when we upgraded
>> to a faster system board. I made an assumption (yes, I know what ass-u-me
>> means), at least for this exercise (other possibilities of course exist)
>> that the problem was timing based. More specifically, the new system board
>> is TOO fast, and the NIC can't keep up. This could be caused by an improper
>> board design, which doesn't allow certain signals to stabilize properly
>> (quickly enough), or it could be a bug in the NIC itself (see #1 above).
>> Another possibility is that the chip just isn't designed to operate in
>> high-speed systems, and either certain hardware or software design changes
>> or workarounds are necessary. Workarounds make me nervous - they often
>> translate into reduced performance.
>>
>> 3) So, I got my hands dirty and started mucking around with the driver.
>> Most of my experiments involved various delays and code shuffling in the
>> driver's interrupt routine. Yeah, you all read correctly, delays in an
>> interrupt routine - If any of my computer science instructors were dead
>> today they'd be rolling in their graves. Of interest:
>> ==> The proper delay inserted between reading the interrupt status and
>> acking the interrupts (writing back to the same register) keeps the board
>> from crashing. The size of the delay is particularly sensitive - if too
>> low, the system crashes; if too high, the ISR is overworked. Performance
>> results varied with the delay value.
>> ==> Acking the interrupts twice (two sequential writes to the status
>> register) also kept the system from crashing, however performance suffered
>> significantly.
>> ==> I was unsuccessful in my attempts at removing the delay by shuffling the
>> code around. The system continued to crash. More research and
>> experimentation is necessary to find an alternative to the delay. In my
>> opinion, adding a delay is an evil workaround due to faulty hardware
>> behavior and it will negatively affect performance.
>>
>> 4) I discovered some potential problems with the driver itself. The Intel
>> User's Guide clearly RECOMMENDS that all accesses to the command and status
>> registers be limited to byte-wide access to avoid any side-effects.
>> However, the driver uses only word-wide access to these registers. There
>> might be nothing more sinister in this than the fact that Intel is
>> recommending good programming practice. However, I know what it means when
>> my wife RECOMMENDS that I tackle some chores around the house. It might be
>> that there is in fact a problem with word-wide access, and the driver needs
>> to be rewritten, or seriously massaged.
>>
>> 5) The loop in the wait_for_cmd_done() routine might be too short for very
>> fast boards. I changed the loop count from 100 to 10000. Is this too high,
>> or too low? It seems that this keeps the system more stable, but I don't
>> have any positive proof (yet).
>>
>> 6) Intel documentation states clearly that the CU Start and RU Start should
>> only be executed when the unit is in either the idle or no resources state.
>> This is not always checked. For example, in the ISR, the RxStart command
>> (RX_START in older drivers) is issued without first invoking
>> wait_for_cmd_done(). It seems to me that unless it's 100% sure that the
>> receive unit is idle here, wait_for_cmd_done() should be called. Also as I
>> recall, there are one or two other places in the driver where either the
>> RxStart or CuStart commands are issued without first invoking
>> wait_for_cmd_done().
>>
>> 7) The transmit routine has a somewhat lengthy section of code in which
>> interrupts are disabled. It seems to me that perhaps it would be worthwhile
>> seeing if there is a way to redesign this area to eliminate (or at least
>> shorten the duration of) the interrupts being disabled.
>>
>>
>> Using version 1.05 of the driver, I was able to come up with a stable
>> working version of the driver. This was accomplished by doing the
>> following:
>> - In the speedo_interrupt() routine, I added a delay - udelay(2) - right
>> after reading the interrupt status.
>> - Changed the wait_for_cmd_done() loop to 10000.
>> - Made sure that wait_for_cmd_done() was invoked every place that the
>> RxStart or CuStart commands are issued.
>>
>> I hope that I've contributed some useful ideas and haven't just wasted
>> mailing list bandwidth. I'm continuing my experiments and maybe something
>> will come of all this. I'll keep you all posted.
>>
>> Thanks of course go to Donald Becker. Along with Daniel Veillard, I too
>> find it amazing that just about every NIC driver has Donald's name as the
>> author. Doesn't the guy ever sleep?!
>>
>> Regards,
>>
>> Yisrael (Russ) Hersch
>> Allot Communications
>> yhersch@allot.com
>>
>>
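
For anyone who wants to experiment with points 5 and 6 of the quoted
message, here is a rough user-space sketch of the pattern: poll the command
byte a bounded number of times, and only issue a new command once the
previous one has been accepted. This is NOT the eepro100 driver code. The
struct fake_scb, scb_read_cmd(), issue_command() and the CMD_RX_START value
are hypothetical stand-ins for the real port I/O, and this
wait_for_cmd_done() takes a poll limit and returns a status, which is my own
assumption and not necessarily how the driver declares it.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder opcode for illustration only; not taken from the Intel docs. */
#define CMD_RX_START 0x01

/* Hypothetical stand-in for the chip's command byte. On real hardware this
 * would be a port read; a plain struct lets the sketch run in user space. */
struct fake_scb {
    uint8_t cmd;     /* non-zero while the chip has not yet accepted a command */
    int busy_polls;  /* simulated: reads needed before the command is accepted */
};

/* Simulated register read: each read moves the fake chip one step closer
 * to accepting the pending command, then returns the current command byte. */
static uint8_t scb_read_cmd(struct fake_scb *scb)
{
    if (scb->cmd != 0) {
        if (scb->busy_polls > 0)
            scb->busy_polls--;
        if (scb->busy_polls == 0)
            scb->cmd = 0;  /* command accepted */
    }
    return scb->cmd;
}

/* Poll until the command byte reads zero or max_polls reads have been made.
 * Returns 0 on success and -1 on timeout, so the caller can notice that
 * the limit was too short instead of silently carrying on. */
static int wait_for_cmd_done(struct fake_scb *scb, int max_polls)
{
    assert(max_polls > 0);
    for (int i = 0; i < max_polls; i++) {
        if (scb_read_cmd(scb) == 0)
            return 0;
    }
    return -1;
}

/* Always wait for the previous command before writing a new one. */
static int issue_command(struct fake_scb *scb, uint8_t cmd,
                         int max_polls, int chip_latency)
{
    if (wait_for_cmd_done(scb, max_polls) != 0)
        return -1;               /* chip still busy; do not overwrite */
    scb->cmd = cmd;
    scb->busy_polls = chip_latency;
    return 0;
}
```

With a fake chip that needs 500 reads to accept a command, a limit of 100
times out while a limit of 10000 succeeds, which is at least consistent with
the idea that the original loop count is too short on a fast board. The right
limit on real hardware is still an open question, as the quoted message says.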