Problems with tulip 0.89K p-to-p and simplex link questions
Sam Mosel
samm@vsl.com.au
Thu Sep 24 09:31:30 1998
Greetings,
Here's something to get your teeth into...
I have two machines directly connected via their secondary NICs. The
NICs are Digital DE-500-FA 100Mbps Fibre cards. They have a separate
fibre for Tx & Rx, and I have them crossed over. These are of course
tulip cards. Both machines are running RedHat Linux 5.1, kernel version
2.1.119, with the latest updates to glibc and other important stuff.
The tulip driver is 0.89K, compiled as a module. The primary NICs in
both are 3Com 3c905 (NOT 3c905B).
The tulips are not operating correctly. Here are some details.
Start with this:
# insmod /usr/src/linux/drivers/net/tulip-0.89K.o options=8 debug=6
# ifup eth1
Both the link & data lights light up for about 10 seconds, then I get
this on the console:
eth1: The transmitter stopped! CSR5 is f0678006, CSR6 b3862002.
I then waited a minute or so, and dmesg output is this:
Found Digital DS21143 Tulip at PCI I/O address 0xb800.
tulip.c:v0.89K 8/8/98 becker@cesdis.gsfc.nasa.gov
eth1: Digital DS21143 Tulip at 0xb800, 00 00 f8 08 a8 bb, IRQ 11.
read_eeprom:
1011 500f 0000 0000 0000 0000 0000 0000
0049 0103 0000 08f8 bba8 4100 4400 3545
3030 462d 2341 0008 0000 0000 0000 0000
ac00 00ac 0000 0000 0000 0000 0000 0000
0700 0200 0488 af07 0508 2100 8880 0804
08af 0005 8021 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 d6e0
eth1: EEPROM default media type 100baseFx.
eth1: Index #0 - Media 100baseFx (#7) described by a 21143 SYM PHY (4) block.
eth1: Index #1 - Media 100baseFx-FD (#8) described by a 21143 SYM PHY (4) block.
eth1: Checking for MII transceivers...
eth1: tulip_open() irq 11.
eth1: Using user-specified media 100baseFx-FD.
eth1: 21143 non-MII 100baseFx-FD transceiver control 08af/0005.
eth1: Using media type 100baseFx-FD, CSR12 is c6.
eth1: Done tulip_open(), CSR0 ffa04800, CSR5 f0360000 CSR6 b2862202. <-- note 1
eth1: interrupt csr5=0xf0670004 new csr5=0xf0660000.
eth1: interrupt csr5=0xf0660000 new csr5=0xf0660000.
eth1: exiting interrupt, csr5=0xf0660000.
eth1: interrupt csr5=0xf0670004 new csr5=0xf0660000.
eth1: interrupt csr5=0xf0660000 new csr5=0xf0660000.
eth1: exiting interrupt, csr5=0xf0660000.
eth1: interrupt csr5=0xf0670004 new csr5=0xf0660000.
eth1: interrupt csr5=0xf0660000 new csr5=0xf0660000.
eth1: exiting interrupt, csr5=0xf0660000.
eth1: interrupt csr5=0xf0670004 new csr5=0xf0660000.
eth1: interrupt csr5=0xf0660000 new csr5=0xf0660000.
eth1: exiting interrupt, csr5=0xf0660000.
eth1: interrupt csr5=0xf0670004 new csr5=0xf0660000.
eth1: interrupt csr5=0xf0660000 new csr5=0xf0660000.
eth1: exiting interrupt, csr5=0xf0660000.
eth1: 21143 negotiation status 000000c6, 100baseFx-FD. <-- note 2
eth1: 21143 negotiation failed, status 000000c6.
eth1: Testing new 21143 media 100baseTx. <-- note 3
eth1: interrupt csr5=0xf0678006 new csr5=0xf0660000.
eth1: The transmitter stopped! CSR5 is f0678006, CSR6 b3862002. <-- note 4
eth1: interrupt csr5=0xf0670004 new csr5=0xf0660000.
eth1: interrupt csr5=0xf0660000 new csr5=0xf0660000.
eth1: exiting interrupt, csr5=0xf0660000.
eth1: 21143 negotiation status 000000c6, 100baseTx. <-- note 5
eth1: interrupt csr5=0xf0008102 new csr5=0xf0000000.
eth1: The transmitter stopped! CSR5 is f0008102, CSR6 b2420200.
eth1: interrupt csr5=0xf0670004 new csr5=0xf0660000.
eth1: interrupt csr5=0xf0660000 new csr5=0xf0660000.
eth1: exiting interrupt, csr5=0xf0660000.
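As a cross-check on the read_eeprom dump in the log above, here is a minimal Python sketch that unpacks the first five rows of words into bytes. It assumes each 16-bit word is printed with its low byte stored first, that the station address sits at byte offset 20, and that byte 27 holds the offset of the media leaf (the ee_data[27] referred to in the notes). The names eeprom_bytes, leaf and block_count are my own for illustration and do not come from the driver.

```python
import struct

# First five rows of the read_eeprom dump, copied verbatim from the log.
EEPROM_WORDS = """
1011 500f 0000 0000 0000 0000 0000 0000
0049 0103 0000 08f8 bba8 4100 4400 3545
3030 462d 2341 0008 0000 0000 0000 0000
ac00 00ac 0000 0000 0000 0000 0000 0000
0700 0200 0488 af07 0508 2100 8880 0804
""".split()


def eeprom_bytes(words):
    """Convert the printed 16-bit words to a byte string (low byte first)."""
    return b"".join(struct.pack("<H", int(w, 16)) for w in words)


data = eeprom_bytes(EEPROM_WORDS)

# Assumed layout: station address at bytes 20..25, leaf offset at byte 27.
mac = data[20:26]
leaf = data[27]

# Assumed leaf layout: 16-bit default media word, then a block count byte.
default_media = struct.unpack_from("<H", data, leaf)[0] & 0x0F
block_count = data[leaf + 2]

print(" ".join(f"{b:02x}" for b in mac))  # 00 00 f8 08 a8 bb
print(hex(leaf))                          # 0x41
print(default_media, block_count)         # 7 2
```

Under these assumptions the result agrees with the log: the same station address, default media #7 (100baseFx) and two media blocks (Index #0 and Index #1).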
Notes:
1.
CSR0 looks okay (CAL==8, PBC==8)
CSR5 also looks okay (TS==running, RS==running)
CSR6 looks mostly okay, PCS set (no MII, as expected), ST & SR set
but RA is also set - doesn't this mean promiscuous mode?
2.
status==CSR12, ANS disabled, as expected since NWay failed.
LS10 & LS100 failed as expected, but these are the
hard-wired twisted-pair values; is there an equivalent
for Fx?
Should NWay work with a point-to-point configuration like this (tulip
<-> tulip direct)?
3.
I am assuming that the else clause on line 1830 is passing and the
code on lines 1832 - 1837 is being executed. Line 1835 I see disables
NWay as expected given that it has failed. It then hard-codes
100baseTx and writes CSR6 as set on line 1832 and CSR12 to reset the
port activity indicators. Should line 1833 actually be:
dev->if_port = saved_port;
where saved_port is calculated in the EEPROM parse and media select
routines (in my case, saved_port = (void *)ee_data + ee_data[27] from
line 971, masked down to the 4 LSBs on line 1401)?
If this is the case then line 1835 should probably reflect the
possibility that 100base-Fx is to be tested for. See below for a
question about additional bits in this register to deal with
additional media types.
4.
Yes, the transmitter has stopped. TS==suspended, RS==running, AIS &
TPS are set as expected, but why is TU set (i.e. Transmit buffer
unavailable)?
5.
Looks like we're trying 100baseTx again. I thought NWay was disabled
in line 1835 (see note 3, above). Why are we in this function again?
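The CSR5 readings in notes 1 and 4 can be reproduced with a few shifts and masks. This is a minimal Python sketch, assuming my reading of the 21x4x CSR5 layout: TS in bits 22:20, RS in bits 19:17, and the flag bits listed in the table. Only the states I am sure of are named; anything else is reported as "running". decode_csr5 is a hypothetical helper, not something from the driver, and the bit positions should be checked against the Reference Manual.

```python
# Flag bits assumed from the 21x4x CSR5 (status register) layout.
CSR5_FLAGS = {
    0: "TI",    # transmit interrupt
    1: "TPS",   # transmit process stopped
    2: "TU",    # transmit buffer unavailable
    5: "UNF",   # transmit underflow
    6: "RI",    # receive interrupt
    7: "RU",    # receive buffer unavailable
    8: "RPS",   # receive process stopped
    15: "AIS",  # abnormal interrupt summary
    16: "NIS",  # normal interrupt summary
}

TX_STATE = {0: "stopped", 6: "suspended"}
RX_STATE = {0: "stopped", 4: "suspended"}


def decode_csr5(value):
    """Split a CSR5 value into TS, RS and the names of the set flag bits."""
    ts = (value >> 20) & 7
    rs = (value >> 17) & 7
    flags = [name for bit, name in sorted(CSR5_FLAGS.items())
             if value & (1 << bit)]
    return {
        "TS": ts, "TS_name": TX_STATE.get(ts, "running"),
        "RS": rs, "RS_name": RX_STATE.get(rs, "running"),
        "flags": flags,
    }


for csr5 in (0xf0360000, 0xf0678006, 0xf0008102):
    print(hex(csr5), decode_csr5(csr5))
```

With these assumptions, 0xf0678006 decodes to TS==suspended, RS==running with TPS, TU, AIS and NIS set, and 0xf0008102 decodes to both processes stopped with TPS, RPS and AIS set.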
------
Take the interface down, console output as follows:
# ifdown eth1
Rx ring c0d69810: 80000000 80000000 80000000 80000000 80000000 80000000 80000000 80000000 80000000 80000000 80000000 80000000 80000000 80000000 80000000 80000000 80000000 80000000 80000000 80000000 80000000 80000000 80000000 80000000 80000000 80000000 80000000 80000000 80000000 80000000 80000000 80000000
Tx ring c0d69a10: 7fffffff 7fffffff 7fffffff 7fffffff 7fffffff 7fffffff 7fffffff 7fffffff 7fffffff 7fffffff 7fffffff 00000000 00000000 00000000 00000000 00000000
Tulip-diag output is then:
# tulip-diag -aem -p 0xb800
tulip-diag.c:v1.05 8/28/98 Donald Becker (becker@cesdis.gsfc.nasa.gov)
Digital DC21040 Tulip Tulip chip registers at 0xb800:
ffa04800 ffffffff ffffffff 00d69810 00d69a10 f0000102 b3860000 f3fe0000
e0000000 fff483ff ffffffff fffe0000 000020c6 ffff0001 fffbffff 8ff00008
The Rx process state is 'Stopped'.
The Tx process state is 'Stopped'.
Transmit stopped, Receive stopped, half-duplex.
The transmit threshold is 128.
Port selection is 100mbps-SYM/PCS 100baseTx scrambler, half-duplex.
***WARNING***: No MII transceivers found!
There's a big pause (5 secs) before the ***WARNING***.
I notice that the NIC is in the mode specified by new_csr6 set in line
1832 (see note 3, above). I assume it has fallen back to this due to
NWay failure.
rmmod the driver, then insmod it again as before:
# rmmod tulip-0.89K
# insmod tulip-0.89K.o options=8 debug=6
Now look at the tulip-diag output:
# tulip-diag -aem -p 0xb800
tulip-diag.c:v1.05 8/28/98 Donald Becker (becker@cesdis.gsfc.nasa.gov)
Digital DC21040 Tulip Tulip chip registers at 0xb800:
ffa04800 ffffffff ffffffff 00d69810 00d69a10 f0000102 b2420200 f3fe0000
e0000000 fff583ff ffffffff fffe0000 000020c6 ffff0001 fffbffff 8ff00008
The Rx process state is 'Stopped'.
The Tx process state is 'Stopped'.
Transmit stopped, Receive stopped, full-duplex.
The transmit threshold is 72.
Port selection is 10mpbs-serial, full-duplex.
***WARNING***: No MII transceivers found!
Again the pause before the warning.
Port selection is 10mbps-serial ???? What is going on here?
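To check tulip-diag's reading of CSR6 by hand, a few bits can be picked out directly. This is a minimal Python sketch, assuming my reading of the 21x4x CSR6 layout (SR bit 1, FD bit 9, ST bit 13, PS bit 18, TTM bit 22, PCS bit 23, SCR bit 24). decode_csr6 is a hypothetical helper, not part of the driver or of tulip-diag, and it deliberately ignores the other bits, including the transmit threshold field.

```python
# Bit positions assumed from the 21x4x CSR6 (operating mode) layout.
CSR6_BITS = {
    "SR": 1,    # start/stop receive
    "FD": 9,    # full duplex
    "ST": 13,   # start/stop transmit
    "PS": 18,   # port select (set for the 100Mbps port)
    "TTM": 22,  # transmit threshold mode
    "PCS": 23,  # PCS function
    "SCR": 24,  # scrambler
}


def decode_csr6(value):
    """Return a dict mapping each known CSR6 bit name to True/False."""
    return {name: bool(value & (1 << bit)) for name, bit in CSR6_BITS.items()}


# CSR6 after tulip_open(), after ifdown, and after reloading the module.
for csr6 in (0xb2862202, 0xb3860000, 0xb2420200):
    bits = decode_csr6(csr6)
    print(hex(csr6), [name for name, is_set in bits.items() if is_set])
```

Under these assumptions, 0xb3860000 has PS, PCS and SCR set with FD clear, matching the "100mbps-SYM/PCS 100baseTx scrambler, half-duplex" line, while 0xb2420200 has PS clear and FD set, matching the 10mbps-serial full-duplex line. So tulip-diag appears to be reporting the register faithfully; the open question is why the register holds that value after the module is reloaded.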
Driver version 0.83 and a later version (0.89F?) work somewhat, but I
thought it better to use a later version on the assumption that it
would be more stable, better tested and therefore less buggy. The
0.83 and 0.89F? drivers regularly lock up on me (sorry, I haven't done
a full analysis of the failure mode, but I can investigate if it will
help).
------
Now the real reason I have the machines direct-connected:
I want to use it in simplex mode (i.e. implement a "Data Diode" type
arrangement with no reverse channel for data - this is an absolute
requirement that cannot be changed). I have looked through the older
driver code and made some changes, none of which seem to allow the
transmitter to operate without a carrier on the Rx. I get Carrier
Errors (surprise, surprise) whenever I try to transmit a packet.
The current solution is to utilise _another_ tulip Fx NIC (eth2) on
the transmitter machine simply to provide a carrier to the real Tx NIC
(eth1). These are not cheap NICs (~AU$350), so I'd really like to
remove the carrier NIC if possible.
This can be tackled only once I have solved the problem detailed above
of actually getting the 2 NICs to talk to each other.
I made the changes below to an earlier version (0.89F, I believe) but
will reference the line numbers in 0.89K to save digging up the old
code.
Firstly, I tried changing the initialisation array (lines 309 & 310 in
0.89K) to reference tulip_timer and set up CSR7 to be 0x001ebef as in
the 21040/41 case. My assumption here is that this will disable NWay
autoneg, which will likely fail with a simplex link. This made no
difference and the transmitter still failed (carrier errors reported in
the ifconfig and tulip debug output).
After that I also tried clearing LTE in CSR14 on line 1835 (although
in retrospect I realise this is probably insufficient, see below).
What I believe I have to do is some combination of the following:
1. Disable link failure interrupts.
This should be achievable by clearing CSR7:12, but this is already
done in the above initialiser 0x001ebef.
2. Force selection of 100Base-Fx media without checking for link beat.
Change line 1833 to:
new_csr6 = 7 or 8 for FD (I can hard-code for my project)
Change line 1835 to something which will force 100base-Fx. My copy
of the Reference Manual doesn't contain the appropriate values (I
assume there are some additionally defined bits above bit 18 in CSR14?).
Also definitely clear bit 12 (Link Test Enable).
Would this be sufficient?
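The two bit manipulations in the list above are simple enough to check by hand. A minimal Python sketch, assuming bit 12 of CSR7 is the link-fail interrupt enable and bit 12 of CSR14 is Link Test Enable, as described above. The CSR14 starting value here is an arbitrary placeholder for illustration, not the value the driver writes on line 1835.

```python
LINK_FAIL_BIT = 12  # CSR7:12, link-fail interrupt enable (per the text above)
LTE_BIT = 12        # CSR14 bit 12, Link Test Enable (per the text above)


def bit_is_set(value, bit):
    """True if the given bit number is set in value."""
    return bool(value & (1 << bit))


def clear_bit(value, bit):
    """Return value with the given bit number cleared."""
    return value & ~(1 << bit)


CSR7_INIT = 0x001ebef  # initialiser value quoted above
print(bit_is_set(CSR7_INIT, LINK_FAIL_BIT))   # False

csr14_example = 0xffff  # hypothetical placeholder, NOT the driver's value
print(hex(clear_bit(csr14_example, LTE_BIT)))  # 0xefff
```

This confirms that the 0x001ebef initialiser already has bit 12 clear, so item 1 adds nothing beyond what that change did.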
Apologies for the length of this mail.
TIA,
--
Regards,
Sam.
(samm at vsl dot com dot au)
Senior Software Engineer,
Vision Abell Pty. Ltd.
http://www.vsl.com.au/abell/
----------------------------------------------------------------------
It is very easy to be blinded to the essential uselessness of them by
the sense of achievement you get from getting them to work at all.
-- The Hitch-Hiker's Guide to the Galaxy, about the
products of the Sirius Cybernetics Corporation.