I have just finished reading the archives re: what appears to be a rather frustrating issue (Transmitter Timeout). The fact that I was reading the archive should be clue enough that it has raised its head here as well. I wanted to pass on some info to the list and see if it helps any of you working the issue.
We have two 4-node Linux clusters built from Dell PowerEdge 2300/2400s: dual 500 MHz CPUs, PERC II/SC RAID, 1 GB RAM, and two Intel 82557 NICs in every box. The first cluster, in the US, has NEVER seen the timeout problem and has been operational for over a year now. However, our most recent deployment of an identical cluster in Asia is seeing it on a regular basis. All systems currently run an in-house compiled 2.2.14 SMP kernel and use eepro100.c v1.06 as a loadable module. These have been diff'ed many times to verify that everything is the same everywhere.
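For anyone who wants to double-check the same thing on their own boxes, here is roughly how we compare the binaries across nodes (the hostnames and paths below are placeholders, not our real layout; substitute ssh for rsh and your own kernel version as appropriate):

    #!/bin/sh
    # Compare the kernel image and eepro100 module checksums on all nodes.
    # node1..node4 and the file paths are examples only.
    for h in node1 node2 node3 node4; do
        echo "== $h =="
        rsh $h md5sum /boot/vmlinuz-2.2.14 /lib/modules/2.2.14/net/eepro100.o
    done

Identical checksums everywhere are what make us confident the difference is in the environment rather than the software.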
The two main differences I have identified are:
First, the working cluster talks to Cisco gear while the other talks to 3Com gear. To get everything working properly in the States (Cisco gear) we are disabling auto-negotiation and forcing 100 Mbit full duplex (options=0x30,0x30 in conf.modules). We are doing the same in Asia, but this does not appear to be helping.
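For completeness, this is roughly what the relevant part of our /etc/conf.modules looks like (the alias lines are from memory and may differ slightly per distribution; the options line is the part that matters):

    # Bind both 82557 interfaces to the eepro100 driver and force
    # 100 Mbit full duplex on each, as described above.
    alias eth0 eepro100
    alias eth1 eepro100
    options eepro100 options=0x30,0x30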
Second, the hardware in the States is slightly older: Dell 2300s with a 32-bit PCI backplane, while the hardware in Asia is the newer Dell 2400s, which have both 32-bit and 64-bit PCI slots. The Intel 82557s are in the 32-bit slots.
The interesting thing is that eth0 (which goes to a 3Com switch and then into the core) has never had the problem in Asia, while eth1, which goes directly to the core and is configured as a private VLAN for inter-box communication, is seeing the problem. (Note: I am not completely familiar with the details of this configuration; I am repeating what the networking guys have said.)
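In case it helps anyone reproduce the observation, this is roughly how we watch for the problem (the exact kernel message text is from memory of the driver source and may differ by a word or two):

    # TX error counters per interface (column layout per 2.2 /proc/net/dev)
    grep eth1 /proc/net/dev
    # Driver complaints in the kernel log
    dmesg | grep -i 'transmit timed out'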
My biggest problem is that I have not been able to find a sufficient workaround. ifup/ifdown does basically nothing; the TX error counters continue to show the same error count after the interface is re-enabled. Also, I can't very easily rmmod, since that would require me to down both interfaces under script control, which makes me slightly nervous since the console is about 7000 miles away from here.
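If I do end up having to try it, my current thinking is something along these lines (untested as written, so treat it as a sketch): the whole thing has to run detached, because the remote session rides on eth0 and will drop the moment that interface goes down.

    #!/bin/sh
    # reload-eepro100.sh -- down both interfaces, reload the module with
    # the forced-speed options, and bring everything back up.  Run it
    # with nohup (or from cron/at) so it survives losing the login session.
    (
        ifdown eth1
        ifdown eth0
        rmmod eepro100
        sleep 2
        insmod eepro100 options=0x30,0x30
        ifup eth0
        ifup eth1
    ) </dev/null >/var/tmp/reload-eepro100.log 2>&1 &

An at(1) job that reboots the box a few minutes later, cancelled once you can log back in, would be a reasonable safety net given the distance to the console.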
If anyone has any suggestions as to what I should try, what additional information might be helpful, etc., it would be most appreciated. I am supposed to turn this on live in a week. Considering the private VLAN (eth1) is the core of the inter-box communication (see http://www.linuxvirtualserver.org ) and NFS mounting, I am pretty much screwed if this cannot be made to work like things here in the US.
Thanks in advance, and I apologize for the excessive length, but I wanted to cover as much as possible in one place.
Thanks again.
Paul Walker