[Beowulf] [EXTERNAL] IB vs. Ethernet
Lawrence Stewart
stewart at serissa.com
Thu Feb 26 01:39:14 UTC 2026
Arista has published 10G latency measurements for QSFP-based copper and optical links from 1 to 6 meters.
Copper latency looks like about 5 ns per meter while optical is a little slower for short cables and a little faster for long ones.
For 400G link modules, apparently you can use “analog” optical transceivers with 20 ns delays plus fiber delay up to 100 meters. You can also use DSP-based ones, which could add about 100 ns.
The Optical Analog/Clock and Data Recovery cables are much lower latency than the Active Optical Cables with retimers in them and perhaps equalizers.
For connections within a rack, you can also use Direct Attach Copper, which is just a twinax parallel cable, up to about 5 meters. Or there are Active Electrical Cables with equalizers that are a bit slower.
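To make the arithmetic concrete, here is a rough Python sketch that adds module delay to cable propagation delay. The 5 ns/m, 20 ns, and 100 ns figures are the approximate ones above. Applying 5 ns/m to fiber, treating passive DAC as zero added delay, and counting the module delay once per link are my assumptions, and the names are purely illustrative.

```python
# Rough one-way link latency: module delay plus cable propagation delay.
# Figures are the approximate ones quoted in this post, not vendor specs.

NS_PER_METER = 5.0  # ~5 ns/m for copper; assumed similar for fiber

# Approximate added delay per link type, in ns.
# Passive DAC is taken as 0 ns by assumption.
MODULE_DELAY_NS = {
    "dac": 0.0,
    "analog_optical": 20.0,
    "dsp_optical": 100.0,
}


def link_latency_ns(kind: str, length_m: float) -> float:
    """Estimate one-way latency in ns for a link of the given kind and length."""
    return MODULE_DELAY_NS[kind] + NS_PER_METER * length_m


if __name__ == "__main__":
    for kind, length_m in [("dac", 3), ("analog_optical", 100), ("dsp_optical", 100)]:
        print(f"{kind:>15} {length_m:>4} m: {link_latency_ns(kind, length_m):6.0f} ns")
```

Under these assumptions, at 100 meters the fiber itself (about 500 ns) dominates either kind of transceiver, while inside a rack the module delay is most of the total.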
The price tags for the optical 400G cables are eye-popping.
I realize that most AI work is bandwidth-focussed, and a microsecond is fine, but I have a soft spot for SHMEM 8 byte puts and gets, and there is always a role for Barrier and small AllGathers.
-L
> On Feb 25, 2026, at 19:20, Lux, Jim (US 430E) <james.p.lux at jpl.nasa.gov> wrote:
>
>
>
> -----Original Message-----
> From: Beowulf <beowulf-bounces at beowulf.org> On Behalf Of Lawrence Stewart
> Sent: Saturday, February 21, 2026 4:34 AM
> To: beowulf at beowulf.org
> Cc: Lawrence Stewart <stewart at serissa.com>
> Subject: [EXTERNAL] Re: [Beowulf] IB vs. Ethernet
>
>
>
>> On Feb 21, 2026, at 3:28 AM, Greg Lindahl <lindahl at pbm.com> wrote:
>>
>> On Thu, Jan 15, 2026 at 08:28:36PM -0500, Lawrence Stewart wrote:
>>
>>> I think a 64 byte store at a core should directly become a packet. No on-die-network, no coherence, no root complex, no host-fabric adapter. Incoming short messages should be delivered directly to a fifo in the relevant core.
>>
>> I think that's a great idea!
>>
>> — greg
>>
>
>
> As Greg, I think, is hinting, this idea was a thing that QLogic HFIs did, using the core write combining buffers to good effect. It seems like it is also the basic idea behind MOVDIR64B, which specifies that a 64 byte write will be atomic all the way down.
>
> Using core registers for messaging is much older, with Transputers, Tilera, Dally’s J Machine and arguably Cray E-registers.
>
> What this is really about is end to end latency. We’ve been stuck at 1 microsecond since the Cray T3D 30 years ago, in spite of 100x improvements in link speed. If we can eliminate all the middlemen and get switches back to 50 ns forwarding, I think we should be able to get 300 ns end to end in a good size system.
>
> -Larry
>
>
> Indeed, I suspect the 1 microsecond probably ties to something else that was convenient - If you're not running parallel wires (lanes) then sending 1000 bits at 1Gbps takes 1 microsecond.
>
> And if the actual link gets faster, the messages get bigger, so that they still take 1 microsecond.
>
> There are some practical issues - As your symbol rate gets higher on the wire, things like impedance discontinuities causing reflections become more important. You have a transition from die to package, one from package to board, one from board to connector/cable. And those all have ~1-10 ns kind of time scales. Stack all those up and it can take a long time for the cascade of reflections to die out.
>
> The fix, today, is to put equalizers (preferably adaptive equalizers) that essentially "undistort" the waveform. But those equalizers have to look at many symbol times to work (typically, they're implemented as a tapped delay line with weights on each tap and summed - a FIR filter), which then means that your first bit out is delayed by however many symbols are in the filter's delay line. I suspect that for "commodity" hardware, there's a particular length of delay line that is long enough to accommodate all possible wiring configurations.
>
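A minimal sketch of the tapped delay line described above, in Python, to show where the delay comes from. The tap weights, the 7-tap length, and the 25 GBd symbol rate are all made up for illustration and are not taken from any real transceiver.

```python
# Tapped-delay-line (FIR) equalizer sketch. Tap weights are made-up examples.


def fir_filter(samples, taps):
    """Return the FIR output y[n] = sum_k taps[k] * x[n-k], with x[n<0] = 0."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, w in enumerate(taps):
            if n - k >= 0:
                acc += w * samples[n - k]
        out.append(acc)
    return out


def equalizer_delay_symbols(taps):
    """Delay, in symbols, from an input symbol to its strongest output contribution."""
    impulse = [1.0] + [0.0] * (len(taps) - 1)
    response = fir_filter(impulse, taps)
    return max(range(len(response)), key=lambda i: abs(response[i]))


# Hypothetical 7-tap equalizer: 3 pre-cursor taps, main cursor, 3 post-cursor taps.
TAPS = [0.02, -0.05, -0.15, 1.0, -0.25, 0.08, -0.03]

# Hypothetical symbol rate of 25 GBd, i.e. 0.04 ns per symbol.
SYMBOL_TIME_NS = 1.0 / 25.0

if __name__ == "__main__":
    d = equalizer_delay_symbols(TAPS)
    print(f"main-tap delay: {d} symbols = {d * SYMBOL_TIME_NS:.2f} ns")
```

In this sketch the delay is the number of pre-cursor taps times the symbol time, because a symbol has to shift that far down the line before it reaches the main tap. A longer delay line or a slower symbol rate scales the delay proportionally.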
> Let's look at Ethernet - the maximum ethernet run for GigE is 100 meters, which not so oddly, is about 500 ns long (propagation speed is ~0.66c due to the dielectric and capacitance/inductance of the twisted pair). So the time for a reflection to get back to the sending end is, hmmm, 1 microsecond.
>
>
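For what it's worth, the two 1 microsecond figures quoted above are easy to check. This is a back-of-envelope Python sketch; the function names are just illustrative, and the 0.66 velocity factor is the approximate one given in the quoted text.

```python
# Back-of-envelope checks for the two "1 microsecond" coincidences quoted above.

C_M_PER_S = 299_792_458.0  # speed of light in vacuum, m/s


def serialization_time_us(bits: int, rate_bps: float) -> float:
    """Time to clock `bits` onto a single serial lane at `rate_bps`, in microseconds."""
    return bits / rate_bps * 1e6


def propagation_time_ns(length_m: float, velocity_factor: float) -> float:
    """One-way propagation time over `length_m` at `velocity_factor` * c, in ns."""
    return length_m / (velocity_factor * C_M_PER_S) * 1e9


if __name__ == "__main__":
    print(f"1000 bits at 1 Gb/s: {serialization_time_us(1000, 1e9):.2f} us")
    one_way = propagation_time_ns(100, 0.66)
    print(f"100 m at 0.66c: {one_way:.0f} ns one way, {2 * one_way:.0f} ns round trip")
```

With those inputs the one-way time over 100 meters comes out at about 505 ns, so the round trip is just over 1 microsecond, matching the estimate in the quoted message.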