[Beowulf] [EXTERNAL] Re: IB vs. Ethernet
Lux, Jim (US 430E)
james.p.lux at jpl.nasa.gov
Thu Feb 26 00:20:52 UTC 2026
-----Original Message-----
From: Beowulf <beowulf-bounces at beowulf.org> On Behalf Of Lawrence Stewart
Sent: Saturday, February 21, 2026 4:34 AM
To: beowulf at beowulf.org
Cc: Lawrence Stewart <stewart at serissa.com>
Subject: [EXTERNAL] Re: [Beowulf] IB vs. Ethernet
> On Feb 21, 2026, at 3:28 AM, Greg Lindahl <lindahl at pbm.com> wrote:
>
> On Thu, Jan 15, 2026 at 08:28:36PM -0500, Lawrence Stewart wrote:
>
>> I think a 64 byte store at a core should directly become a packet. No on-die-network, no coherence, no root complex, no host-fabric adapter. Incoming short messages should be delivered directly to a fifo in the relevant core.
>
> I think that's a great idea!
>
> — greg
>
As Greg, I think, is hinting, this idea was a thing that QLogic HFIs did, using the core write-combining buffers to good effect. It seems like it is also the basic idea behind MOVDIR64B, which specifies that a 64-byte write will be atomic all the way down.
Using core registers for messaging is much older: Transputers, Tilera, Dally's J-Machine, and arguably the Cray E-registers.
What this is really about is end-to-end latency. We've been stuck at 1 microsecond since the Cray T3D 30 years ago, in spite of 100x improvements in link speed. If we can eliminate all the middlemen and get switches back to 50 ns forwarding, I think we should be able to get 300 ns end-to-end in a good-sized system.
-Larry
Indeed, I suspect the 1 microsecond probably ties to something else that was convenient: if you're not running parallel wires (lanes), then sending 1000 bits at 1 Gbps takes 1 microsecond.
And if the actual link gets faster, the messages get bigger, so they still take 1 microsecond.
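A quick back-of-the-envelope check of that scaling (these are illustrative numbers, not measurements):

```python
# Serialization time on a single serial lane: if link rate and message
# size both grow 100x, the wire time stays pinned at 1 microsecond.
def wire_time_ns(bits: int, rate_gbps: float) -> float:
    """Time in nanoseconds to clock `bits` out at `rate_gbps` on one lane."""
    return bits / rate_gbps  # bits / (Gbit/s) comes out in ns

print(wire_time_ns(1_000, 1.0))      # 1000 bits at 1 Gbps   -> 1000.0 ns
print(wire_time_ns(100_000, 100.0))  # 100x bits at 100 Gbps -> 1000.0 ns
```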
There are some practical issues. As your symbol rate gets higher on the wire, things like impedance discontinuities causing reflections become more important. You have a transition from die to package, one from package to board, one from board to connector/cable, and those all sit at ~1-10 ns kinds of time scales. Stack all those up and it can take a long time for the cascade of reflections to die out.
The fix, today, is to put in equalizers (preferably adaptive equalizers) that essentially "undistort" the waveform. But those equalizers have to look at many symbol times to work (typically they're implemented as a tapped delay line with a weight on each tap, summed: an FIR filter), which means your first bit out is delayed by however many symbols are in the filter's delay line. I suspect that for "commodity" hardware, there's a particular length of delay line that is long enough to accommodate all possible wiring configurations.
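To make the latency point concrete, here's a toy tapped-delay-line FIR (a sketch, not any particular PHY's equalizer): the output can't start until samples have marched through the line, so an N-tap filter adds roughly N symbol times before the first useful bit emerges.

```python
# Toy FIR equalizer as a tapped delay line: each output sample is the
# weighted sum of the last len(taps) input samples.
def fir_equalize(samples, taps):
    delay = [0.0] * len(taps)  # the delay line, newest sample first
    out = []
    for s in samples:
        delay = [s] + delay[:-1]  # shift the new sample into the line
        out.append(sum(w * d for w, d in zip(taps, delay)))
    return out

# A single unit tap is the identity: no equalization, no added delay.
print(fir_equalize([1.0, 0.5, 0.25], [1.0]))       # [1.0, 0.5, 0.25]
# Putting the unit weight one tap deeper delays output by one symbol,
# which is exactly the latency cost the text describes.
print(fir_equalize([1.0, 2.0, 3.0], [0.0, 1.0]))   # [0.0, 1.0, 2.0]
```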
Let's look at Ethernet. The maximum Ethernet run for GigE is 100 meters, which, not so oddly, is about 500 ns long (propagation speed is ~0.66c due to the dielectric and the capacitance/inductance of the twisted pair). So the time for a reflection to get back to the sending end is, hmmm, 1 microsecond.