[Beowulf] SATA II - PXE+NFS - diskless compute nodes
Simon Kelley
simon at thekelleys.org.uk
Thu Dec 14 15:01:47 PST 2006
Donald Becker wrote:
>>
>>I'm not quite following here: It seems like you might be advocating
>>retransmits every half second. I'm currently doing classical exponential
>>backoff, 1 second delay, then two, then four etc. Will that bite me?
>
>
> Where are you doing exponential back-off?
Re-transmits in the TFTP server: send a block and await the
corresponding ACK; if it doesn't arrive within the timeout, re-send. This is
needed to recover from lost data packets; client retries only recover
from lost ACKs (at least they do in implementations which have been
immunised against sorcerer's-apprentice syndrome).
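
To make the server-side retry concrete, here is a rough sketch of the
loop described above. It is only an illustration, not the actual server
code: send() and wait_for_ack() are hypothetical callables standing in
for the real socket handling, and the 1 second base and five tries are
assumed values.

```python
def send_block_with_backoff(send, wait_for_ack, block_no, base=1.0, max_tries=5):
    """Send one block, re-sending with a doubling timeout until it is ACKed.

    send(block_no) and wait_for_ack(block_no, timeout) are hypothetical
    callables supplied by the caller; wait_for_ack returns True if the
    matching ACK arrived within `timeout` seconds.
    Returns the number of transmissions that were needed.
    """
    timeout = base
    for attempt in range(1, max_tries + 1):
        send(block_no)
        if wait_for_ack(block_no, timeout):
            return attempt
        timeout *= 2  # classical exponential backoff: 1 s, 2 s, 4 s, ...
    raise TimeoutError(f"block {block_no}: no ACK after {max_tries} tries")
```

With a lossy path where the first two data packets vanish, the server
waits 1 second, then 2, and the third transmission gets through.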
> The TFTP client will/should/might do a retry every second. (Background:
> TFTP uses "ACK" of the previous packet to mean "send the next one". The
> only way to detect that this is a retry is timing.) The client might do a
> re-ARP first. In corner cases it might not reply to ARP itself.
>
> [[ Step up on the soapbox. ]]
>
> What idiot thought that exponential backoff was a good idea?
> Exponential backoff doesn't make sense where your base time period is a
> whole second and you can't tell if the reason for no response is
> failure, busy network or no one listening.
>
> My guess is that they were just copying Ethernet, where modified,
> randomized exponential backoff is what makes it magically good.
> Exponential backoff makes sense at the microsecond level, where you have
> a collision domain and potentially 10,000 hosts on a shared ether. Even
> there the idea of "carrier sense" or 'is the network busy' is what
> enables Ethernet to work at 98+% utilization rather than the 18% or 37%
> theoretical of Aloha Net. (Key difference: deaf transmitter.)
>
> What usually happens with DHCP and PXE is that the first packet is used
> getting the NIC to transmit correctly. The second packet is used to get
> the switch to start passing traffic. The third packet gets through but we
> are already well into the exponential backoff.
>
> PXE would be much better and more reliable if it started out
> transmitting a burst of four DHCP packets evenly spaced in the first
> second, then falling back to once per second. If there is a concern
> about DHCP being a high percentage of traffic in huge installations
> running 10baseT, tell them to buy a server. Or, like, you know, a
> router. Because later the ARP traffic alone will dwarf a few DHCP
> broadcasts.
It's probably worth differentiating DHCP and TFTP here. I guess the
reason for exponential backoff is to avoid congestion collapse as the
ratio of useful work to bits-on-the-wire decreases. By the time a host
is doing TFTP the network-path should be established, so bursting
packets shouldn't be needed. Maybe delaying backoff would make sense.
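
For comparison, the two schedules under discussion could be written
down roughly like this. Both functions and their parameter names are
made up for illustration; the hold count of three is an assumption,
not a tested value.

```python
def delayed_backoff(base=1.0, hold=3, tries=6):
    """Retransmit timeouts that stay at `base` for the first `hold`
    tries and only then start doubling (the "delayed backoff" idea)."""
    return [base * 2 ** max(0, i - hold + 1) for i in range(tries)]


def burst_then_steady(burst=4, total=7):
    """Send times in seconds: `burst` packets evenly spaced within the
    first second, then one packet per second (the DHCP proposal)."""
    return [i / burst if i < burst else 1.0 + (i - burst) for i in range(total)]
```

delayed_backoff() gives waits of 1, 1, 1, 2, 4, 8 seconds, while
burst_then_steady() sends at 0, 0.25, 0.5, 0.75, 1, 2, 3 seconds.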
>
>
>>I'm doing round-robin, but I don't see how to throttle active
>>connections: do I need to do that, or just limit total bandwidth?
>
>
> Yes, you need to throttle active TFTP connections. The clients
> currently winning can turn around a next-packet request really quickly.
> If a few get in lock step, the server will have the next chunk of the
> file warm in the cache. This is the start of locking out the first
> loser.
>
> You can't just let the ACKs queue up in the socket as a substitute for
> deferring responses either. You have to pull them out ASAP and mark
> that client as needing a response. This doesn't cost very much. You
> need to keep the client state structure anyway. This is just one more
> bit, plus updating the timeval that you should be keeping anyway.
>
All true. I'll experiment with some throttling approaches.
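
As a starting point, one approach might look like the sketch below:
pull each ACK out immediately, flag the client as needing a response,
and answer flagged clients round-robin with a cap on blocks sent per
pass. The class names, the budget parameter and the send() callback
are all hypothetical, and no real sockets are involved.

```python
from collections import OrderedDict


class Client:
    def __init__(self, addr):
        self.addr = addr
        self.next_block = 1
        self.needs_response = False  # the "one more bit"
        self.last_ack = 0.0          # time the last ACK was seen


class Throttle:
    def __init__(self, budget):
        self.clients = OrderedDict()  # iteration order = service order
        self.budget = budget          # max blocks sent per service() pass

    def on_ack(self, addr, block, now):
        """Called as soon as an ACK is read from the socket."""
        c = self.clients.setdefault(addr, Client(addr))
        c.next_block = block + 1      # ACK of N means "send N+1"
        c.needs_response = True
        c.last_ack = now

    def service(self, send):
        """Answer up to `budget` waiting clients; each served client
        goes to the back of the queue so fast ones can't lock out slow."""
        sent = 0
        for addr in list(self.clients):
            if sent >= self.budget:
                break
            c = self.clients[addr]
            if c.needs_response:
                send(addr, c.next_block)
                c.needs_response = False
                self.clients.move_to_end(addr)
                sent += 1
        return sent
```

With three clients and a budget of two, a client left waiting in one
pass is served first in the next, even if the other two have already
turned around fresh ACKs.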
Cheers,
Simon.
>