[Beowulf] Fault tolerance & scaling up clusters (was Re: Bright Cluster Manager)
John Hearns
hearnsj at googlemail.com
Thu May 17 07:16:08 PDT 2018
Roland, the OpenHPC integration IS interesting.
I am on the OpenHPC list and look forward to the announcement there.
On 17 May 2018 at 15:00, Roland Fehrenbacher <rf at q-leap.de> wrote:
> >>>>> "J" == Lux, Jim (337K) <james.p.lux at jpl.nasa.gov> writes:
>
> J> The reason I hadn't looked at "diskless boot from a
> J> server" is the size of the image - assume you don't have a high
> J> bandwidth or reliable link.
>
> This is not something to worry about with Qlustar. A (compressed)
> Qlustar 10.0 image containing e.g. the core OS + Slurm + OFED + Lustre is
> a mere 165MB to transfer from the head node to a node (eating 420MB of
> RAM once unpacked as the OS on the node). Qlustar (and its non-public
> ancestors) have never used anything but RAM disks (with real disks for
> scratch); the first cluster running this way, at the end of 2001, was on
> Athlons ... and 100MB of eaten-up RAM still mattered a lot
> at that time :)
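>
> As a rough illustration of what that size means on a modest link (the
> 165MB / 420MB figures are the ones above; the link speeds are
> assumptions), a quick Python back-of-the-envelope:
>
>     IMAGE_MB = 165      # compressed image shipped from the head node
>     UNPACKED_MB = 420   # RAM consumed once unpacked as the node OS
>
>     for name, mbit_s in {"100 Mbit/s": 100, "1 Gbit/s": 1000,
>                          "10 Gbit/s": 10000}.items():
>         seconds = IMAGE_MB * 8 / mbit_s
>         print(f"{name}: ~{seconds:.1f}s to transfer, "
>               f"{UNPACKED_MB} MB resident afterwards")
>
> Even at 100 Mbit/s the transfer takes on the order of ten seconds.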
>
> So over the years we perfected our image build mechanism to achieve a
> close-to-minimal (size-wise) OS, minimal in the sense of: given the
> required functionality (wanted kernel modules, services, binaries/scripts,
> libs), generate an image (module) of minimal size that provides it. That
> is maximally light-weight by definition.
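>
> A toy sketch of the general idea in Python (not the actual Qlustar build
> mechanism, and the binary list is only an example): start from the
> binaries wanted on the node and pull in just the shared libraries ldd
> says they need. A real builder also has to handle kernel modules,
> services, config files and scripts.
>
>     import re
>     import subprocess
>
>     def shared_libs(binary):
>         """Library paths that ldd resolves for one binary."""
>         out = subprocess.run(["ldd", binary], capture_output=True, text=True)
>         # ldd lines look like: libc.so.6 => /lib/.../libc.so.6 (0x...)
>         return set(re.findall(r"=>\s+(\S+)\s+\(", out.stdout))
>
>     def minimal_file_set(wanted_binaries):
>         files = set(wanted_binaries)
>         for b in wanted_binaries:
>             files |= shared_libs(b)
>         return sorted(files)
>
>     for path in minimal_file_set(["/bin/bash", "/usr/bin/ssh"]):
>         print(path)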
>
> Yes, I know, you'll probably say "well, but it's just Ubuntu ...". Not for
> much longer though: CentOS support (incl. OpenHPC integration) is coming
> very soon ... and it's all open-source and free.
>
> Best,
>
> Roland
>
> -------
> https://www.q-leap.com / https://qlustar.com
> --- HPC / Storage / Cloud Linux Cluster OS ---
>
> J> On 5/12/18, 12:33 AM, "Beowulf on behalf of Chris Samuel"
> J> <beowulf-bounces at beowulf.org on behalf of chris at csamuel.org>
> J> wrote:
>
> J> On Wednesday, 9 May 2018 2:34:11 AM AEST Lux, Jim (337K)
> J> wrote:
>
> >> While I’d never claim my pack of beagles is HPC, it does share
> >> some aspects – there’s parallel work going on, the nodes need to
> >> be aware of each other and synchronize their behavior (that is,
> >> it’s not an embarrassingly parallel task that’s farmed out from a
> >> queue), and most importantly, the management has to be scalable.
> >> While I might have 4 beagles on the bench right now – the idea is
> >> to scale the approach to hundreds. Typing “sudo apt-get install
> >> tbd-package” on 4 nodes sequentially might be ok (although pdsh
>     >> and csshx help a lot), but it's not viable for 100 nodes.
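>
> (As an aside, a minimal Python sketch of the pdsh-style fan-out that
> makes this viable at 100+ nodes - the node names are placeholders, and
> the package name is the one from the example above:)
>
>     import subprocess
>     from concurrent.futures import ThreadPoolExecutor
>
>     NODES = [f"node{i:03d}" for i in range(100)]
>     CMD = "sudo apt-get -y install tbd-package"
>
>     def run_on(node):
>         # BatchMode avoids hanging on password prompts for dead nodes
>         r = subprocess.run(["ssh", "-o", "BatchMode=yes", node, CMD],
>                            capture_output=True, text=True, timeout=600)
>         return node, r.returncode
>
>     with ThreadPoolExecutor(max_workers=32) as pool:
>         for node, rc in pool.map(run_on, NODES):
>             if rc != 0:
>                 print(f"{node}: FAILED (rc={rc})")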
>
>     J> At ${JOB-1} we moved to diskless nodes booting RAMdisk
>     J> images from the management node back in 2013 and it worked
>     J> really well for us. You no longer have the issue of nodes
>     J> getting out of step because one of them was down when you ran
>     J> your install of a package across the cluster, it removed HDD
>     J> failures from the picture (though that's likely less of an issue
>     J> with SSDs these days), and did I mention the peace of mind of
>     J> knowing everything is the same? :-)
>
>     J> It's not new; the Blue Gene systems we had (BG/P 2010-2012
>     J> and BG/Q 2012-2016) booted RAMdisks, as they were designed from
>     J> the beginning to scale up to huge systems and to remove as many
>     J> points of failure as possible - no moving parts on the node
>     J> cards, no local storage, no local state.
>
>     J> Where I am now we're pretty much the same, except instead of
>     J> booting a pure RAM disk we boot an initrd that pivots onto an
>     J> image stored on our Lustre filesystem. These nodes do have
>     J> local SSDs for local scratch, but again no real local state.
>
>     J> I think the place where this is going to get hard is on the
>     J> application side of things. There were things like
>     J> Fault-Tolerant MPI (which got subsumed into Open MPI), but it
>     J> still relies on the applications being written to use it and
>     J> cope with failures. Slurm includes fault tolerance support too,
>     J> in that you can request an allocation and, should a node fail,
>     J> have "hot-spare" nodes replace the dead node - but again your
>     J> application needs to be able to cope with it!
>
>     J> It's a fascinating subject, and the exascale folks have been
>     J> talking about it for a while - LLNL's Dona Crawford gave a
>     J> keynote about it at the Slurm User Group in 2013, and it is
>     J> well worth a read.
>
> J> https://slurm.schedmd.com/SUG13/keynote.pdf
>
> J> Slide 21 talks about the reliability/recovery side of things:
>
> J> # Mean time between failures of minutes or seconds for
> J> # exascale
> J> [...]
> J> # Need 100X improvement in MTTI so that applications can run
> J> # for many hours. Goal is 10X improvement in hardware
> J> # reliability. Local recovery and migration may yield another
> J> # 10X. However, for exascale, applications will need to be
> J> # fault resilient
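>
> To put a number on that: with roughly independent node failures, the
> machine-level MTTI is about the per-node MTBF divided by the node count.
> The node MTBF and node counts below are illustrative, not taken from the
> slides:
>
>     HOURS_PER_YEAR = 8766
>     node_mtbf_h = 5 * HOURS_PER_YEAR   # assume a 5-year per-node MTBF
>
>     for nodes in (1_000, 10_000, 100_000):
>         system_mtti_min = node_mtbf_h / nodes * 60
>         print(f"{nodes:>7} nodes -> system MTTI ~{system_mtti_min:.0f} min")
>
> At 100,000 nodes that already lands in the tens-of-minutes range,
> consistent with the "minutes or seconds" figure above.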
>
>     J> She also made the point that checkpoint/restart doesn't
>     J> scale: at exascale you will likely end up spending all your
>     J> compute time doing C/R due to failures and never actually
>     J> getting any work done.
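>
> The classic back-of-the-envelope for that (Young's approximation) says
> that with a checkpoint cost of d hours and a system MTTI of M hours, the
> optimal checkpoint interval is about sqrt(2*d*M) and the fraction of
> machine time lost to checkpointing plus rework is roughly sqrt(2*d/M);
> as M falls towards d, that fraction approaches 1. The numbers below are
> illustrative:
>
>     from math import sqrt
>
>     d = 0.25   # assumed checkpoint cost: 15 minutes to dump state
>     for mtti_h in (24.0, 4.0, 0.5):
>         lost = min(1.0, sqrt(2 * d / mtti_h))
>         print(f"MTTI {mtti_h:>4} h -> ~{lost:.0%} of time lost to C/R")
>
> With a 15-minute checkpoint and a half-hour MTTI, essentially all of the
> machine time goes into C/R, which is exactly the failure mode described
> above.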