[Beowulf] How Can Microsoft's HPC Server Succeed?
Donald Becker
becker at scyld.com
Wed Apr 9 14:45:01 PDT 2008
On Sat, 5 Apr 2008, Anand Vaidya wrote:
> On Fri, Apr 4, 2008 at 6:19 PM, Geoff Galitz <geoff at galitz.org> wrote:
>
> Having said that, I think that the Linux clustering scene needs a little
> competition, especially the for-fee ones. Apart from SDSC, not many
> innovations are happening.
> I am not referring to standalone projects, where the
> FOSS community has a lot of innovation happening, but rather one integrated
> Linux Cluster on a DVD that gets you a cluster ready in 20mins, with no
> pain at all. ROCKS comes with its own problems, esp, wrt updates (which is
> why I stopped using ROCKS), however they are working on this one, AFAIK.
I delayed responding to this, since I expected that someone else would talk
about it. (Joe did, but only a little.)
Scyld published the first "cluster on a disk" in late 2000. It was a
single install disk that asked two or three extra questions over a
standard Linux install, installed in about the same time as the underlying
distribution (essentially RedHat, so about 20 minutes) and could boot
about 500 slave nodes in about a minute over Fast Ethernet.(1)
Two years later our demo version was a live CD, so it required zero install
on the master as well. A single live CD could boot a 1000 node cluster and run
one of a few toy apps.
A sad thing for me is that we can no longer publish a similar CD. We were
heavily marketed against for being an integrated system, even to the point
of implying Scyld wasn't really Linux. In the end we had to change how we
deliver our system to make it clear. Today we have a two step install,
starting with a generic Linux distribution (typically CentOS or RHEL) and
later adding our packages. With this packaging we can no longer have a live
CD that acts the same as the installed version.
(1) A drawback of machines in that era was that they didn't have network
booting built in. We had to invent our own network booting system,
BeoBoot, and have it support every possible network adapter.
Operationally it was a PITA since it required every node to first boot
off of a floppy, CD, flash or tiny hard disk partition. So you first
had to write/burn a bunch of floppy/CD-R disks, and reading them
delayed booting so that it was more like 2 minutes to boot 100
slave nodes.
We put a bunch of effort into making this boot process admin-free. The
BeoBoot system uses a stable kernel to download the operational version
from the master. This both makes updating the kernel a single-point
effort and eliminates the risk of making the whole cluster
unbootable with a flawed update.
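To make the two-stage idea concrete, here is a toy sketch of the node-side
logic. It is not BeoBoot code: the master address, URL paths, and the use of
kexec are assumptions for illustration only (the original system predates
kexec).

# Sketch of a two-stage network boot: a small, stable phase-1 kernel
# fetches the current operational kernel from the master and boots it.
# Host, port, and paths are illustrative, not BeoBoot's actual protocol.
import subprocess
import urllib.request

MASTER = "http://192.168.1.1:8090"       # assumed master address

def fetch(path, dest):
    """Download a file from the master and write it locally."""
    with urllib.request.urlopen(f"{MASTER}/{path}") as r, open(dest, "wb") as f:
        f.write(r.read())

def boot_operational_kernel():
    # Phase 1: grab whatever kernel and initrd the master currently serves.
    fetch("boot/vmlinuz", "/tmp/vmlinuz")
    fetch("boot/initrd.img", "/tmp/initrd.img")
    # Phase 2: hand control to the downloaded kernel (kexec is an
    # assumption standing in for the original hand-off mechanism).
    subprocess.run(["kexec", "-l", "/tmp/vmlinuz",
                    "--initrd=/tmp/initrd.img",
                    "--append=console=ttyS0"], check=True)
    subprocess.run(["kexec", "-e"], check=True)

if __name__ == "__main__":
    boot_operational_kernel()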
> So, here's what the FOSS community, especially, vendors (RH, Novell) should
> be doing, specifically for a HPC oriented version:
>
> - remove all unwanted packages (desktop software, multimedia, web browsers
> etc)
We have a better way, driven by long experience. Don't go through
the error-prone process of figuring out a minimal system. (Modern RPM
systems will pull in almost everything anyway, yet still omit a critical
tool.) Instead do a full, standard install and configuration on the
master and use it as your reference.
For the compute nodes start from zero, and build from there. First, use
the network boot system to figure out what kernel they should run, and
have the master pass it that kernel plus the network driver. Then the
master asks the node what hardware it has, and uses its local
configuration to figure out what kernel modules plus configuration info
are needed to support it.
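In toy form (not our implementation, and the PCI table entries are just
familiar examples), the master-side step is little more than a table lookup
against the master's own module configuration:

# Sketch: the master maps a node's reported PCI IDs to the kernel
# modules it should push.  The table and node below are made-up examples.
PCI_TO_MODULE = {
    ("8086", "1229"): "eepro100",   # example: Intel EtherExpress Pro/100
    ("10b7", "9200"): "3c59x",      # example: 3Com 3c905C
}

def modules_for_node(reported_pci_ids):
    """Return the kernel modules a compute node needs, given the
    (vendor, device) pairs it reported during boot."""
    needed = []
    for vendor, device in reported_pci_ids:
        module = PCI_TO_MODULE.get((vendor, device))
        if module and module not in needed:
            needed.append(module)
    return needed

# Example: a node that reported one known NIC and one unknown device.
print(modules_for_node([("8086", "1229"), ("dead", "beef")]))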
Then whenever you start a job on the compute node, verify that it has the
currently correct version of the executable and libraries. If it doesn't
(and "I got nothin" is the same as having the wrong version), copy it
over. Don't page it in, which results in unpredictable performance. Just
do a single transfer and cache the whole executable/library to
linear memory. It's the application you are about to run, and with a zero
install the node is only running compute applications, so you won't be
wasting memory.
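A toy version of that check-then-copy step, with made-up paths and a tmpfs
cache standing in for the node's memory:

# Sketch: before running a job, make sure the node has the current
# executable; if not, pull the whole file in one linear transfer and
# cache it locally so execution never demand-pages over the network.
import hashlib
from pathlib import Path

NODE_CACHE = Path("/dev/shm/app-cache")   # assumed RAM-backed cache on the node

def file_hash(path):
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def ensure_cached(fetch_fn, expected_hash, name):
    """fetch_fn() returns the file's bytes from the master; copy them
    into the cache only when the cached copy is missing or stale
    ("I got nothin'" counts as stale)."""
    NODE_CACHE.mkdir(parents=True, exist_ok=True)
    cached = NODE_CACHE / name
    if not cached.exists() or file_hash(cached) != expected_hash:
        cached.write_bytes(fetch_fn())      # one transfer, whole file cached
    return cached

# Example (hypothetical names):
# exe = ensure_cached(lambda: fetch_from_master("my_solver"), exe_hash, "my_solver")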
> - package SGE, Ganglia,
Pretty much a given... you need at least a mapper-scheduler and a
monitoring system. You can do slightly better than these, but it's easier
to do much worse.
> - a good clustering toolkit, maybe derived from ROCKS scripts (I am biased
> towards IBM xcat, because that is the only tool I use)
Why point to Rocks as an example? Like so many other "cluster
systems" it's a non-architecture, an ad hoc system. It's a packaging and
support exercise, not innovation. It's a giant step back to the Windows
world where simple administration, such as adding new nodes, is done by
re-installation.
> - LDAP as the default auth source, setup SSH for clusterwide passwordless
> logins by default
Both high-cost, sub-optimal choices for normal operation.
We implement a cluster-specific name service that handles most name
queries very quickly, and pointedly without network transactions. We fall
back to other services only when the application asks about external
things e.g. other users or non-cluster hosts. (We recently added our own
network-fallback service so that the master can resolve these without
configuring NIS/LDAP/AD on compute nodes.)
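The idea, in toy form (the node naming scheme and subnet are invented for
the example, not our actual name service):

# Sketch: answer queries about cluster nodes from purely local
# knowledge, with no network traffic, and fall back to the normal
# resolver only for outside names.
import socket

CLUSTER_PREFIX = "n"
CLUSTER_SUBNET = "192.168.1."

def resolve(name):
    suffix = name[len(CLUSTER_PREFIX):]
    if name.startswith(CLUSTER_PREFIX) and suffix.isdigit():
        return CLUSTER_SUBNET + str(int(suffix) + 100)   # local computation only
    # External users/hosts: fall back to DNS/NIS/LDAP via the system.
    return socket.gethostbyname(name)

print(resolve("n17"))        # answered locally, no network transaction
print(resolve("scyld.com"))  # falls back to the normal resolver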
Standard 'ssh' is slow to start jobs, and not precise about the
environment and executables. You solve the first problem by building
persistent network connections between the head and compute nodes,
authenticating only once.
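A toy sketch of that pattern, with an invented wire format standing in for
the real protocol; the point is that the connection and authentication cost
is paid once per node, not once per job:

# Sketch: authenticate to each compute node once, keep the connection
# open, and reuse it for every subsequent job launch instead of paying
# ssh's per-job startup cost.  Port, secret, and messages are made up.
import json
import socket

class NodeChannel:
    def __init__(self, host, port=7000, secret=b"shared-secret"):
        self.sock = socket.create_connection((host, port))
        self.sock.sendall(secret + b"\n")          # authenticate once
        if self.sock.recv(3) != b"OK\n":
            raise RuntimeError("authentication failed")

    def run(self, argv, env=None):
        """Send one job-launch request over the existing connection."""
        request = json.dumps({"argv": argv, "env": env or {}}) + "\n"
        self.sock.sendall(request.encode())
        return self.sock.recv(4096).decode()        # e.g. exit status

# Example: one authenticated channel per node, many jobs over each.
# nodes = [NodeChannel(f"n{i}") for i in range(100)]
# for n in nodes:
#     n.run(["/dev/shm/app-cache/my_solver", "--input", "data.in"])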
> - package a selection of top20 FLOSS science apps (Gromacs, Phylip, Blast,
> MPICH, fasta, fftw etc)
Libraries are mostly easy. We automate what we can, but we have learned
that the interesting apps require human configuration or tuning.
> - package and provide one click installation for restricted-ware such as
> NAMD, or commercial software such as Intel Compilers, Fluent, Amber etc. It
> CAN be done, Ubuntu has demonstrated how to do it well.
We've done this by providing demo-license versions where possible, such as
with the Intel compilers. But most HPC ISVs don't have the resources to
be flexible. I don't see any HPC distribution+app installation being as
easy as Ubuntu for at least a few years. Even if we jump up and down and
point, screaming "It's easy. They do it. They even show how to do it."
> - package and provide easy install of a parallel filesystem such as GFS or
> Lustre
We shipped integrated PVFS starting with our second release, including
funding the PVFS guys to make it easy to configure. Over the years we
have included a few others, but preconfigured support for advanced
distributed and cluster file systems hasn't justified the effort and cost.
We now sometimes include the kernel modules, but configuration is done
as a professional service or by customers that are already experts.
--
Donald Becker becker at scyld.com
Penguin Computing / Scyld Software
www.penguincomputing.com www.scyld.com
Annapolis MD and San Francisco CA