[Beowulf] How Can Microsoft's HPC Server Succeed?
Donald Becker
becker at scyld.com
Wed Apr 9 14:45:01 PDT 2008
On Sat, 5 Apr 2008, Anand Vaidya wrote:
> On Fri, Apr 4, 2008 at 6:19 PM, Geoff Galitz <geoff at galitz.org> wrote:
>
> Having said that, I think that the Linux clustering scene needs a little
> competition, especially the for-fee ones. Apart from SDSC, not many
> innovations are happening.
> I am not referring to standalone projects, where the
> FOSS community has a lot of innovation happening, but rather one integrated
> Linux Cluster on a DVD that gets you a cluster ready in 20mins, with no
> pain at all. ROCKS comes with its own problems, esp, wrt updates (which is
> why I stopped using ROCKS), however they are working on this one, AFAIK.
I delayed responding to this, since I expected that someone else would talk
about it. (Joe did, but only a little.)
Scyld published the first "cluster on a disk" in late 2000. It was a
single install disk that asked two or three extra questions over a
standard Linux install, installed in about the same time as the underlying
distribution (essentially RedHat, so about 20 minutes) and could boot
about 500 slave nodes in about a minute over Fast Ethernet.(1)
Two years later our demo version was a live CD, so it required zero install
on the master as well. A single live CD could boot a 1000 node cluster and run
one of a few toy apps.
A sad thing for me is that we can no longer publish a similar CD. We were
heavily marketed against for being an integrated system, even to the point
of implying Scyld wasn't really Linux. In the end we had to change how we
deliver our system to make it clear. Today we have a two step install,
starting with a generic Linux distribution (typically CentOS or RHEL) and
later adding our packages. With this packaging we can no longer have a live
CD that acts the same as the installed version.
(1) A drawback of machines in that era was that they didn't have network
booting built in. We had to invent our own network booting system,
BeoBoot, and have it support every possible network adapter.
Operationally it was a PITA since it required every node to first boot
off of a floppy, CD, flash or tiny hard disk partition. So you first
had to write/burn a bunch of floppy/CD-R disks, and reading them
delayed booting so that it was more like 2 minutes to boot 100
slave nodes.
We put a bunch of effort into making this boot process admin-free. The
BeoBoot system uses a stable kernel to download the operational version
from the master. This both makes updating the kernel a single-point
effort and eliminates the risk of making the whole cluster
unbootable with a flawed update.
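To make the two-stage idea concrete, here is a toy sketch of the node-side
logic. It is not BeoBoot code: the master address, URL paths, and the use of
kexec are assumptions for illustration only (the original system predates
kexec).

# Sketch of a two-stage network boot: a small, stable phase-1 kernel
# fetches the current operational kernel from the master and boots it.
# Host, port, and paths are illustrative, not BeoBoot's actual protocol.
import subprocess
import urllib.request

MASTER = "http://192.168.1.1:8090"       # assumed master address

def fetch(path, dest):
    """Download a file from the master and write it locally."""
    with urllib.request.urlopen(f"{MASTER}/{path}") as r, open(dest, "wb") as f:
        f.write(r.read())

def boot_operational_kernel():
    # Phase 1: grab whatever kernel and initrd the master currently serves.
    fetch("boot/vmlinuz", "/tmp/vmlinuz")
    fetch("boot/initrd.img", "/tmp/initrd.img")
    # Phase 2: hand control to the downloaded kernel (kexec is an
    # assumption standing in for the original hand-off mechanism).
    subprocess.run(["kexec", "-l", "/tmp/vmlinuz",
                    "--initrd=/tmp/initrd.img",
                    "--append=console=ttyS0"], check=True)
    subprocess.run(["kexec", "-e"], check=True)

if __name__ == "__main__":
    boot_operational_kernel()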
> So, here's what the FOSS community, especially, vendors (RH, Novell) should
> be doing, specifically for a HPC oriented version:
>
> - remove all unwanted packages (desktop software, multimedia, web browsers
> etc)
We have a better way, driven by long experience. Don't go through
the error-prone process of figuring out a minimal system. (Modern RPM
systems will pull in almost everything anyway, yet still omit a critical
tool.) Instead do a full, standard install and configuration on the
master and use it as your reference.
For the compute nodes start from zero, and build from there. First, use
the network boot system to figure out what kernel they should run, and
have the master pass it that kernel plus the network driver. Then the
master asks the node what hardware it has, and uses its local
configuration to figure out what kernel modules plus configuration info
are needed to support it.
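In toy form (not our implementation, and the PCI table entries are just
familiar examples), the master-side step is little more than a table lookup
against the master's own module configuration:

# Sketch: the master maps a node's reported PCI IDs to the kernel
# modules it should push.  The table and node below are made-up examples.
PCI_TO_MODULE = {
    ("8086", "1229"): "eepro100",   # example: Intel EtherExpress Pro/100
    ("10b7", "9200"): "3c59x",      # example: 3Com 3c905C
}

def modules_for_node(reported_pci_ids):
    """Return the kernel modules a compute node needs, given the
    (vendor, device) pairs it reported during boot."""
    needed = []
    for vendor, device in reported_pci_ids:
        module = PCI_TO_MODULE.get((vendor, device))
        if module and module not in needed:
            needed.append(module)
    return needed

# Example: a node that reported one known NIC and one unknown device.
print(modules_for_node([("8086", "1229"), ("dead", "beef")]))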
Then whenever you start a job on the compute node, verify that it has the
currently correct version of the executable and libraries. If it doesn't
(and "I got nothin" is the same as having the wrong version), copy it
over. Don't page it in, which results in unpredictable performance. Just
do a single transfer and cache the whole executable/library to
linear memory. It's the application you are about to run, and with a zero
install the node is only running compute applications, so you won't be
wasting memory.
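A toy version of that check-then-copy step, with made-up paths and a tmpfs
cache standing in for the node's memory:

# Sketch: before running a job, make sure the node has the current
# executable; if not, pull the whole file in one linear transfer and
# cache it locally so execution never demand-pages over the network.
import hashlib
from pathlib import Path

NODE_CACHE = Path("/dev/shm/app-cache")   # assumed RAM-backed cache on the node

def file_hash(path):
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def ensure_cached(fetch_fn, expected_hash, name):
    """fetch_fn() returns the file's bytes from the master; copy them
    into the cache only when the cached copy is missing or stale
    ("I got nothin'" counts as stale)."""
    NODE_CACHE.mkdir(parents=True, exist_ok=True)
    cached = NODE_CACHE / name
    if not cached.exists() or file_hash(cached) != expected_hash:
        cached.write_bytes(fetch_fn())      # one transfer, whole file cached
    return cached

# Example (hypothetical names):
# exe = ensure_cached(lambda: fetch_from_master("my_solver"), exe_hash, "my_solver")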
> - package SGE, Ganglia,
Pretty much a given... you need at least a mapper-scheduler and a
monitoring system. You can do slightly better than these, but it's easier
to do much worse.
> - a good clustering toolkit, maybe derived from ROCKS scripts (I am biased
> towards IBM xcat, because that is the only tool I use)
Why point to Rocks as an example? Like so many other "cluster
systems" it's a non-architecture, an ad hoc system. It's a packaging and
support exercise, not innovation. It's a giant step back to the Windows
world where simple administration, such as adding new nodes, is done by
re-installation.
> - LDAP as the default auth source, setup SSH for clusterwide passwordless
> logins by default
Both high-cost, sub-optimal choices for normal operation.
We implement a cluster-specific name service that handles most name
queries very quickly, and pointedly without network transactions. We fall
back to other services only when the application asks about external
things e.g. other users or non-cluster hosts. (We recently added our own
network-fallback service so that the master can resolve these without
configuring NIS/LDAP/AD on compute nodes.)
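The idea, in toy form (the node naming scheme and subnet are invented for
the example, not our actual name service):

# Sketch: answer queries about cluster nodes from purely local
# knowledge, with no network traffic, and fall back to the normal
# resolver only for outside names.
import socket

CLUSTER_PREFIX = "n"
CLUSTER_SUBNET = "192.168.1."

def resolve(name):
    suffix = name[len(CLUSTER_PREFIX):]
    if name.startswith(CLUSTER_PREFIX) and suffix.isdigit():
        return CLUSTER_SUBNET + str(int(suffix) + 100)   # local computation only
    # External users/hosts: fall back to DNS/NIS/LDAP via the system.
    return socket.gethostbyname(name)

print(resolve("n17"))        # answered locally, no network transaction
print(resolve("scyld.com"))  # falls back to the normal resolver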
Standard 'ssh' is slow to start jobs, and not precise about the
environment and executables. You solve the first problem by building
persistent network connections between the head and compute nodes,
authenticating only once.
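A toy sketch of that pattern, with an invented wire format standing in for
the real protocol; the point is that the connection and authentication cost
is paid once per node, not once per job:

# Sketch: authenticate to each compute node once, keep the connection
# open, and reuse it for every subsequent job launch instead of paying
# ssh's per-job startup cost.  Port, secret, and messages are made up.
import json
import socket

class NodeChannel:
    def __init__(self, host, port=7000, secret=b"shared-secret"):
        self.sock = socket.create_connection((host, port))
        self.sock.sendall(secret + b"\n")          # authenticate once
        if self.sock.recv(3) != b"OK\n":
            raise RuntimeError("authentication failed")

    def run(self, argv, env=None):
        """Send one job-launch request over the existing connection."""
        request = json.dumps({"argv": argv, "env": env or {}}) + "\n"
        self.sock.sendall(request.encode())
        return self.sock.recv(4096).decode()        # e.g. exit status

# Example: one authenticated channel per node, many jobs over each.
# nodes = [NodeChannel(f"n{i}") for i in range(100)]
# for n in nodes:
#     n.run(["/dev/shm/app-cache/my_solver", "--input", "data.in"])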
> - package a selection of top20 FLOSS science apps (Gromacs, Phylip, Blast,
> MPICH, fasta, fftw etc)
Libraries are mostly easy. We automate what we can, but we have learned
that the interesting apps require human configuration or tuning.
> - package and provide one click installation for restricted-ware such as
> NAMD, or commercial software such as Intel Compilers, Fluent, Amber etc. It
> CAN be done, Ubuntu has demonstrated how to do it well.
We've done this by providing demo-license versions where possible, such as
with the Intel compilers. But most HPC ISVs don't have the resources to
be flexible. I don't see any HPC distribution+app installation being as
easy as Ubuntu for at least a few years. Even if we jump up and down and
point, screaming "It's easy. They do it. They even show how to do it."
> - package and provide easy install of a parallel filesystem such as GFS or
> Lustre
We shipped integrated PVFS starting with our second release, including
funding the PVFS guys to make it easy to configure. Over the years we
have included a few others, but preconfigured support for advanced
distributed and cluster file systems hasn't justified the effort and cost.
We now sometimes include the kernel modules, but configuration is done
as a professional service or by customers that are already experts.
--
Donald Becker becker at scyld.com
Penguin Computing / Scyld Software
www.penguincomputing.com www.scyld.com
Annapolis MD and San Francisco CA