[Beowulf] Selection from processor choices; Requesting Guidance
Mark Hahn
hahn at physics.mcmaster.ca
Thu Jun 15 08:39:08 PDT 2006
> > 1. One processor at each of the compute nodes
> > 2. Two processors (on one mother board) at each of the compute nodes
> > 3. Two processors (each one a dual-core processor; total 4 cores on one
> > mother board) at each of the compute nodes
> > 4. Four processors (on one mother board) at each of the compute nodes.
not considering a 4x2 configuration?
> > Initially, we are deciding to use a Gigabit Ethernet switch and 1GB of
> > RAM at each node.
that seems like an odd choice. it's not much ram, and gigabit is
extremely slow (relative to alternatives, or in comparison to on-board
memory access.)
> I've heard many times that memory throughput is extremely important
> in CFD and that using 1 cpu/1 core per node (or 2 single-core
> Opterons with independent memory channels) is in some cases better
> than any sharing of memory bus(es).
I've heard that too - it's a shame someone doesn't simply use the
profiling registers to look at cache hit-rates on these codes...
but I'd be somewhat surprised if modern CFD codes were entirely
mem-bandwidth-dominated, that is, that they wouldn't make some use
of the cache. my very general observation is that it's getting to be
unusual to encounter code which has as "flat" a memory reference
pattern as Stream - just iterating over whole swaths of memory
sequentially. advances such as mesh adaptation, etc. tend to make
memory references less sequential (more random, but also touching
fewer overall bytes, and thus possibly more cache-friendly.)
of course, I'm just an armchair CFD'er ;)
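
to make the "flat vs. not" distinction concrete, here's a minimal C sketch
of the two access patterns I have in mind - my own illustration, not code
from any of these packages; the array size and index array are invented:

/* a minimal sketch (mine, not from the codes discussed) contrasting a
 * Stream-like "flat" sweep with the indirect, gather-style access an
 * unstructured/adaptive mesh code tends to produce.  the array size and
 * the index array are invented purely for illustration. */
#include <stddef.h>

#define N (1 << 22)   /* ~4M doubles per array, an assumed size */

/* Stream-style triad: walks whole arrays sequentially with no reuse,
 * so it is limited almost entirely by memory bandwidth. */
void triad(double *a, const double *b, const double *c, double s)
{
    for (size_t i = 0; i < N; i++)
        a[i] = b[i] + s * c[i];
}

/* gather through an index array: references are less sequential, but if
 * idx[] keeps revisiting a smaller set of nodes (as mesh connectivity
 * often does), much of that set can stay resident in the L2. */
void gather_update(double *a, const double *b, const int *idx, double s)
{
    for (size_t i = 0; i < N; i++)
        a[i] += s * b[idx[i]];
}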
in short, it's important not to disregard memory bandwidth, but
6.4 GB/s is quite a bit, and may not be a problem on a dual-core system
where each core has 1MB L2 to itself. especially since 1GB/system
implies that the models are not huge in the first place.
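
just to show the kind of back-of-envelope arithmetic behind "quite a bit",
here's a tiny sketch; every number in it (bytes of traffic per grid point,
model size, etc.) is an assumption, not a measurement:

/* back-of-envelope version of the claim above; every number here is an
 * assumption (6.4 GB/s per socket, 2 cores sharing it, ~24 bytes of
 * memory traffic per grid point, a model sized to fit in ~1GB). */
#include <stdio.h>

int main(void)
{
    double socket_bw   = 6.4e9;  /* bytes/s shared by both cores        */
    int    cores       = 2;      /* dual-core, one memory controller    */
    double bytes_point = 24.0;   /* assumed traffic per point per sweep */
    double points      = 1.0e7;  /* ~10M points, fits a 1GB node        */

    double bw_per_core = socket_bw / cores;                /* 3.2 GB/s */
    double sweep_s     = points * bytes_point / bw_per_core;

    printf("per-core share of bandwidth: %.1f GB/s\n", bw_per_core / 1e9);
    printf("time for one bandwidth-bound sweep: %.3f s\n", sweep_s);
    return 0;
}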
that said, I find that CFDers tend not to aspire to running on large
numbers of processors. so a cluster of 4x2 machines (which aim to
run mostly <= 8p jobs on single nodes) might be very nice. there are
nice side-effects to having fatter nodes, especially if your workload
is not embarrassingly parallel.
(we should have terminology to describe other levels of parallel coupling -
"mortifyingly parallel", for instance. I think "shamefully parallel" is a
great description of people who wrap serial jobs in an MPI wrapper
gratuitously. and how about "immodestly parallel" for coupled jobs that
scale well, but still somewhat sub-linearly?)
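
for concreteness, here's roughly what I mean by the "shamefully parallel"
pattern - a hypothetical, minimal MPI wrapper whose only job is to launch
copies of a serial program (the solver name and input naming are made up):

/* purely for illustration: the "shamefully parallel" pattern - a
 * do-nothing MPI wrapper whose ranks each launch one copy of a serial
 * program.  the program name and input naming are hypothetical. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int  rank;
    char cmd[256];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* each rank simply shells out to the same serial solver on its own input */
    snprintf(cmd, sizeof(cmd), "./serial_solver input.%d > output.%d", rank, rank);
    system(cmd);

    MPI_Finalize();
    return 0;
}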