[Beowulf] integrating node disks into a cluster filesystem?
Mark Hahn
hahn at mcmaster.ca
Fri Sep 25 15:09:11 PDT 2009
Hi all,
I'm sure you've noticed that disks are incredibly cheap, obscenely large and
remarkably fast (at least in bandwidth). the "cheap" part is the only one
of these that's really an issue, since the question becomes: how to keep
storage infrastructure cost (overhead) from dominating the system cost?
the backblaze people took a great swing at this - their solution is really
centered on 5-disk port-multiplier backplanes. (I would love to hear
from anyone who has experience with PMs, btw.)
but since 1U nodes are still the most common HPC building block, and most
of them support 4 LFF SATA disks at very little added cost (especially using
the chipset's integrated controller), is there a way to integrate those disks
into a whole-cluster filesystem?
- obviously we want to minimize the interference of remote IO with a node's
own jobs. for serial jobs, this is almost moot. for loosely-coupled parallel
jobs (whether threaded or cross-node), this is probably non-critical. even for
tightly-coupled jobs, perhaps it would be enough to reserve a core for
admin/filesystem overhead (a sketch of what I mean follows the list below.)
- iSCSI/AoE approach: export the local disks via a low-level block protocol
and RAID them together on dedicated fileserving node(s). not only does
this address the probability of node failure, but a block protocol might
be simple enough to avoid deadlock (i.e. a job does IO, allocating memory for
pagecache, then network packets, which may by chance wind up triggering
network activity back to the same node, and yet more allocations for the
underlying disk IO.) the second sketch below shows roughly what I mean.
- distributed filesystem (ceph? gluster? please post any experience!) I
know it's possible to run OSS+OST services on a lustre client, but it's not
recommended because of the same deadlock issue.
- this is certainly related to more focused systems like google/mapreduce,
but I'm mainly interested in general-purpose clusters - the space would
be used for normal files, and definitely mixed read/write with something
close to normal POSIX semantics...
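
for the core-reservation idea above, here's a minimal sketch of what I mean
(python, untested; the core numbers, the vbladed arguments and the job binary
are just placeholders for whatever a real node would run):

  import os
  import subprocess

  # assumption: an 8-core node; core 7 is reserved for storage/export duties,
  # cores 0-6 are left for the compute job.  adjust to the real core count.
  STORAGE_CORE = {7}
  JOB_CORES = set(range(7))

  def launch_pinned(cmd, cores):
      """start cmd with its CPU affinity restricted to the given cores."""
      def set_affinity():
          os.sched_setaffinity(0, cores)   # 0 == the calling (child) process
      return subprocess.Popen(cmd, preexec_fn=set_affinity)

  # hypothetical example: pin the AoE export daemon to the reserved core,
  # and the user's job to the remaining cores (affinity is inherited, so it
  # survives vbladed putting itself in the background).
  export = launch_pinned(["vbladed", "0", "1", "eth0", "/dev/sdb"], STORAGE_CORE)
  job    = launch_pinned(["./my_mpi_rank"], JOB_CORES)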
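
and to make the block-protocol bullet concrete, here's roughly what the
fileserving side of an AoE setup could look like (again untested; assumes
aoetools and mdadm, with the md device and RAID level as placeholders):

  import glob
  import subprocess

  # assumption: the aoe kernel module is loaded on the fileserving node and
  # each compute node exports its disks with vblade/vbladed, so the targets
  # show up here as /dev/etherd/e<shelf>.<slot>.
  subprocess.run(["aoe-discover"], check=True)          # from aoetools
  targets = sorted(glob.glob("/dev/etherd/e*.*"))

  # aggregate the exported disks into one RAID array; level 6 is just an
  # example -- pick whatever redundancy matches the expected node-failure rate.
  subprocess.run(
      ["mdadm", "--create", "/dev/md0",
       "--level=6",
       "--raid-devices=%d" % len(targets)] + targets,
      check=True)

  # /dev/md0 then carries an ordinary filesystem (or gets re-exported over
  # NFS, or used as a lustre OST) like any other block device.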
thanks, mark hahn.