[Beowulf] network filesystem
Robert Latham
robl at mcs.anl.gov
Tue Mar 6 10:44:10 PST 2007
On Tue, Mar 06, 2007 at 11:09:18AM -0500, Mark Hahn wrote:
> >I would contend that writing to different sections of a file *must* be
> >supported by any file system deployed on a cluster. How else would
> >you get good performance from MPI-IO?
>
> who uses MPI-IO? straight question - I don't believe any of our 1500 users
> do.
Excellent question. Direct users? Probably not very many.
We do find that straight-up MPI-IO isn't a good fit for a lot of
scientific applications. The convenience factor you mentioned is
indeed important. MPI-IO thinks of data as a "stream of bytes", while
applications think in terms of "multidimensional typed data" (a slice
of the upper atmosphere).
Libraries like Parallel-HDF5 and Parallel-NetCDF bridge the gap and
provide a convenient, familiar API. The app is still using MPI-IO,
just not directly.
> NFS certainly does as well. you just have to know the constraints.
> are you saying you can never get pathological or incorrect results from
> parallel operations on the same file on any of those FS's?
You observe correctly that file systems offer a set of rules on what
to expect from I/O patterns. These consistency semantics are not set
in stone: MPI-IO consistency semantics are more relaxed than POSIX,
yet generally sufficient for parallel scientific applications.
We would consider it a serious bug in PVFS if simultaneous
non-overlapping writes corrupted data.
If the only file system I had access to was NFS, I'd do one file per
process as well.
> starting with the question: "do you have a good reason to be writing in
> parallel to the same file?". I'm not saying the answer is never yes.
>
> I guess I tend to value portability by obscurity-avoidance. not if it makes
> life utter hell, of course, but...
One file per processor falls down on systems like Blue Gene/L (BGL),
where even a small run is 1024 processes and 128k processes is not
unheard of.
One file per process also robs the higher layers of the I/O software
stack of an opportunity to optimize access patterns. All processes
reading a column out of a row-major array is noncontiguous (and
generally slow) in file-per-processor, but can be contiguous in
single-file after applying data shipping or two-phase collective
buffering optimizations.
Jeff touched on the data management issues of file-per-processor.
If file-per-processor really is the most portable and convenient way
to work on data, well, I can't argue with that. On NFS, that's
probably the only way to get correct results. The single-file
approach, however, has significant benefits on the modern parallel
file systems available today.
As I hope you could tell, this kind of discussion is a lot of fun for
me. Thanks!
==rob
--
Rob Latham
Mathematics and Computer Science Division A215 0178 EA2D B059 8CDF
Argonne National Lab, IL USA B29D F333 664A 4280 315B