[Beowulf] network filesystem

Tue Mar 6 10:44:10 PST 2007

On Tue, Mar 06, 2007 at 11:09:18AM -0500, Mark Hahn wrote:
> >I would contend that writing to different sections of a file *must* be
> >supported by any file system deployed on a cluster.  How else would
> >you get good performance from MPI-IO?
> 
> who uses MPI-IO?  straight question - I don't believe any of our 1500 users 
> do.

Excellent question.  Direct users?  Probably not very many.

We do find that straight-up MPI-IO isn't a good fit for a lot of
scientific applications.  The convenience factor you mentioned is
indeed important.  MPI-IO thinks of data as a "stream of bytes", while
applications think in terms of "multidimensional typed data" (a slice
of the upper atmosphere, say).

Libraries like Parallel-HDF5 and Parallel-NetCDF bridge the gap and
provide a convenient, familiar API.  The app is still using MPI-IO,
just not directly.

> NFS certainly does as well.  you just have to know the constraints.
> are you saying you can never get pathological or incorrect results from
> parallel operations on the same file on any of those FS's?

You observe correctly that file systems offer a set of rules on what
to expect from I/O patterns.  These consistency semantics are not set
in stone: MPI-IO consistency semantics are more relaxed than POSIX,
yet generally sufficient for parallel scientific applications.

We would consider it a serious bug in PVFS if simultaneous
non-overlapping writes corrupted data.
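
To make the pattern concrete, here is a minimal sketch of simultaneous
non-overlapping writes to one shared file.  It is not PVFS or MPI-IO
code: threads stand in for MPI ranks, os.pwrite (POSIX only) stands in
for a write at an explicit offset, and the names NPROCS, BLOCK,
write_block and shared_file_write are made up for the illustration.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

NPROCS = 4   # hypothetical number of writers (stand-ins for MPI ranks)
BLOCK = 8    # hypothetical number of bytes owned by each writer


def write_block(fd, rank):
    """Write this rank's block at its own offset; byte ranges never overlap."""
    payload = bytes([ord("A") + rank]) * BLOCK
    os.pwrite(fd, payload, rank * BLOCK)


def shared_file_write(path):
    """Have every writer fill its own section of one file, then read it back."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        with ThreadPoolExecutor(max_workers=NPROCS) as pool:
            # list() forces completion and re-raises any worker exception
            list(pool.map(lambda rank: write_block(fd, rank), range(NPROCS)))
    finally:
        os.close(fd)
    with open(path, "rb") as f:
        return f.read()


def demo():
    """Run the sketch in a temporary directory and return the file contents."""
    with tempfile.TemporaryDirectory() as tmp:
        return shared_file_write(os.path.join(tmp, "shared.dat"))
```

Whatever order the writers run in, the file should end up holding each
writer's block in its own section.  That is the guarantee being
discussed, shown here on a local file system only.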

If the only file system I had access to was NFS, I'd do one file per
process as well. 

> starting with the question: "do you have a good reason to be writing in 
> parallel to the same file?".  I'm not saying the answer is never yes.
> 
> I guess I tend to value portability by obscurity-avoidance.  not if it makes
> life utter hell, of course, but...

One file per processor falls down on systems like BGL (where even a
small run is 1024 processes, and 128k is not unheard of).

One file per process also robs the higher layers of the I/O software
stack of an opportunity to optimize access patterns.  All processes
reading a column out of a row-major array is noncontiguous (and
generally slow) in file-per-processor, but can be contiguous in
single-file after applying data shipping or two-phase collective
buffering optimizations.
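
A small sketch of the access pattern behind that claim, assuming a
single shared file holding a row-major array.  The shape, the element
size and the helper names (column_offsets, runs) are hypothetical, and
this only computes byte ranges; it does not implement data shipping or
two-phase collective buffering.

```python
ROWS, COLS, ITEM = 4, 6, 8   # hypothetical array shape and element size in bytes


def column_offsets(col):
    """Byte offsets of one column of a row-major ROWS x COLS array."""
    return [(row * COLS + col) * ITEM for row in range(ROWS)]


def runs(offsets):
    """Merge offsets of ITEM-byte elements into contiguous (start, length) runs."""
    merged = []
    for off in sorted(offsets):
        if merged and merged[-1][0] + merged[-1][1] == off:
            # This element starts where the previous run ends: extend the run.
            merged[-1] = (merged[-1][0], merged[-1][1] + ITEM)
        else:
            merged.append((off, ITEM))
    return merged


# One process reading one column: ROWS separate small strided requests.
one_column = runs(column_offsets(2))

# All columns taken together (one per process): a single contiguous range.
all_columns = runs([off for col in range(COLS) for off in column_offsets(col)])
```

Seen one process at a time, each column is ROWS small requests; seen
collectively, the requests cover one contiguous byte range, which is
what gives a collective layer something to work with.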

Jeff touched on the data management issues of file-per-processor.

If file-per-processor really is the most portable and convenient way
to work on data, well, I can't argue with that.  On NFS, that's
probably the only way to get correct results.  The single-file
approach, however, has significant benefits on the modern parallel
file systems available today.

As I hope you could tell, this kind of discussion is a lot of fun for
me.  Thanks!

==rob

-- 
Rob Latham
Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
Argonne National Lab, IL USA                 B29D F333 664A 4280 315B