[Beowulf] dedupe filesystem
Bogdan Costescu
Bogdan.Costescu at iwr.uni-heidelberg.de
Wed Jun 3 02:54:38 PDT 2009
On Tue, 2 Jun 2009, John Hearns wrote:
> In HPC one would hope that all files are different, so
> de-duplication would not be a feature which you need.
I beg to differ, at least in the academic environment where I come
from. Imagine these 2 scenarios:
1. copying between users
step1: PhD student does a good job, produces data and writes
thesis, then leaves the group but keeps data around
because the final paper is still not written; in
his/her new position, there's no free time so the
final paper advances very slowly; however data can't
be taken on a slow medium because it's still actively
worked on
step2: another PhD student takes over the project and does
something else that needs the data, so a copy is
created(*).
step3: an undergraduate student doing short-term practical work
in collaboration with the step2 PhD student needs
access to the data; as the undergrad student is not
trusted (he/she can make mistakes that delete/modify
the data), another copy is created
(*) copies are created for various reasons:
privacy or intellectual property - people protect their data
using Unix file access rights or ACLs, the copying is
done with their explicit consent, either by them or by
the sysadmin.
fear of change - people writing up (or hoping to) don't want
their data to change, so that they can e.g. go back and
redo the graph that a reviewer asked for. They are
particularly paranoid about their data and would prefer
copying to allowing other people to access it directly.
laziness - there are technical solutions for the above 2
reasons, but if the people involved don't want to make
the effort to use them, copying seems like a much easier
solution.
2. copying between machines
Data is stored on a group file server or on the cluster where it
was created, but needs to be copied somewhere else for a more
efficient (mostly from an I/O point of view) analysis. A copy is made,
but later on people don't remember why the copy was made or whether
the data was modified in any way. Sometimes the
results of the analysis (which can be very small compared with the
actual data) are stored there as well, making the whole set look
like a "package" worthy of being stored together, independent of
the original data. This "package" can be copied back (so the two
copies live in the same file system) or can remain separate (which
can make the files harder to detect as copies).
I do mean all of this in an HPC environment - the analysis mentioned
above can involve reading, several times over, files ranging from tens
of GB to TB (for the moment...). Even if the analysis itself doesn't run
as a parallel job, several (many) such jobs can run at the same time
looking for different parameters. [ the above scenarios actually come
from practice - not imagination - and are written with molecular
dynamics simulations in mind ]
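
To get an idea of how much of this copied data a dedupe filesystem could
actually reclaim, one can group files by size and then by a content hash:
everything beyond the first copy in a group is reclaimable space. Below is
a minimal sketch of such a scan (my own illustration in Python, not an
existing tool; the helper names are made up, and for multi-TB trees one
would sample or hash at block level rather than whole files):

#!/usr/bin/env python3
# Illustrative sketch: estimate how many bytes under a directory tree are
# exact whole-file duplicates, i.e. the minimum a dedupe filesystem could
# reclaim. Files are grouped by size first (cheap), then only files whose
# size repeats are hashed.
import hashlib
import os
import sys
from collections import defaultdict

def sha256_of(path, bufsize=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(bufsize)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()

def duplicate_bytes(root):
    by_size = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            p = os.path.join(dirpath, name)
            if os.path.isfile(p) and not os.path.islink(p):
                by_size[os.path.getsize(p)].append(p)
    wasted = 0
    for size, paths in by_size.items():
        if size == 0 or len(paths) < 2:
            continue
        by_hash = defaultdict(list)
        for p in paths:
            by_hash[sha256_of(p)].append(p)
        for same in by_hash.values():
            # every copy beyond the first is space a dedupe FS could save
            wasted += size * (len(same) - 1)
    return wasted

if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    print("exact-duplicate bytes under %s: %d" % (root, duplicate_bytes(root)))

Note that whole-file hashing only finds identical copies; a dedupe
filesystem working at block level can also catch the partially modified
"packages" from scenario 2.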
Also don't forget backup - an HPC resource is usually backed up, to
avoid loss of data which was obtained with precious CPU time (and
maybe an expensive interconnect, memory, etc.).
--
Bogdan Costescu
IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: bogdan.costescu at iwr.uni-heidelberg.de