[Beowulf] dedupe filesystem

Wed Jun 3 02:54:38 PDT 2009

On Tue, 2 Jun 2009, John Hearns wrote:

> In HPC one would hope that all files are different, so 
> de-duplication would not be a feature which you need.

I beg to differ, at least in the academic environment where I come 
from. Imagine these 2 scenarios:

1. copying between users

 	step1: PhD student does a good job, produces data and writes
 		thesis, then leaves the group but keeps data around
 		because the final paper is still not written; in
 		his/her new position, there's no free time so the
 		final paper advances very slowly; however, the data
 		can't be moved to a slow medium because it's still
 		actively worked on
 	step2: another PhD student takes over the project and does
 		something else that needs the data, so a copy is
 		created(*).
 	step3: a short-term practical project by an undergraduate
 		student who collaborates with the step2 PhD student needs
 		access to the data; as the undergrad student is not
 		trusted (he/she can make mistakes that delete/modify
 		the data), another copy is created

(*) copies are created for various reasons:
 	privacy or intellectual property - people protect their data
 		using Unix file access rights or ACLs, the copying is
 		done with their explicit consent, either by them or by
 		the sysadmin.
 	fear of change - people writing up (or hoping to) don't want
 		their data to change, so that they can, e.g., go back and
 		redo the graph that the reviewer asked for. They are
 		particularly paranoid about their data and would prefer
 		copying it to allowing other people to access it directly.
 	laziness - there can be technical solutions for the above 2
 		reasons, but if the people involved don't want to make
 		the effort to use them, copying seems like a much easier
 		solution.

2. copying between machines

    Data is stored on a group file server or on the cluster where it
    was created, but needs to be copied somewhere else for a more
    efficient (mostly from an I/O point of view) analysis. A copy is
    made, but later on people don't remember why the copy was made or
    whether the data was modified in any way. Sometimes the
    results of the analysis (which can be very small compared with the
    actual data) are stored there as well, making the whole set look
    like a "package" worthy of being stored together, independent of
    the original data. This "package" can be copied back (so the two
    copies live in the same file system) or can remain separate (which
    can make it harder to detect as copies).
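
    To illustrate how such copies could be found after the fact, here
    is a minimal Python sketch. It works at the whole-file level only
    (a dedupe filesystem would typically work on blocks) and the names
    file_digest and find_duplicates are made up for this example.
    Files are first grouped by size, and only files sharing a size are
    hashed, so most of a large data set is never read:

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(roots):
    """Group regular files under the given directories by identical content.

    Returns a list of groups; each group is a sorted list of paths whose
    size and SHA-256 digest match.
    """
    # Pass 1: group by size, which only needs a stat() per file.
    by_size = defaultdict(list)
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.isfile(path) and not os.path.islink(path):
                    by_size[os.path.getsize(path)].append(path)

    # Pass 2: hash only the files that share their size with another file.
    by_digest = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        for path in paths:
            by_digest[(size, file_digest(path))].append(path)

    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

    Calling find_duplicates(["/path/to/original", "/path/to/package"])
    (both paths are placeholders) would list the files that exist in
    both places with identical content. It only helps when both copies
    are reachable from the same machine, and hard links to the same
    file are reported as duplicates too.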

I do mean all of this in an HPC environment - the analysis mentioned 
above can involve repeatedly reading files ranging from tens of 
GB to TB (for the moment...). Even if the analysis itself doesn't run 
as a parallel job, several (many) such jobs can run at the same time 
looking for different parameters. [ the above scenarios actually come 
from practice - not imagination - and are written with molecular 
dynamics simulations in mind ]

Also don't forget backup - an HPC resource is usually backed up, to 
avoid loss of data which was obtained with precious CPU time (and 
maybe an expensive interconnect, memory, etc).

-- 
Bogdan Costescu

IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: bogdan.costescu at iwr.uni-heidelberg.de