[Beowulf] dedupe filesystem
Lawrence Stewart
stewart at serissa.com
Mon Jun 8 05:42:23 PDT 2009
On Jun 8, 2009, at 12:55 AM, Joe Landman wrote:
> Lawrence Stewart wrote:
>
> [...]
>
>>> Yup this is it, but on the fly is the hard part. Doing this
>>> comparison is computationally very expensive. The hash
>>> calculations are not cheap by any measure. You most decidedly do
>>> not wish to do this on the fly ...
>
> The assumption of a high performance disk/file system is implicit
> here.
>
>>>
>>>> And for that, it’s all about trading between cost of storage
>>>> space, retrieval time, and computational effort to run the
>>>> algorithm.
>>>
>>> Exactly.
>> I think the hash calculations are pretty cheap, actually. I just
>> timed sha1sum on a 2.4 GHz Core 2 and it runs at 148 megabytes per
>> second on one core (from the disk cache). That is substantially
>> faster than the disk transfer rate. If you have a parallel
>> filesystem, you can parallelize the hashes as well.
>
> Disk transfer rates are 100-120 MB/s these days. For high
> performance local file systems (N * disks), the data rates you
> showed for sha1 won't cut it, especially if multiple hash
> signatures are computed in order to avoid hash collisions (a la MD5
> et al.). The probability that two different hash functions both
> collide on the same pair of unique blocks is somewhat smaller than
> the probability that a single hash function collides on two (or
> more) unique blocks. So you compute two or more hashes in different
> ways, to reduce the probability of a collision.
>
> For laughs, I just tried this on our lab server. We have a 500 MB/s
> file system attached to it, so we aren't so concerned about data IO
> bottlenecks.
Actual measurements and careful analysis beat handwaving every time :-)
My assumptions were a bit different, I guess: I was still figuring
50 MB/s per spindle, and supposing that the <clients> compute the
hashes, rather than the servers running the disks. If that could be
arranged, then the cores available to compute hashes would scale with
the number of clients.
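For concreteness, here is the sort of client-side chunk hashing I had
in mind, as a rough Python sketch (the 4 KB chunk size, SHA-1, and the
in-memory test buffer are just assumptions for illustration):

    import hashlib
    import os
    import time

    CHUNK = 4096                     # assumed block size
    data = os.urandom(1024 * CHUNK)  # 4 MB of test data, in memory

    start = time.time()
    digests = []
    for off in range(0, len(data), CHUNK):
        # Each client hashes the blocks it is about to write, so the
        # hashing cores scale with the number of clients, not servers.
        digests.append(hashlib.sha1(data[off:off + CHUNK]).digest())
    elapsed = time.time() - start
    print("%.0f MB/s on one core" % (len(data) / elapsed / 1e6))

The point is only that the hashing happens where the data originates;
the throughput will obviously vary with the CPU.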
In any case, I am <not> arguing for deduplication in high-performance
filesystems; I do think it is a decent idea for backups, though.
Regarding hash collisions, the relevant math is the "birthday problem"
I think. If the hash values are uniformly distributed, as they should
be, then the probability of a collision rises to about 1/2 when the
number of blocks reaches the square root of the size of the value
space. So you would have about a 50% chance of a collision
<somewhere> if you have 4 billion blocks (32 bits) and are using
64-bit hashes. If multiple hashes are independent, then you get to
add their bit widths before taking the square root. 256-bit hashes
ought to give
negligible odds of a collision up to 64 bits worth of blocks, where
"negligible" means much less than other sources of permanently lost
data.
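As a back-of-the-envelope check on those numbers (nothing more than a
sketch), the usual birthday approximation p ~= 1 - exp(-n^2 / 2N) can
be plugged in directly:

    import math

    def collision_prob(n_blocks, hash_bits):
        # Birthday approximation: P(collision) ~= 1 - exp(-n^2 / 2N)
        space = 2.0 ** hash_bits
        return -math.expm1(-(n_blocks * n_blocks) / (2.0 * space))

    print(collision_prob(2**32, 64))    # ~0.39: the ~50% regime above
    print(collision_prob(2**32, 128))   # ~3e-20
    print(collision_prob(2**64, 256))   # ~1.5e-39: negligible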
However, it might make more sense from a system design perspective to
use the hash as a hint and actually compare the data. This would
force a random block read to confirm every duplicate. Hmm, let's
convert sequential writes into random reads...
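To make "hash as a hint" concrete, the write path might look roughly
like this (the index and block-store interfaces are made up purely
for illustration):

    import hashlib

    def dedupe_write(block, index, store):
        # index: dict mapping digest -> block address (kept in RAM)
        # store: hypothetical object with read(addr) and append(block)
        digest = hashlib.sha1(block).digest()
        addr = index.get(digest)
        if addr is not None:
            # Hash matched: confirm with a full byte compare -- this
            # is exactly the random read complained about above.
            if store.read(addr) == block:
                return addr          # true duplicate, reuse it
        # No match, or an actual hash collision: store a fresh copy.
        addr = store.append(block)
        index[digest] = addr
        return addr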
The compression ratio of 4 KB blocks to 16-byte hashes is also
suspect: it is only about 256 to one, and the incremental cost ratio
of disk and RAM is not much different. So keeping the hashes in RAM
is probably too expensive.
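To put rough numbers on that (the 4 KB block and 16-byte hash are
from above; the 8-byte block address is my own assumption, and
hash-table overhead is ignored):

    TB = 10**12
    BLOCK = 4096      # bytes per block
    DIGEST = 16       # bytes of hash kept per block
    ADDR = 8          # bytes to locate the block on disk

    blocks = TB // BLOCK
    index_bytes = blocks * (DIGEST + ADDR)
    print("blocks per TB:     %d" % blocks)                   # ~244 million
    print("index size per TB: %.1f GB" % (index_bytes / 1e9)) # ~5.9 GB

That is several gigabytes of RAM per terabyte of disk just for the
index, which is where the economics start to look doubtful.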
So in summary, deduplication is messy and complicated, has bad
performance, and offers uncertain economics for HPC. Let's not do it.
-L