[Beowulf] Accelerator for data compressing
Vincent Diepeveen
diep at xs4all.nl
Fri Oct 3 08:55:08 PDT 2008
The question is, Joe:
Why are you storing it uncompressed?
Vincent
On Oct 3, 2008, at 5:45 PM, Joe Landman wrote:
> Carsten Aulbert wrote:
>
>> If 7-zip can only compress data at a rate of less than, say, 5 MB/s
>> (input data), I can copy the data over uncompressed much faster,
>> regardless of how many unused cores I have in the system. Exactly
>> for these cases I would like to use all available cores to compress
>> the data fast enough to increase the throughput.
>
> This is fundamentally the issue. If the compression time plus the
> transmit time for the compressed data is greater than the transmit
> time for the uncompressed data, then the compression may not be
> worth it. Sure, if it is nothing but text files, you may get 60-80+%
> compression rates. But for the case of (non-pathological) binary
> data, it might be only a few percent. So in this case, even if
> you could get a few percent delta from the compression, is that
> worth all the extra time you spend to get it?
>
> At the end of the day the question is how much lossless compression
> can you do in a short enough time for it to be meaningful in terms
> of transmitting the data?
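>
> One quick way to check on a given setup is to time both paths on a
> small but representative sample of the data before committing to the
> full transfer. A rough sketch (directory and host names are
> placeholders, and the remote side just discards the stream):
>
> cd /sample_directory
> time tar -cpf - ./ | ssh remotehost "cat > /dev/null"
> time tar -czpf - ./ | ssh remotehost "cat > /dev/null"
>
> Whichever variant finishes faster on the sample is likely the better
> choice for the full data set, assuming the sample compresses like
> the rest.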
>
>> Do I miss something vital?
>
> Nope. You got it nailed.
>
> Several months ago, I tried moving about 600 GB of data from an old
> server to a JackRabbit. The old server and the JackRabbit had a
> gigabit link between them. We regularly saw 45 MB/s scp rates (one
> of the chips in the older server was a Broadcom).
>
> I tried this with and without compression. With compression
> (simple gzip), the copy took something like 28 hours (a little
> more than a day). Without compression, it was well under 10 hours.
>
> In this case, compression (gzip) was not worth it. The commands I
> used for the test were
>
> uncompressed:
>
> cd /directory
> tar -cpf - ./ | ssh jackrabbit "cd /directory ; tar -xpvf - "
>
> compressed:
>
> cd /directory
> tar -czpf - ./ | ssh jackrabbit "cd /directory ; tar -xzpvf - "
>
> if you want to spend more time, use "j" (bzip2) rather than "z" in
> the tar options.
>
> YMMV, but I have been convinced that, apart from specific use cases
> with text-only documents or documents known to compress quickly and
> well, compression prior to transfer may waste more time than it
> saves.
>
> This said, if someone has a parallel hack of gzip or similar we can
> pipe through, by all means, I would be happy to try it. But it
> would have to be pretty darned efficient.
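>
> (Parallel gzip implementations do exist, pigz being one. If it were
> installed on both ends, a sketch of the same copy might look like
> this, untested here, with the same host and path as above:
>
> cd /directory
> tar -cpf - ./ | pigz | ssh jackrabbit "cd /directory ; pigz -d | tar -xpf - "
>
> pigz compresses with all available cores by default; -p N limits it
> to N threads.)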
>
> 100 MB/s means 1 byte transmitted, on average, every 10 nanoseconds.
> Which means that for compression to be meaningful, you would need to
> compute for less time than that per byte to increase the information
> density. Put another way, 1 MB takes about 10 ms to send over a
> gigabit link. For compression to be meaningful, you need to
> compress this 1 MB in far less than 10 ms and still transmit it in
> that time. Any compression algorithm has to walk through the data
> at least once, and a 1 GB/s memory subsystem takes about 1 ms just
> to read this 1 MB once, so you need as few passes as possible
> through the data set to construct the compressed representation, as
> you will still have on the order of 1E+5 bytes to send.
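>
> (One way to see where a given machine stands is to measure local
> gzip throughput and ratio on a representative file and compare the
> MB/s against the roughly 100 MB/s wire rate. A rough sketch, with
> sample.dat as a placeholder:
>
> ls -l sample.dat
> time gzip -1 -c sample.dat | wc -c
>
> The input size divided by the elapsed time is the compression rate;
> the wc output divided by the input size is the ratio.)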
>
> I am not saying it is hopeless, just hard for complex compression
> schemes (bzip2, etc.). When we get enough firepower in the CPU (or
> maybe GPU ... hmmmm) the situation may improve.
>
> GPU as a compression engine? Interesting ...
>
> Joe
>
>> Cheers
>> Carsten
>
> --
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics LLC,
> email: landman at scalableinformatics.com
> web : http://www.scalableinformatics.com
> http://jackrabbit.scalableinformatics.com
> phone: +1 734 786 8423 x121
> fax : +1 866 888 3112
> cell : +1 734 612 4615
>