[Libreoffice] Windows installer size reduction effort - the compression theory
Michael Meeks
michael.meeks at novell.com
Mon Feb 28 13:51:55 PST 2011
Hi Wols,
On Mon, 2011-02-28 at 18:04 +0000, Wols Lists wrote:
> > So - personally, I find it amazing that ZIP is better than LZMA - ever,
> > it is such an older compression algorithm, and this flies in the face of
> > everything I've seen in practise with it [ eg. we use LZMA for our RPMs
> > in openSUSE these days AFAIR, better than bzip2, which in turn was
> > better than .tgz (essentially zip) ].
>
> Why the surprise? How long has Huffman been around - 30 years? 40? And
> you CANNOT do better than Huffman (there are other equally as good
> algorithms, though).
Ho hum; glad to know Huffman compression is optimal ;-) (personally I'd
go for arithmetic compression with a good "guess a libreoffice" model
alogorithm to get it down to a handful of bits ;-). Notice - that ZIP is
in fact two layered compression algorithm: Lempel Ziv, and then Huffman,
it is not clear that this makes it obviously non-optimal but it is
certainly somewhat weaker in that regard.
#!/usr/bin/perl -w
for (my $i = 0; $i < 10000; $i++) { print "$i\n"; }
type size
plain 48k
zip 23k
bzip2 11k
lzma 3k
That is not a very representative input (of course ;-), but perhaps not
so far away from the rampant duplication we face. I would (personally)
expect to see lzma do better than plain deflate. Of course, you can
easily imagine a compression algorithm, that turns the file into just
those two lines of perl above ;-)
> Having tried to back up a hard disk over ethernet (dd if=/dev/sda
> of=//samba/archive.hdi), it was direly slow over a 10Mb card. So I tried
> to pipe it through bzip2 to speed up the transfer - it was *even*
> *slower* !
bzip2 is not fast; that is certainly so, OTOH it was far better space
wise than the deflate it replaced; this is why much source code is still
packaged as .bz2 - if you want to bend your mind, read up on the
Burrows-Wheeler transform [ FWIW ;-].
> Why do you think LZMA is better than zip - because it's faster?
More efficient. Slowness of compression we can cope with easily, as
long as decompression is fast: and for most algorithms this balance is
easyish to achieve.
> (I think we need to add compression time to KAMI's nice table :-)
As long as it is within five or so minutes of CPU time on a fast
machine, and/or preferably is multi-thread-able [ though our choice of
tools and platforms strongly limits what we can use at the cab level
anyway ] - compression time is fairly irrelevant.
ATB,
Michael.
--
michael.meeks at novell.com <><, Pseudo Engineer, itinerant idiot
More information about the LibreOffice
mailing list