[Libreoffice] Windows installer size reduction effort - the compression theory

Wols Lists antlists at youngman.org.uk
Mon Feb 28 16:21:41 PST 2011


On 28/02/11 21:51, Michael Meeks wrote:
>> Why the surprise? How long has Huffman been around - 30 years? 40? And
>> you CANNOT do better than Huffman (there are other equally as good
>> algorithms, though).

> 	Ho hum; glad to know Huffman compression is optimal ;-) (personally I'd
> go for arithmetic compression with a good "guess a libreoffice" model
> algorithm to get it down to a handful of bits ;-). Notice that ZIP is
> in fact a two-layered compression algorithm: Lempel-Ziv, then Huffman.
> It is not clear that this makes it obviously non-optimal, but it is
> certainly somewhat weaker in that regard.

Interesting ...
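
One way to see the two layers concretely: Python's zlib binding exposes
deflate's strategy knob, and Z_HUFFMAN_ONLY switches the Lempel-Ziv
matching off. A rough sketch - the toy input is made up, and
Z_HUFFMAN_ONLY is only approximately "pure Huffman":

    import zlib

    def deflate(data, strategy):
        c = zlib.compressobj(9, zlib.DEFLATED, 15, 9, strategy)
        return c.compress(data) + c.flush()

    data = b"LibreOffice " * 50000                 # highly repetitive toy input

    both = deflate(data, zlib.Z_DEFAULT_STRATEGY)  # Lempel-Ziv, then Huffman
    huff = deflate(data, zlib.Z_HUFFMAN_ONLY)      # Huffman layer on its own

    print(len(data), len(both), len(huff))

On repetitive input the Lempel-Ziv layer does nearly all the work;
Huffman on its own can never get below one bit per input byte.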

I checked on Wikipedia after I posted, and Huffman coding is actually
SIXTY years old! It was set as a class project and published in 1952.

So zip should be tricky to beat for compression if it actually uses
Huffman, although I would expect it to take ages.
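
For anyone who hasn't seen it, the whole algorithm is tiny - repeatedly
merge the two least frequent subtrees. A minimal sketch (the function
name and test string are mine, purely for illustration); the catch is
that Huffman is only optimal among codes spending a whole number of bits
per symbol, which is why putting Lempel-Ziv in front still pays off:

    import heapq
    from collections import Counter

    def huffman_code(data):
        # One heap entry per symbol: (frequency, tie-break id, {symbol: bits})
        heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(Counter(data).items())]
        heapq.heapify(heap)
        uid = len(heap)
        while len(heap) > 1:
            n1, _, c1 = heapq.heappop(heap)     # pull the two rarest subtrees,
            n2, _, c2 = heapq.heappop(heap)     # prefix their codes and merge
            merged = {s: "0" + bits for s, bits in c1.items()}
            merged.update({s: "1" + bits for s, bits in c2.items()})
            heapq.heappush(heap, (n1 + n2, uid, merged))
            uid += 1
        return heap[0][2]

    for sym, bits in sorted(huffman_code(b"an example").items(), key=lambda x: x[1]):
        print(chr(sym), bits)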

Your figures imply either (a) zip doesn't do Huffman, or (b) bzip2 and
lzma use a much larger input block size.
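
(b) is easy to test, since deflate can only point back 32 KiB while
xz/lzma's dictionary runs to megabytes. Something like this - the sizes
are picked arbitrarily - makes the difference obvious:

    import bz2, lzma, os, zlib

    block = os.urandom(100000)        # 100 kB of incompressible noise
    data = block * 10                 # same block repeated, 100 kB apart

    print("zlib:", len(zlib.compress(data, 9)))   # repeats sit outside the
                                                  # 32 KiB window: no gain
    print("bz2: ", len(bz2.compress(data, 9)))    # works in 900 kB blocks
    print("lzma:", len(lzma.compress(data)))      # multi-MiB dictionary sees
                                                  # every repeat

Move the repeats to under 32 KiB apart and zlib closes most of the gap.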

Was the wink to say you knew it was optimal already? I thought that was
common knowledge :-) although reading Wikipedia reminded me that there
is a pathological case - long runs of identical elements. Given that I
gather Windows files contain large sections of nulls, Huffman won't like
that! That's probably the main time you *do* want double-compression -
some form of RLE will flatten any such pathological input before it
reaches Huffman.
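
To put a number on it: pure Huffman spends at least one bit per input
byte, so a megabyte of nulls can never shrink below roughly 128 kB,
while a trivial RLE pass in front collapses it almost entirely. A rough
sketch - the RLE format here is made up, and zlib's Z_HUFFMAN_ONLY only
approximates pure Huffman:

    import zlib

    def huffman_only(data):
        # deflate with Lempel-Ziv matching disabled - roughly pure Huffman
        c = zlib.compressobj(9, zlib.DEFLATED, 15, 9, zlib.Z_HUFFMAN_ONLY)
        return c.compress(data) + c.flush()

    def rle(data):
        # naive (count, value) pairs, runs capped at 255 bytes
        out, i = bytearray(), 0
        while i < len(data):
            j = i
            while j < len(data) and j - i < 255 and data[j] == data[i]:
                j += 1
            out += bytes((j - i, data[i]))
            i = j
        return bytes(out)

    nulls = b"\x00" * 1000000
    print(len(huffman_only(nulls)))       # stuck near 1 bit/byte: ~125 kB
    print(len(huffman_only(rle(nulls))))  # the run is gone before Huffman
                                          # ever sees it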

The main reason I suggested adding times to KAMI's table was that it
makes explicit the time/compression tradeoff.
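
Something along these lines would generate the extra timing column -
with the real installer payload in place of this stand-in data, of
course:

    import bz2, lzma, time, zlib

    data = b"stand-in for the installer payload " * 200000   # ~7 MB

    for name, compress in [("zlib -9", lambda d: zlib.compress(d, 9)),
                           ("bzip2 -9", lambda d: bz2.compress(d, 9)),
                           ("xz/lzma", lambda d: lzma.compress(d))]:
        start = time.perf_counter()
        size = len(compress(data))
        print("%-8s %9d bytes %7.2f s" % (name, size, time.perf_counter() - start))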

Cheers,
Wol

