[poppler] pdftohtml: invert mask + heuristic

Wed Apr 11 16:04:05 PDT 2012

On 4/12/12, Albert Astals Cid <aacid at kde.org> wrote:
> On Saturday, 31 March 2012, at 01:23:17, Ihar `Philips` Filipau wrote:
>
> Committed.

Thanks.

>> - invert-mask-001.diff
>>   Implement inversion of the mask, if that is required by the decode
>> array or background/foreground colors appear to be swapped. The
>> heuristic is just 4 lines, probably unreliable but "works for me" -
>> and thus I will not object to the 4 lines being removed.
>
> Don't think it makes sense to do this, a mask is a mask, not an image, and
> should be extracted as a mask, imho. Or just don't extract it, but try to
> guess stuff will result in problems.

I had this in mind when I posted the patches. I disagree with the "try
to guess stuff" comment, but that's OK. (In short: pdftotext, pdftohtml
and friends are all about guessing stuff - guessing words, guessing
lines, guessing colors, guessing fonts. Heck, TextOutputDev is part of
poppler (not poppler-utils!) and it guesses at hyphenation - a much,
much worse offense, as it modifies the text being extracted from the
PDF.)

Of all the PDFs I have gone through (more than 50 now), not a single
one had a mask used as a mask - all mask images were used exclusively
to represent a monochrome image: a gothic chapter delimiter, a diagram
or a logo. But never mind, at least now we extract the images, and they
can be postprocessed manually later: ImageMagick's `convert -negate`
does the job. It's not the worst I have seen in a PDF.
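
For reference, a minimal sketch of what that negation amounts to for a
1-bit image. It assumes the pixels are stored as packed rows, MSB
first, each row padded to a whole byte (as in raw PBM data). The
function name is made up for illustration; this is not the poppler
patch, nor its heuristic.

```python
def negate_packed_rows(data: bytes, width: int, height: int) -> bytes:
    """Flip every pixel of a 1-bit image stored as packed rows (MSB first).

    Hypothetical helper for illustration only; not part of poppler.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")

    row_bytes = (width + 7) // 8
    if len(data) != row_bytes * height:
        raise ValueError("data length does not match width and height")

    # Padding bits at the end of each row are not pixels; keep them zero.
    pad_bits = row_bytes * 8 - width
    last_byte_mask = (0xFF << pad_bits) & 0xFF

    out = bytearray(b ^ 0xFF for b in data)
    for row in range(height):
        out[row * row_bytes + row_bytes - 1] &= last_byte_mask
    return bytes(out)
```

Applying it twice gives back the original data, as long as the padding
bits were zero to begin with.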

best regards, happy weekend and, of course, have a nice release! ;)