[poppler] pdftohtml: invert mask + heuristic

Albert Astals Cid aacid at kde.org
Wed Apr 11 15:20:46 PDT 2012


On Saturday, 31 March 2012, at 01:23:17, Ihar `Philips` Filipau wrote:
> Hi!
> 
> First, an admission. When introducing the original pdftohtml's "mask
> extraction as PNG" patch, it seems I got the colors wrong. Yeah, masks
> have only two of them - and I got them wrong. Both. :)
> 
> But just to be sure let me ask: 0.0 is black? and 1.0 is white?
> 
> On the recommendation of Leonard Rosenthol, who kindly quoted some
> documentation for me and hinted where to look further, I have tried to
> come up with a method to detect the mask inversion.
> 
> Note that this mask inversion is different from the decode array
> inversion. It is not even a real inversion. A simple example (as I
> have it in my documents) is that the mask itself looks like a
> negative, and the background/foreground colors in the document are
> swapped. The negative mask is used to paint white on a black
> background, while the rest of the document has a white background.
> 
> Since pdftohtml cannot know the background, I use a simple
> heuristic: if getFillGray() is greater than 0.5, I assume the mask is
> painted with a light color over a dark one, and thus the mask's
> inversion flag (as indicated by the decode array) should be inverted.
> (This relies on the presumption that the background of most documents
> is light.)
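
A minimal sketch of the heuristic described in the quoted paragraph, not the code from invert-mask-001.diff. The function name and parameters are hypothetical: `decodeInverts` stands for the inversion flag derived from the decode array, and `fillGray` for the value the message obtains from getFillGray().

```cpp
// Hypothetical helper illustrating the heuristic; not taken from the patch.
// decodeInverts: inversion flag as indicated by the mask's decode array.
// fillGray:      gray level of the current fill color (0.0 black, 1.0 white).
bool shouldInvertMask(bool decodeInverts, double fillGray)
{
    // A fill gray above 0.5 is taken to mean "light paint on a dark
    // background", on the assumption that most pages are light overall.
    const bool lightFill = fillGray > 0.5;

    // A light fill flips whatever the decode array asked for.
    return decodeInverts != lightFill;
}
```

With a dark fill the decode array's flag passes through unchanged; with a light fill it is flipped, which is exactly the guess the reply below objects to.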
> 
> Attached are two different (and conflicting) patches.
> 
> - proper-mask-color-001.diff
>    One-liner to clear my conscience. Use proper non-inverted colors
> for PNG. My original error stemmed from me reading libpng
> documentation. Indeed, libpng requires bit flip for the special case
> of monochrome images. But. PNGWriter doesn't use monochrome PNGs - it
> uses the grayscale instead, which doesn't require bit/byte flip.

Committed.
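
For illustration, the grayscale point in the quoted patch description can be sketched as follows. In PNG grayscale, sample value 0 is black and the maximum value is white, so a 1-bit sample can be widened to 8 bits without a flip. The helper below is hypothetical and is not PNGWriter's code; it assumes samples are packed most significant bit first and that `packed` holds at least `width` bits.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: expand one row of packed 1-bit samples (MSB first)
// into 8-bit grayscale bytes. Bit 0 becomes 0x00 (black) and bit 1 becomes
// 0xFF (white); passing invert = true swaps the two.
// Precondition (assumed): packed.size() * 8 >= width.
std::vector<std::uint8_t> expandMaskRow(const std::vector<std::uint8_t> &packed,
                                        std::size_t width, bool invert)
{
    std::vector<std::uint8_t> gray(width);
    for (std::size_t x = 0; x < width; ++x) {
        // Extract bit x of the row, counting from the most significant bit.
        const bool bit = ((packed[x / 8] >> (7 - x % 8)) & 1) != 0;
        gray[x] = (bit != invert) ? 0xFF : 0x00;
    }
    return gray;
}
```

Because the output is plain 8-bit gray, any inversion is an explicit parameter here and does not depend on how a monochrome PNG would be interpreted.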

> 
> - invert-mask-001.diff
>   Implement inversion of the mask, if that is required by the decode
> array or background/foreground colors appear to be swapped. The
> heuristic is just 4 lines, probably unreliable but "works for me" -
> and thus I will not object to the 4 lines being removed.

I don't think it makes sense to do this. A mask is a mask, not an image, and 
should be extracted as a mask, imho. Either that or don't extract it at all; 
trying to guess will only cause problems.

Albert

> 
> Thanks and have a nice weekend!

