[poppler] pdftohtml: invert mask + heuristic

Ihar `Philips` Filipau thephilips at gmail.com
Fri Mar 30 16:23:17 PDT 2012


Hi!

First, an admission. When I introduced the original pdftohtml "mask
extraction as PNG" patch, it seems I got the colors wrong. Yeah, masks
have only two of them - and I got them wrong. Both. :)

But just to be sure, let me ask: is 0.0 black and 1.0 white?

On the recommendation of Leonard Rosenthol, who kindly quoted some
documentation for me and hinted where to look further, I have tried to
come up with a method to detect the mask inversion.

Note that this mask inversion is different from the decode array
inversion. It is not even a real inversion. A simple example (as I have
it in my documents): the mask itself looks like a negative, and the
background/foreground colors in the document are swapped. The negative
mask is used to paint white on a black background, while the rest of
the document has a white background.

Since pdftohtml cannot know the background, I use a simple heuristic:
if getFillGray() is greater than 0.5, I assume the mask is painted with
a light color over a dark one, and thus the mask's inversion flag (as
indicated by the decode array) should be flipped. (This relies on the
presumption that the background of most documents is light.)
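
A minimal sketch of the heuristic described above, not the code from
the attached patch. The helper name shouldInvertMask and its parameters
are hypothetical, and the fill gray is assumed to be available as a
plain value in the range 0.0 (black) to 1.0 (white):

```cpp
// Hypothetical helper, not taken from the attached patch.
// decodeInverts: true if the decode array already asks for inversion.
// fillGray: current fill color as a gray level in [0.0, 1.0], where
//           0.0 is black and 1.0 is white (assumed to be derived from
//           whatever getFillGray() reports).
bool shouldInvertMask(bool decodeInverts, double fillGray)
{
    // A light fill color suggests the mask paints light on dark,
    // i.e. background and foreground look swapped.
    const bool looksSwapped = fillGray > 0.5;

    // Flip the decode-array flag when the colors look swapped.
    return decodeInverts != looksSwapped;
}
```

With a light fill and a default decode array the sketch asks for
inversion; with a dark fill it leaves the decode array's flag alone.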

Attached are two different (and conflicting) patches.

- proper-mask-color-001.diff
   One-liner to clear my conscience. Use proper non-inverted colors
for the PNG. My original error stemmed from my reading of the libpng
documentation. Indeed, libpng requires a bit flip for the special case
of monochrome images. But PNGWriter doesn't use monochrome PNGs - it
uses grayscale instead, which doesn't require a bit/byte flip.

- invert-mask-001.diff
  Implement inversion of the mask if that is required by the decode
array or if the background/foreground colors appear to be swapped. The
heuristic is just 4 lines, probably unreliable but "works for me" -
so I will not object to the 4 lines being removed.
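
To illustrate how the two patches fit together, here is a rough sketch
of expanding one row of 1-bit mask samples into 8-bit grayscale bytes
with an optional inversion. It is not the patch code: the function name
expandMaskRow is hypothetical, and the mapping of sample 1 to white
(0xFF) and sample 0 to black (0x00) is an assumption made for the
example. Samples are assumed to be packed with the high-order bit first.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: expand packed 1-bit mask samples (high-order
// bit first) into one 8-bit grayscale byte per pixel.
// Assumed mapping: sample 1 -> 0xFF (white), sample 0 -> 0x00 (black);
// 'invert' swaps the two.
std::vector<std::uint8_t> expandMaskRow(const std::uint8_t *packed,
                                        std::size_t width, bool invert)
{
    std::vector<std::uint8_t> row(width);
    for (std::size_t x = 0; x < width; ++x) {
        // Extract bit x of the row, counting from the high-order bit.
        const bool bit = ((packed[x / 8] >> (7 - x % 8)) & 1) != 0;
        // Grayscale bytes need no extra bit flip; only 'invert' matters.
        row[x] = (bit != invert) ? 0xFF : 0x00;
    }
    return row;
}
```

The invert argument would be the combined result of the decode array
and the heuristic; everything else stays a plain bit-to-byte expansion.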

Thanks and have a nice weekend!
-------------- next part --------------
A non-text attachment was scrubbed...
Name: invert-mask-001.diff
Type: text/x-patch
Size: 2789 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/poppler/attachments/20120331/e329406e/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: proper-mask-color-001.diff
Type: text/x-patch
Size: 467 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/poppler/attachments/20120331/e329406e/attachment-0001.bin>

