[poppler] pdftohtml outputs hidden text

Piotr Findeisen piotr.findeisen at azouk.com
Wed Nov 4 02:38:32 PST 2009


Hi!

I ran across a problem: pdftohtml and pdftotext sometimes output
hidden text, even when the -hidden switch (in pdftohtml) is not used.
Example:

wget -c http://www2.ing.unipi.it/ew2002/proceedings/114.pdf && \
    pdftotext 114.pdf - | grep 'Picture to be added here'

When you view http://www2.ing.unipi.it/ew2002/proceedings/114.pdf in
Kpdf or Acrobat, you can search for 'Picture to be added here'; the
match is on the first page, right under the "Typical BWA network
layout." image. However, the text is not actually displayed there.

"pdftohtml -xml -i -c -f 1 -l 1 -noframes 114.pdf" lists this text with
the font
<fontspec id="16" size="13" family="Times" color="#0000ff"/>
but gives no indication that the text is not rendered on screen.
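
To make the check above repeatable, here is a minimal sketch of how one
might look up which fontspec a given phrase is attached to in the -xml
output. It assumes the usual layout in which each <text> element carries
a font="N" attribute referring to a <fontspec id="N">. The function name
find_text_fonts and the sample XML (including its coordinates) are made
up for illustration; only the fontspec line is taken from the real
output.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like `pdftohtml -xml` output.
# Only the fontspec line comes from the real file; the rest is invented.
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="16" size="13" family="Times" color="#0000ff"/>
<text top="500" left="100" width="200" height="15" font="16">Picture to be added here</text>
</page>
</pdf2xml>"""


def find_text_fonts(xml_string, phrase):
    """Return (text, fontspec attributes) pairs for <text> elements containing phrase."""
    root = ET.fromstring(xml_string)
    # Index every fontspec by its id so <text font="N"> can be resolved.
    fonts = {f.get("id"): dict(f.attrib) for f in root.iter("fontspec")}
    hits = []
    for text in root.iter("text"):
        # itertext() also collects text nested in inline tags such as <b> or <i>.
        content = "".join(text.itertext())
        if phrase in content:
            hits.append((content, fonts.get(text.get("font"))))
    return hits


if __name__ == "__main__":
    for content, font in find_text_fonts(SAMPLE_XML, "Picture to be added here"):
        print(content, font)
```

On the real file, the XML would come from the pdftohtml command quoted
above rather than from SAMPLE_XML. The attributes returned this way
(id, size, family, color) are all that the fontspec line offers, and
none of them says whether the text is visible.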

Is there some special feature of PDF that causes text to not be
displayed, or to be displayed with 0% opacity?
Is it possible to capture this metadata with pdftohtml, or more
generally with the poppler suite?

best regards,
Piotr
