[poppler] pdftohtml outputs hidden text
Albert Astals Cid
aacid at kde.org
Wed Nov 4 12:05:12 PST 2009
On Wednesday, 4 November 2009, Piotr Findeisen wrote:
> I ran across a problem: pdftohtml and pdftotext sometimes output
> hidden text, even when not using the -hidden switch (in pdftohtml).
> wget -c http://www2.ing.unipi.it/ew2002/proceedings/114.pdf && pdftotext
> 114.pdf - | grep 'Picture to be added here'
> When you view http://www2.ing.unipi.it/ew2002/proceedings/114.pdf in
> Kpdf or Acrobat, you can search for 'Picture to be added here'. The match
> is on the first page, right under the "Typical BWA network layout." image,
> but the text is not actually displayed there.
> "pdftohtml -xml -i -c -f 1 -l 1 -noframes 114.pdf" lists this text as
> <fontspec id="16" size="13" family="Times" color="#0000ff"/>
> but it gives no clue that the text is not printed on the screen.
> Is this some special feature of PDF that causes some text not to be
> displayed, or to be displayed with 0% opacity?
From a quick look at the PDF's code, it seems the file sets up a clip path
that lies outside the area where the text is drawn, so effectively nothing
is rendered.
> Is it possible to capture this meta data with pdftohtml or generally
> with poppler suite?
It is, but you'd have to make the text tools take the clip areas into
account, which is not an easy task.
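
To illustrate the idea only, here is a rough sketch in Python. It is not
poppler code and does not use any poppler API: the Rect and Word types and
the visible_words function are hypothetical names made up for this example.
It also assumes the clip area is a plain rectangle, whereas a real PDF clip
path can be an arbitrary shape built from several nested clips, which is a
large part of why the real job is hard.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in page coordinates (hypothetical helper)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def intersects(self, other: "Rect") -> bool:
        # True only when the two rectangles overlap with positive area.
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


@dataclass(frozen=True)
class Word:
    """A piece of extracted text (hypothetical; not a poppler type)."""
    text: str
    bbox: Rect  # where the word would be drawn
    clip: Rect  # clip rectangle in effect when the word was drawn


def visible_words(words: Iterable[Word]) -> List[Word]:
    """Keep only the words whose bounding box overlaps their clip area."""
    return [w for w in words if w.bbox.intersects(w.clip)]
```

A text tool working along these lines would have to record the current clip
together with each word while it walks the page, and then drop (or flag)
the words that fall entirely outside it before writing the output.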
> best regards,