[poppler] pdftohtml outputs hidden text

Wed Nov 4 12:05:12 PST 2009

On Wednesday, 4 November 2009, Piotr Findeisen wrote:
> Hi!
> 
> I ran across a problem: pdftohtml and pdftotext sometimes output hidden
> text, even when the -hidden switch (in pdftohtml) is not used.
> Example:
> 
> wget -c http://www2.ing.unipi.it/ew2002/proceedings/114.pdf && pdftotext
> 114.pdf - | grep 'Picture to be added here'
> 
> When you view http://www2.ing.unipi.it/ew2002/proceedings/114.pdf in
> Kpdf or Acrobat, you can search for 'Picture to be added here'. The match
> is on the first page, right under the "Typical BWA network layout." image,
> but the text is not actually displayed there.
> 
> "pdftohtml -xml -i -c -f 1 -l 1 -noframes 114.pdf" lists this text as
> <fontspec id="16" size="13" family="Times" color="#0000ff"/>
> but it gives no clue that the text is not printed on the screen.
> 
> Is this some special feature of PDF that causes some text not to be
> displayed, or to be displayed with 0% opacity?

From a quick look at the code, it seems to create a clip path that lies
outside the area where the text is drawn, so effectively nothing is rendered.

> Is it possible to capture this metadata with pdftohtml, or generally
> with the poppler suite?

It is, but you would have to make the text tools take the clip areas into
account, which is not an easy task.
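
To illustrate the idea only, here is a minimal Python sketch of clip-aware
text filtering. It is not poppler code and uses no poppler API: the Rect and
Word types, the visible_words function, and all coordinates are hypothetical.
It also assumes rectangular clip areas, whereas real PDF clip paths can be
arbitrary shapes and text may be only partially clipped, which is a large
part of why the real change is hard.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in page coordinates (hypothetical helper type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def intersect(self, other: "Rect") -> Optional["Rect"]:
        """Return the overlapping region, or None if there is no overlap."""
        x0, y0 = max(self.x0, other.x0), max(self.y0, other.y0)
        x1, y1 = min(self.x1, other.x1), min(self.y1, other.y1)
        if x0 >= x1 or y0 >= y1:
            return None
        return Rect(x0, y0, x1, y1)


@dataclass(frozen=True)
class Word:
    """Extracted text plus the clip rectangle that was active when it was drawn."""
    text: str
    bbox: Rect
    clip: Rect


def visible_words(words: List[Word]) -> List[Word]:
    """Keep only words whose bounding box overlaps their clip rectangle."""
    return [w for w in words if w.bbox.intersect(w.clip) is not None]


# Made-up coordinates, loosely modelled on the page described in the report.
page = Rect(0, 0, 612, 792)
words = [
    Word("Typical BWA network layout.", Rect(100, 400, 300, 412), clip=page),
    # The clip rectangle lies entirely away from the text, so nothing is painted.
    Word("Picture to be added here", Rect(100, 380, 260, 392),
         clip=Rect(0, 0, 10, 10)),
]
print([w.text for w in visible_words(words)])
```

In this sketch a word is dropped only when its bounding box misses the clip
rectangle completely; a real implementation would also have to decide what to
do with partially clipped text.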

Albert

> 
> best regards,
> Piotr
>