[poppler] pdftohtml and HTML output

Toby Hewlett toby at billminder.co.za
Thu May 17 08:44:57 PDT 2012


Hi,

I've recently upgraded to Poppler v0.18.4 from an older version (v0.40), 
which did not have the -nodrm switch.

The HTML generated by the older version used SPAN tags inside DIVs with 
positional information - for example:
<DIV style="position:absolute;top:1207;left:431"><nobr><span 
class="ft00">Page 1</span></nobr></DIV>

However, I have noticed that 0.18.4 now generates HTML that uses paragraph 
<P> tags and often separates pieces of text with runs of non-breaking 
spaces instead of positioning each piece separately - for example:
<P style="position:absolute;top:270px;left:54px;white-space:nowrap" 
class="ft01">&#160;0214488062 &#160; &#160; &#160; &#160; DSL Fast 
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; 
&#160; &#160; &#160; 04 May 12 - 03 Jun 12 &#160; &#160; &#160; &#160; 
&#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; &#160; 
&#160; &#160; &#160; &#160; &#160; &#160; &#160; R133.33</P>
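
To illustrate the problem, here is a minimal Python sketch (not part of 
poppler; the helper name parse_paragraph and the shortened sample string 
are made up for illustration). It assumes the <P> layout shown above, 
reads top/left from the style attribute and splits the text on runs of 
non-breaking spaces:

```python
import html
import re

# Shortened copy of the <P> element shown above (fewer &#160; entities).
SAMPLE = (
    '<P style="position:absolute;top:270px;left:54px;white-space:nowrap" '
    'class="ft01">&#160;0214488062 &#160; &#160; DSL Fast &#160; &#160; '
    '&#160; 04 May 12 - 03 Jun 12 &#160; &#160; R133.33</P>'
)

# Assumption: style is the first attribute, as in the sample output.
P_RE = re.compile(r'<P style="([^"]*)"[^>]*>(.*?)</P>', re.IGNORECASE | re.DOTALL)


def px(value):
    """Convert a CSS length such as '270px' to an int."""
    return int(re.sub(r"px$", "", value.strip()))


def parse_paragraph(markup):
    """Return (top, left, fields) for the first positioned <P> in markup."""
    match = P_RE.search(markup)
    if match is None:
        raise ValueError("no positioned <P> element found")
    style, body = match.groups()
    # Turn "top:270px;left:54px" into {"top": "270px", "left": "54px", ...}.
    props = dict(item.split(":", 1) for item in style.split(";") if ":" in item)
    # &#160; decodes to U+00A0; a separator is any run of spaces that
    # contains at least one non-breaking space.
    text = html.unescape(body)
    fields = [f for f in re.split(r"[ \xa0]*\xa0[ \xa0]*", text) if f]
    return px(props["top"]), px(props["left"]), fields


if __name__ == "__main__":
    print(parse_paragraph(SAMPLE))
```

All four fields come back with the same single top/left pair, so the 
horizontal position of each individual field can only be guessed from 
the number of spaces in front of it.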

I need accurate positional information for each piece of text, so <P> 
with spaces is not suitable. I therefore need to revert to a version 
that generates HTML with <DIV> and <SPAN> tags, but which still includes 
the -nodrm switch.

Can anyone advise which version of Poppler utils might be suitable?

Thanks!

Regards
Toby Hewlett



