[poppler] poppler util pdftohtml

Peter A. Kerzum kerzum at yandex-team.ru
Fri Sep 23 04:44:04 PDT 2011


On Friday 23 September 2011 15:12:28 Leonard Rosenthol wrote:
> On 9/23/11 6:38 AM, "Jonathan Kew" <jfkthame at googlemail.com> wrote:
> >Once you start dealing with whole paragraphs, multiple columns, table
> >cells, etc, etc, things only get worse.... you may get good results for a
> >limited class of documents (e.g. unidirectional LTR text, fairly simple
> >block layouts), but the general problem for arbitrary PDF documents is
> >MUCH harder.
> 
> Agreed 100%!
> 
> Which is why I WISH I could convince more PDF production tools to generate
> tagged/structured PDF!

That is very nice to hear from you =)
Actually, a consistent ToUnicode mapping should be a good compromise, as 
higher-level software can segment the text into regions of different languages 
based solely on their alphabets, and then detect and correct the text flow for 
each particular region.

This way the example

   english WERBEH

should generally work, being decomposed into two regions with the latter reversed.
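
To make the idea concrete, here is a rough Python sketch of it. Everything in it is my own assumption and has nothing to do with poppler code: the names script_of, split_regions and fix_flow are made up, the "alphabet" of a character is guessed from the first word of its Unicode name, and only Hebrew and Arabic are treated as right-to-left. It also assumes the extracted text is in visual order and that the ToUnicode mapping is consistent, so every character is what it claims to be.

```python
import unicodedata

# Assumption: only these two scripts are treated as right-to-left.
RTL_SCRIPTS = {"HEBREW", "ARABIC"}


def script_of(ch):
    """Guess the alphabet of a character from its Unicode name.

    Returns None for anything that is not a letter (spaces, digits,
    punctuation, combining marks), which is treated as neutral.
    """
    if not ch.isalpha():
        return None
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def split_regions(text):
    """Split text into (script, chunk) regions based only on alphabets.

    Neutral characters are attached to the region that precedes them;
    leading neutrals join the first region that has a script.
    """
    regions = []
    for ch in text:
        script = script_of(ch)
        if regions and script in (None, regions[-1][0]):
            regions[-1][1] += ch
        elif regions and regions[-1][0] is None:
            regions[-1][0] = script
            regions[-1][1] += ch
        else:
            regions.append([script, ch])
    return [(script, chunk) for script, chunk in regions]


def fix_flow(text):
    """Reverse each right-to-left region, leaving other regions untouched."""
    out = []
    for script, chunk in split_regions(text):
        if script in RTL_SCRIPTS:
            core = chunk.rstrip()  # keep trailing whitespace in place
            chunk = core[::-1] + chunk[len(core):]
        out.append(chunk)
    return "".join(out)
```

Calling fix_flow on "english " followed by a Hebrew word stored backwards gives two regions, LATIN and HEBREW, and only the second one is reversed. This is far from the real Unicode bidirectional algorithm: it ignores combining marks, digits inside right-to-left runs and mirrored punctuation, so it only covers the simple case discussed here.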

-- 
Пётр Керзум

