[poppler] poppler util pdftohtml
Peter A. Kerzum
kerzum at yandex-team.ru
Fri Sep 23 04:44:04 PDT 2011
On Friday 23 September 2011 15:12:28 Leonard Rosenthol wrote:
> On 9/23/11 6:38 AM, "Jonathan Kew" <jfkthame at googlemail.com> wrote:
> >Once you start dealing with whole paragraphs, multiple columns, table
> >cells, etc, etc, things only get worse.... you may get good results for a
> >limited class of documents (e.g. unidirectional LTR text, fairly simple
> >block layouts), but the general problem for arbitrary PDF documents is
> >MUCH harder.
>
> Agreed 100%!
>
> Which is why I WISH I could convince more PDF production tools to generate
> tagged/structured PDF!
That is very nice to hear from you =)
Actually, a consistent ToUnicode mapping should be a good compromise, since
higher-level software can segment text into regions of different languages
based solely on their alphabets, and then detect and correct the text flow for
each particular region.
This way the example
english WERBEH
should generally work, being decomposed into two regions with the latter
reversed.
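
To make the idea concrete, here is a rough sketch in Python of the segmentation
I have in mind. It is only an illustration, not anything poppler or pdftohtml
does: the helper names (script_of, split_regions, fix_flow) are made up, the
script is guessed from the first word of the Unicode character name, neutral
characters such as spaces are simply attached to the preceding region, and
every right-to-left region is assumed to have been extracted in visual order.

```python
import unicodedata

# Scripts treated as right-to-left in this sketch (an assumption; not exhaustive).
RTL_SCRIPTS = {"HEBREW", "ARABIC"}


def script_of(ch):
    """Guess the script of a letter from its Unicode name, e.g. 'LATIN' or 'HEBREW'.

    Returns None for non-letters (spaces, digits, punctuation).
    """
    if not ch.isalpha():
        return None
    return unicodedata.name(ch, "").split(" ", 1)[0]


def split_regions(text):
    """Split text into (script, chunk) regions based solely on the alphabet."""
    regions = []
    for ch in text:
        script = script_of(ch)
        if script is None and regions:
            # Neutral character: keep it with the region that precedes it.
            regions[-1][1].append(ch)
        elif regions and regions[-1][0] == script:
            regions[-1][1].append(ch)
        else:
            regions.append([script, [ch]])
    return [(script, "".join(chars)) for script, chars in regions]


def fix_flow(text):
    """Reverse each right-to-left region, leaving its trailing whitespace in place."""
    out = []
    for script, chunk in split_regions(text):
        if script in RTL_SCRIPTS:
            core = chunk.rstrip()
            out.append(core[::-1] + chunk[len(core):])
        else:
            out.append(chunk)
    return "".join(out)


# "english" followed by a Hebrew word stored in visual (reversed) order,
# written with escapes so the source itself is not affected by bidi rendering.
visual = "english \u05ea\u05d9\u05e8\u05d1\u05e2"
logical = fix_flow(visual)
```

With this input, split_regions yields a LATIN region and a HEBREW region, and
fix_flow reverses only the latter. Real documents would need much more care
(punctuation between regions, digits inside RTL text, mixed scripts within a
word), so take this as a starting point only.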
--
Пётр Керзум