[poppler] poppler util pdftohtml
Peter A. Kerzum
kerzum at yandex-team.ru
Fri Sep 23 05:37:52 PDT 2011
On Friday 23 September 2011 15:59:18 Jonathan Kew wrote:
> On 23 Sep 2011, at 12:44, Peter A. Kerzum wrote:
> > Actually consistent To-Unicode mapping should be a good compromise, as
> > higher level software can really segment text into regions of different
> > languages based solely on their alphabets and then detect and correct
> > text flow for each particular region
> >
> > This way the example
> >
> > english WERBEH
> >
> > should generally work, being decomposed into 2 regions with the latter
> > reversed
>
> But what is the order of those "2 regions"? You cannot tell unless you have
> some higher-level info... the purely visual presentation is inherently
> ambiguous.
Yes, that's an open question. But in most practical cases the regions should
correspond to paragraphs, so you can order them top to bottom, at least for
horizontally written scripts.
In the worst case you'll end up with words in the wrong order, which is still
better than letters in the wrong order within a word.
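
To make the idea concrete, here is a rough Python sketch of the kind of
post-processing I have in mind. It is not poppler code; the function names and
the (top_y, text) region format are made up for illustration. It splits a
visually ordered line into runs by Unicode bidi class, reverses the
right-to-left runs, and orders regions top to bottom. Runs are split at spaces
and other neutral characters, so a multi-word right-to-left region gets
correct letters within each word but may keep the wrong word order, which is
the worst case mentioned above.

```python
import unicodedata
from itertools import groupby


def is_rtl(ch: str) -> bool:
    # Bidi classes "R" (e.g. Hebrew) and "AL" (Arabic letters) are strongly RTL.
    return unicodedata.bidirectional(ch) in ("R", "AL")


def split_runs(line: str) -> list[tuple[bool, str]]:
    # Group consecutive characters that share the same direction flag.
    # Spaces, digits and punctuation are not strongly RTL, so they end an RTL run.
    return [(rtl, "".join(chars)) for rtl, chars in groupby(line, key=is_rtl)]


def visual_to_logical(line: str) -> str:
    # Reverse only the RTL runs; everything else is kept as extracted.
    return "".join(run[::-1] if rtl else run for rtl, run in split_runs(line))


def order_regions(regions: list[tuple[float, str]]) -> list[str]:
    # Assumption: each region is a hypothetical (top_y, text) pair with y
    # growing downwards, so sorting by top_y gives top-to-bottom order.
    ordered = sorted(regions, key=lambda region: region[0])
    return [visual_to_logical(text) for _, text in ordered]


# The "english WERBEH" example, with real Hebrew letters in visual order.
visual_line = "english " + "עברית"[::-1]
logical_line = visual_to_logical(visual_line)
```

With this, logical_line holds the Hebrew word in its logical (reading) order
after the English one. A real implementation would also have to deal with
combining marks, digits inside right-to-left text, and the region-ordering
ambiguity Jonathan points out.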
--
Peter Kerzum
Search Platform Development Group
St. Petersburg, tel. 8508