[poppler] poppler util pdftohtml

Peter A. Kerzum kerzum at yandex-team.ru
Fri Sep 23 05:37:52 PDT 2011


On Friday 23 September 2011 15:59:18 Jonathan Kew wrote:
> On 23 Sep 2011, at 12:44, Peter A. Kerzum wrote:
> > Actually consistent To-Unicode mapping should be a good compromise, as
> > higher level software can really segment text into regions of different
> > languages based solely on their alphabets and then detect and correct
> > text flow for each particular region
> > 
> > This way the example
> > 
> >   english WERBEH
> > 
> > should generally work, being decomposed into 2 regions with the latter
> > reversed
> 
> But what is the order of those "2 regions"? You cannot tell unless you have
> some higher-level info... the purely visual presentation is inherently
> ambiguous.

Yes, that is an open question. But in most practical cases regions should
correspond to paragraphs, so you can order them top to bottom, at least for
horizontally written alphabets.

In the worst case you'll end up with intermixed words, which is still better
than intermixed letters within a word.
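
A minimal sketch of the decomposition discussed above, assuming the text
arrives in visual order with a consistent To-Unicode mapping. It is not
poppler code: the function names (direction, split_regions, to_logical) are
hypothetical, and it only uses the bidi class from Python's standard
unicodedata module to tell alphabets apart. Neutral characters such as
spaces simply stay with the preceding region, and digits or punctuation
inside a right-to-left region are not handled specially.

```python
import unicodedata

# Strong right-to-left bidi classes (Hebrew is "R", Arabic is "AL").
RTL_CLASSES = {"R", "AL"}


def direction(ch):
    """Return 'rtl' or 'ltr' for strong characters, None for neutral ones."""
    bidi = unicodedata.bidirectional(ch)
    if bidi in RTL_CLASSES:
        return "rtl"
    if bidi == "L":
        return "ltr"
    return None


def split_regions(visual_text):
    """Split text into consecutive (direction, text) regions by alphabet."""
    regions = []  # each entry is a mutable [direction, text] pair
    for ch in visual_text:
        d = direction(ch)
        if regions and (d is None or d == regions[-1][0]):
            # Same direction, or a neutral character: extend the region.
            regions[-1][1] += ch
        elif regions and regions[-1][0] is None:
            # Leading neutrals adopt the first strong direction seen.
            regions[-1][0] = d
            regions[-1][1] += ch
        else:
            regions.append([d, ch])
    return [(d, t) for d, t in regions]


def to_logical(visual_text):
    """Reverse each right-to-left region; leave the others untouched."""
    return [(d, t[::-1] if d == "rtl" else t)
            for d, t in split_regions(visual_text)]


# "english" followed by a Hebrew word stored in visual (reversed) order.
# Escapes are used so the example does not depend on bidi rendering.
sample = "english \u05ea\u05d9\u05e8\u05d1\u05e2"
for region in to_logical(sample):
    print(region)
```

The relative order of the regions is left exactly as found, which is the
ambiguity Jonathan points out; only the letters inside each region are
corrected.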

-- 
Peter Kerzum
Search Platform Development Group
St. Petersburg, tel. 8508

