[poppler] line breaks and layout for multi-column texts ...

Albretch Mueller lbrtchx at gmail.com
Thu Feb 6 11:19:23 UTC 2020


On 2/5/20, Albert Astals Cid <aacid at kde.org> wrote:
> El dimecres, 5 de febrer de 2020, a les 12:20:10 CET, Albretch Mueller va
> escriure:
>>  pdftotext has the option
>>
>> -layout              : maintain original physical layout
>>
>>  but pdftohtml doesn't
>
> pdftotext and pdftohtml use different code/algorithms

 That explains it. Thank you. I thought I was missing something.

> you'd have to see if
> one can be adapted/improved for the other.

 Well, yes. Definitely the way to go. You will have to "go monkey" and
employ a bit of heuristics to make pdfto* dance well for you. If
you know that most documents will be of the multi-column kind:

 1) run pdftotext both with and without -layout
 2) do some line-by-line analysis of both results
 3) run pdftohtml
 4) do some line-by-line algorithmic consolidation of all three texts,
based on the results of steps 1 to 3

 that should do it!
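
 A rough sketch of steps 1 and 2 only, not the finished code. It assumes
pdftotext is on the PATH and that a run of three or more spaces in the
-layout output marks a column gap. That gap width and all the function
names are made up for illustration. It flags -layout lines that match no
plain line as a whole but whose gap-separated pieces each do, which is
what two columns printed side by side would look like:

```python
import re
import subprocess


def run_pdftotext(pdf_path, layout=False):
    """Return the pdftotext output for pdf_path as a list of lines."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # maintain original physical layout
    cmd += [pdf_path, "-"]     # "-" writes the text to stdout
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


def normalize(line):
    """Collapse whitespace so layout padding does not affect comparison."""
    return " ".join(line.split())


def split_columns(line, min_gap=3):
    """Split a line on runs of min_gap or more spaces.

    Assumption: such a run separates two columns. Tune min_gap per corpus.
    """
    parts = re.split(r" {%d,}" % min_gap, line.strip())
    return [normalize(p) for p in parts if p]


def side_by_side_lines(raw_lines, layout_lines):
    """Find -layout lines that look like two or more columns side by side.

    A line qualifies when it matches no raw line as a whole, but every
    one of its gap-separated pieces appears as a raw line.
    """
    raw = {normalize(l) for l in raw_lines if l.strip()}
    hits = []
    for line in layout_lines:
        if not line.strip() or normalize(line) in raw:
            continue  # blank, or identical in both outputs
        pieces = split_columns(line)
        if len(pieces) > 1 and all(p in raw for p in pieces):
            hits.append(pieces)
    return hits
```

 To try it, pass the two outputs for the same file, e.g.
side_by_side_lines(run_pdftotext(f), run_pdftotext(f, layout=True)) for
some PDF path f. A high share of hits would suggest a multi-column page.
The pdftohtml pass and the final consolidation are not covered here.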

 I will post the link to the code here once I am done with it

 lbrtchx

