[poppler] Pdf TextBlock order

Johnny Mariéthoz Johnny.Mariethoz at rero.ch
Wed Dec 1 08:58:54 PST 2010


Hello,

First of all, many thanks for your excellent work.

I want to extract the text from a document or a PDF page. The text order should be the one a reader would follow.
This task becomes difficult for multi-column documents and for tables. As I want to format the paragraphs, I cannot
use makeWordList. Instead, I would go through TextFlow, TextBlock, lines, and words. But I cannot obtain
the right order for a complex document such as:

http://doc.rero.ch/lm.php?url=1000,43,2,20101130144841-EO/mue_dmc.pdf

Do you have any strategies for re-ordering the blocks? Does the file contain information about
the right sequence? As acroread, evince, and Apple Preview behave differently, I conclude
that it is not trivial. Am I right?
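
To make the question concrete, here is a minimal sketch of one possible re-ordering strategy: a column-first heuristic on block bounding boxes. It is only an illustration and not what poppler does. The Block class and the reading_order function are hypothetical names, and the sketch assumes that y grows downward and that the bounding boxes have already been read from the text blocks.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """Hypothetical stand-in for a text block: bounding box plus its text."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    text: str


def reading_order(blocks: List[Block]) -> List[Block]:
    """Group blocks into columns by horizontal overlap, then read each
    column top to bottom, taking the columns from left to right."""
    columns: List[List[Block]] = []
    # Visiting blocks by increasing x_min creates the columns left to right.
    for block in sorted(blocks, key=lambda b: b.x_min):
        for column in columns:
            column_right = max(b.x_max for b in column)
            # The block starts before the column ends: same column.
            if block.x_min < column_right:
                column.append(block)
                break
        else:
            columns.append([block])

    ordered: List[Block] = []
    for column in columns:
        # Assumption: y grows downward, so a smaller y_min is higher on the page.
        ordered.extend(sorted(column, key=lambda b: b.y_min))
    return ordered


# Two columns of two blocks each, given in scrambled order.
page = [
    Block(120, 0, 220, 50, "C"),
    Block(0, 0, 100, 50, "A"),
    Block(120, 60, 220, 110, "D"),
    Block(0, 60, 100, 110, "B"),
]
ordered_texts = [b.text for b in reading_order(page)]
```

This handles plain side-by-side columns only. A full-width heading or a table overlaps every column horizontally, so all blocks collapse into a single column and the result degrades to a plain top-to-bottom sort, which is exactly the kind of case I am asking about.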

Many thanks in advance.

----------------------------------------------------------------------
Johnny Mariéthoz
RERO, Av. de la Gare 45, CH - 1920 MARTIGNY
Phone   : +41(0)27 721 8579
Fax     : +41(0)27 721 8586
Web     : http://www.rero.ch
ReroDoc : http://doc.rero.ch, doc.support at rero.ch
----------------------------------------------------------------------



