[poppler] Pdf TextBlock order
Johnny Mariéthoz
Johnny.Mariethoz at rero.ch
Wed Dec 1 08:58:54 PST 2010
Hello,
First of all, many thanks for your excellent work.
I want to extract the text from a document or a PDF page. The text order should be the one a reader would follow.
This task becomes difficult for multi-column documents and for tables. As I want to format the paragraphs, I cannot
use makeWordList. I would instead go through TextFlow, TextBlock, lines and words, but I cannot obtain
the right order for a complex document such as:
http://doc.rero.ch/lm.php?url=1000,43,2,20101130144841-EO/mue_dmc.pdf
Do you have any strategies for re-ordering the blocks? Does the file contain information about
the right sequence? As acroread, evince, and Apple Preview behave differently, I conclude
that it is not trivial. Am I right?
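
To make the question concrete, here is a minimal sketch of the kind of re-ordering heuristic I have in mind. It is not poppler code: the Block class, its bounding-box fields, and the assumption that y grows downward are all hypothetical stand-ins for whatever the text blocks expose. Blocks that overlap horizontally are grouped into a column, columns are read left to right, and blocks within a column are read top to bottom.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    # Hypothetical stand-in for a text block: bounding box plus text.
    # Assumes y grows downward (origin at the top-left of the page).
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    text: str


def reading_order(blocks: List[Block]) -> List[Block]:
    """Order blocks column by column, top to bottom within each column."""
    columns: List[List[Block]] = []
    for b in sorted(blocks, key=lambda blk: blk.x_min):
        for col in columns:
            col_x_min = min(c.x_min for c in col)
            col_x_max = max(c.x_max for c in col)
            # Horizontal overlap with the column's extent: same column.
            if b.x_min < col_x_max and b.x_max > col_x_min:
                col.append(b)
                break
        else:
            # No existing column overlaps this block: start a new one.
            columns.append([b])
    # Columns left to right, blocks top to bottom inside each column.
    columns.sort(key=lambda col: min(c.x_min for c in col))
    return [b for col in columns for b in sorted(col, key=lambda blk: blk.y_min)]
```

A block that spans the full page width (a title or a table) would pull every column into one group, which is exactly the kind of case where such a simple heuristic breaks down.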
Many thanks in advance.
----------------------------------------------------------------------
Johnny Mariéthoz
RERO, Av. de la Gare 45, CH - 1920 MARTIGNY
Téléphone: +41(0)27 721 8579
Fax : +41(0)27 721 8586
Web : http://www.rero.ch
ReroDoc : http://doc.rero.ch, doc.support at rero.ch
----------------------------------------------------------------------