[poppler] Extra spaces in text when using Poppler pdftotext

Runar Buvik runarb at gmail.com
Mon May 27 05:34:39 PDT 2013


Hi

I am using your pdftotext program to extract text from a large number
of PDF files. Unfortunately, some words get extra spaces between the
characters. For example, in one PDF file the word “Wasserberg” appears
as “W a s s e r b e r g”.

However, if I copy and paste this text from Acrobat Reader, the text is
more correctly formatted.

The following image shows this better:
http://bbh-001.boitho.com/div/pdf_space_bug/Text.png (notice how the
yellow text has extra spaces in it in the Poppler version).

Any thoughts on how I can extract more correct text?



An example PDF that gets converted like this is available at
http://bbh-001.boitho.com/div/pdf_space_bug/example.pdf . It isn’t
perfect because we had to do some manual modification to remove
confidential information, but it shows the symptoms correctly.

The text comes from a scan that was then OCRed using ABBYY. I am using
Poppler 0.22.4.
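
One possible stopgap, independent of Poppler itself, is to post-process
the pdftotext output and collapse runs of single letters separated by
single spaces. The sketch below is only a heuristic under that
assumption: the function name collapse_letter_spacing and the threshold
of three letters are hypothetical choices, not anything pdftotext
provides. It cannot tell where one letter-spaced word ends and the next
begins if they are separated by the same single space.

```python
import re

# Matches three or more single letters separated by single spaces,
# e.g. "W a s s e r b e r g". The run must be delimited by whitespace
# or by the start/end of the string on both sides.
_SPACED_RUN = re.compile(r"(?<!\S)(?:[^\W\d_] ){2,}[^\W\d_](?!\S)")


def collapse_letter_spacing(text: str) -> str:
    """Remove the spaces inside letter-spaced runs (heuristic only)."""
    return _SPACED_RUN.sub(lambda m: m.group(0).replace(" ", ""), text)


if __name__ == "__main__":
    sample = "The word W a s s e r b e r g appears here"
    print(collapse_letter_spacing(sample))
```

Runs of only two single letters are left alone, to reduce the risk of
merging legitimate one-letter words.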


Best regards
Runar Buvik
CTO Searchdaimon As
http://www.searchdaimon.com/

