[poppler] Extra spaces in text when using Poppler pdftotext
Runar Buvik
runarb at gmail.com
Mon May 27 05:34:39 PDT 2013
Hi
I am using your pdftotext program to extract text from a large number
of PDF files. Unfortunately some words get extra spaces between the
characters. For example, in one PDF file the word “Wasserberg” appears
as “W a s s e r b e r g”.
However, if I cut and paste this text from Acrobat Reader, the text is
formatted more correctly.
The following image explains this better:
http://bbh-001.boitho.com/div/pdf_space_bug/Text.png (notice how the
yellow text has extra spaces in it in the Poppler version).
Any thoughts on how I can extract more correct text?
An example PDF that gets converted like this is available at
http://bbh-001.boitho.com/div/pdf_space_bug/example.pdf . It isn’t
perfect because we had to make some manual modifications to remove
confidential information, but it shows the symptoms correctly.
The text comes from a scan that was then OCRed using ABBYY. I am using
Poppler 0.22.4.
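For reference, the symptom can be spotted (and crudely patched) in the
extracted text itself. The following is only a rough Python sketch of a
post-processing workaround, not a fix in Poppler: the function name and
the minimum run length of three letters are hypothetical choices, and it
will also merge genuine sequences of single-letter words.

```python
import re

# A run of three or more single letters, each separated by exactly one
# space, e.g. "W a s s e r b e r g". The minimum of three is an arbitrary
# assumption; digits and underscores are deliberately excluded.
_LETTER = r"[^\W\d_]"
_SPACED_RUN = re.compile(rf"(?<!\w)(?:{_LETTER} ){{2,}}{_LETTER}(?!\w)")


def collapse_spaced_letters(text: str) -> str:
    """Join runs of space-separated single letters into one word."""
    return _SPACED_RUN.sub(lambda m: m.group(0).replace(" ", ""), text)


if __name__ == "__main__":
    print(collapse_spaced_letters("the word W a s s e r b e r g appears"))
```

This only hides the symptom after extraction; it cannot tell where one
spaced-out word ends and the next begins if two of them are adjacent.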
Best regards
Runar Buvik
CTO Searchdaimon As
http://www.searchdaimon.com/