[poppler] Extra spaces in text when using Poppler pdftotext
Runar Buvik
runarb at gmail.com
Wed May 29 07:38:22 PDT 2013
Thank you for your answer, William. I have done some experimenting as
you suggested, using both the original document and the example file I
posted earlier. I found that most of the words become correctly
formatted if I increase maxCharSpacing from 0.03 to 0.25. Below 0.25
many words get extra spaces, and above it many words get blended
together without spaces between them.
This looks like a very large increase to me. From 0.03 to 0.25 is
almost an order of magnitude. Do you guys think this will interfere
when converting other PDF files? I need to extract text from hundreds
of thousands of PDF files, so I need something that works well on all
of them. I can't manually check each file for this spacing issue and
switch to a different converter when it occurs...
I noticed that there is some code for printing debugging info in the
TextOutputDev.cc file. When studying the debugging info, it looks to me
like some words are treated as whole words, and others as a sequence of
characters occurring one after another. For example, the word "Jeanne" is
a single word, but the word "Frau" is represented as 4 separate
characters. Isn't there a way to change this? Why can't one just say
that all textual data that ends with a space or newline is treated as
a word?
Example of debug text:
*** initial words ***
word: x=47.04..74.58 y=120.66..130.86 base=128.16 fontSize=9.00
rot=0 link=(nil) '12345'
word: x=81.60..86.89 y=120.66..130.86 base=128.16 fontSize=9.00
rot=0 link=(nil) '-'
word: x=93.12..97.50 y=120.66..130.86 base=128.16 fontSize=9.00
rot=0 link=(nil) 'F'
word: x=98.75..103.12 y=120.66..130.86 base=128.16 fontSize=9.00
rot=0 link=(nil) 'r'
word: x=104.37..108.75 y=120.66..130.86 base=128.16 fontSize=9.00
rot=0 link=(nil) 'a'
word: x=110.00..114.37 y=120.66..130.86 base=128.16 fontSize=9.00
rot=0 link=(nil) 'u'
word: x=122.16..154.41 y=120.66..130.86 base=128.16 fontSize=9.00
rot=0 link=(nil) 'Jeanne'
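
To make the numbers above concrete, here is a small Python sketch that
computes the horizontal gap between consecutive boxes as a fraction of
the font size and merges neighbours whose gap is within a threshold.
It is only a simplified model: it assumes maxCharSpacing is compared
against the gap divided by the font size, and the helper names
(gap_ratios, join_words) are made up for illustration. It is not
Poppler's actual code in TextOutputDev.cc.

```python
# Boxes copied from the debug output above: (x_min, x_max, text).
FONT_SIZE = 9.00
boxes = [
    (47.04, 74.58, "12345"),
    (81.60, 86.89, "-"),
    (93.12, 97.50, "F"),
    (98.75, 103.12, "r"),
    (104.37, 108.75, "a"),
    (110.00, 114.37, "u"),
    (122.16, 154.41, "Jeanne"),
]


def gap_ratios(boxes, font_size):
    """Gap between consecutive boxes, as a fraction of the font size."""
    return [(b[0] - a[1]) / font_size for a, b in zip(boxes, boxes[1:])]


def join_words(boxes, font_size, max_char_spacing):
    """Merge neighbours whose gap is at most max_char_spacing * font_size.

    Hypothetical helper: a simplified model, not Poppler's algorithm.
    """
    words = [boxes[0][2]]
    for (_, prev_end, _), (start, _, text) in zip(boxes, boxes[1:]):
        if start - prev_end <= max_char_spacing * font_size:
            words[-1] += text   # small gap: same word
        else:
            words.append(text)  # large gap: start a new word
    return words


print([round(r, 2) for r in gap_ratios(boxes, FONT_SIZE)])
print(join_words(boxes, FONT_SIZE, 0.03))
print(join_words(boxes, FONT_SIZE, 0.25))
```

In this sample the gaps inside "Frau" are about 0.14 of the font size,
while the gaps between separate words are 0.69 or more. Under the
simplified model, 0.03 keeps F, r, a, u apart and 0.25 joins them into
"Frau" without merging it into its neighbours.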
Best regards
Runar Buvik
CTO Searchdaimon As
+47 93 03 06 27
http://www.searchdaimon.com