[poppler] Extra spaces in text when using Poppler pdftotext

Wed May 29 09:13:50 PDT 2013

On 5/29/13, Runar Buvik <runarb at gmail.com> wrote:
>
> This looks like a very large increase to me. From 0.03 to 0.25 is
> almost an order of magnitude. Do you guys think this will interfere
> when converting other PDF files?

The crux of your problem is that there is no way to find the real
width of a character, let alone the width of a space, since that
depends on the producer and how it handles kerning etc. And as you
might have guessed, PDFs do not contain whitespace: the letters and
words are simply painted at some distance from each other. pdfto*,
with its heuristics and carefully chosen constants, merely reflects
the most common scenarios, which cover the majority of documents,
but it might fail in niche cases.

Another point: your document uses a rather wide monospace font. The
default parameters are tuned for non-monospace fonts.

Also, keep in mind the opposite problem: if the constant is too high,
then the space between words would be treated as space between
characters and the words would be merged together. I find that problem
to be even harder to reverse.
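
To make the trade-off concrete, here is a minimal sketch of a gap
heuristic. It is NOT Poppler's actual code: the function names, the
glyph layout and the meaning of the ratio are assumptions made up for
illustration. It only shows why a single constant cannot suit both a
wide monospace font and a proportional one.

```python
from typing import List, Tuple

# (character, x_start, x_end) in points; a made-up representation.
Glyph = Tuple[str, float, float]


def place(words: List[str], glyph_width: float,
          char_gap: float, word_gap: float) -> List[Glyph]:
    """Lay out glyph boxes on one line, the way a producer might paint them.

    No space glyph is emitted: words are separated only by a larger gap.
    """
    glyphs = []
    x = 0.0
    for word in words:
        for ch in word:
            glyphs.append((ch, x, x + glyph_width))
            x += glyph_width + char_gap
        # Widen the gap after the last glyph of the word to word_gap.
        x += word_gap - char_gap
    return glyphs


def join_glyphs(glyphs: List[Glyph], font_size: float,
                min_gap_ratio: float) -> str:
    """Emit a space wherever the gap exceeds min_gap_ratio * font_size."""
    out = []
    prev_end = None
    for ch, x0, x1 in glyphs:
        if prev_end is not None and (x0 - prev_end) > min_gap_ratio * font_size:
            out.append(" ")
        out.append(ch)
        prev_end = x1
    return "".join(out)


# Hypothetical wide monospace font: 10pt, glyph boxes 5pt wide,
# 1pt between characters, 7pt between words.
GLYPHS = place(["Frau", "Jeanne"], glyph_width=5.0, char_gap=1.0, word_gap=7.0)

print(join_glyphs(GLYPHS, 10.0, 0.03))  # too low:  F r a u J e a n n e
print(join_glyphs(GLYPHS, 10.0, 0.25))  # fits:     Frau Jeanne
print(join_glyphs(GLYPHS, 10.0, 1.00))  # too high: FrauJeanne
```

Note that in the "too low" case the word boundary is lost as well,
since every gap becomes a single space.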

> I need to extract text from hundreds
> of thousands of pdf files, so I need something that works well on all.
> Can't manually look for this space issue and use a different converter
> then...
>

There is no 100% reliable way to extract information from a PDF.
That's actually why some people use PDF: to make it hard to extract
information from the documents.

>
> I noticed that there is some code for printing debugging info in the
> TextOutputDev.cc file. When studying the debugging info it looks to me
> that some words are treated as whole words, others as a set of
> characters occurring one after another. For example the word "Jeanne" is
> a word, but the word "Frau" is represented as 4 characters. Isn't
> there a way to change this? Why can't one just say that all textual
> data that ends with a space or newline be treated as a word?
>

PDF is a vector graphics format. There is no such thing as a "word"
there. There are only functions to paint a string of 1 or more
characters at a given page offset with a given font. You get the idea.
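
As an illustration, here is a hand-written content stream fragment
and a few lines of Python that list the strings it paints. The
fragment is hypothetical (it is not taken from your document), and
the regex is a deliberate simplification: real streams are usually
compressed and use font-specific encodings, escapes and the TJ array
form, none of which this handles.

```python
import re

# Hypothetical, uncompressed content stream fragment.
# Tf selects font and size, Td moves the text position, Tj paints a string.
STREAM = """
BT
/F1 12 Tf
72 700 Td
(Jeanne) Tj
0 -14 Td
(F) Tj
7.2 0 Td
(r) Tj
7.2 0 Td
(a) Tj
7.2 0 Td
(u) Tj
ET
"""

# Matches a simple literal string operand followed by the Tj operator.
SHOW_TEXT = re.compile(r"\(([^()\\]*)\)\s*Tj")


def painted_strings(stream: str) -> list:
    """Return the literal strings painted by Tj, in stream order."""
    return SHOW_TEXT.findall(stream)


print(painted_strings(STREAM))  # ['Jeanne', 'F', 'r', 'a', 'u']
```

Both "Jeanne" and "Frau" look the same on the page; whether a word
arrives as one string or four is entirely up to the producer.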

The best course of action I can recommend is to go to the source of
the PDFs and demand the information in textual/WinWord/etc. form, not
PDF.

If that is not an option, then you can try to use "pdftohtml -xml" and
then manually extract the words from the output XML. That's what I
did, since additionally I had to de-hyphenate and detect paragraphs
and page breaks (and run the result through the spell checker).
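
A rough sketch of that approach in Python, not my actual scripts.
The sample below is hand-written to resemble what "pdftohtml -xml"
emits (page elements containing text elements with top/left
attributes); treat the exact element and attribute names as
assumptions and check them against your own output. The
de-hyphenation is naive: it also joins genuinely hyphenated words
split across lines, which is one reason a spell-checker pass helps.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; real output comes from something like
#   pdftohtml -xml input.pdf output
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="120" left="90" width="280" height="16" font="0">ment to <b>plain</b> text.</text>
<text top="100" left="90" width="300" height="16" font="0">Frau Jeanne converts a docu-</text>
</page>
</pdf2xml>
"""


def page_lines(xml_text: str) -> list:
    """Return the text fragments of every page, top to bottom, left to right."""
    root = ET.fromstring(xml_text)
    lines = []
    for page in root.iter("page"):
        texts = sorted(
            page.iter("text"),
            key=lambda t: (int(t.get("top")), int(t.get("left"))),
        )
        # itertext() also collects text nested inside <b>, <i>, <a>.
        lines.extend("".join(t.itertext()) for t in texts)
    return lines


def dehyphenate(lines: list) -> str:
    """Join lines into one string, gluing words split by a trailing hyphen."""
    out = ""
    for line in lines:
        line = line.strip()
        if out.endswith("-"):
            out = out[:-1] + line
        elif out:
            out += " " + line
        else:
            out = line
    return out


text = dehyphenate(page_lines(SAMPLE))
words = text.split()
print(text)  # Frau Jeanne converts a document to plain text.
```

Splitting on whitespace only works here because each text element
already holds a whole line; positions (top/left/width) are what you
would use to detect paragraphs and page breaks.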

-- 
Don't walk behind me, I may not lead.
Don't walk in front of me, I may not follow.
Just walk beside me and be my friend.
    -- Albert Camus (attributed to)