[poppler] Extra spaces in text when using Poppler pdftotext

Leonard Rosenthol lrosenth at adobe.com
Wed May 29 09:28:12 PDT 2013


On 5/29/13 12:13 PM, "Ihar `Philips` Filipau" <thephilips at gmail.com> wrote:
>There is no 100% reliable way to extract information from PDF.

This is a MUCH TOO COMMON "bubba meisa"
(<http://en.wiktionary.org/wiki/bubbe-meise>) about PDF.


>PDF is a vector graphics format. There is no such thing as "word"
>there. There are only functions to paint a string of 1 or more
>characters at given page offset with given font. You get the idea.

This is simply NOT TRUE about the PDF file format.  PDF supports a very
rich semantic layer called "Tagged PDF" that has been part of the language
for over a decade now (since PDF 1.4, published in 2001).

However, it is true that many PDFs are created without this semantic
richness, which leads to difficulties in extraction.  And in that case, as
you recommend, the original source is probably best to work with since the
semantics are still present.
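
One rough way to tell which case you are in before attempting extraction is
to look at the "Tagged:" line that poppler's pdfinfo prints. The sketch
below is only an illustration: the helper names parse_tagged and is_tagged
are made up for this example, it assumes pdfinfo is on the PATH, and it
assumes the output carries a line of the form "Tagged: yes" or "Tagged: no".

```python
import subprocess


def parse_tagged(pdfinfo_output):
    """Return True/False from the 'Tagged:' line, or None if no such line exists.

    Hypothetical helper: it only inspects text in the 'Key: value' layout
    that pdfinfo is assumed to print.
    """
    for line in pdfinfo_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Tagged":
            return value.strip().lower() == "yes"
    return None


def is_tagged(path):
    """Run pdfinfo on `path` (assumed to be installed) and parse its output."""
    result = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    )
    return parse_tagged(result.stdout)


if __name__ == "__main__":
    import sys

    # Usage: python check_tagged.py file.pdf
    print(is_tagged(sys.argv[1]))
```

A "yes" here only says the file declares itself as tagged; it says nothing
about how good the tagging is, so extraction results still need checking.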


Leonard



