[poppler] text extraction in raw order + text attributes

Mon Dec 9 03:05:47 PST 2013

On 12/07/13 12:43, Carlos Garcia Campos wrote:
> Richard Wossal <richard at r-wos.org> writes:
>
>> Hi!
>>
>> I'm trying to use poppler to extract text from PDFs, and I've found
>> empirically that using the "raw order" option gives better results (I can
>> supply example files where non-raw order returns mangled text, if needed).
> Yes, please; it would help to see some of those examples.
Here are some samples:

If you save the following Google Doc as a PDF (File->Download as), the
default reading order mangles the body text, while -raw extracts it cleanly:
https://docs.google.com/document/d/1U6SsDnTIce3IH-GhdKpx_uStQQSzSCsACoPkvmZtqTc/edit?usp=sharing

$ pdftotext -v
pdftotext version 0.18.4
Copyright 2005-2011 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2004 Glyph & Cog, LLC
$ pdftotext ~/Downloads/sample.pdf - | head
This is a title
This is a subtitle

T iin r l x
h oma t t
ss

e
This is underlined text

$ pdftotext -raw ~/Downloads/sample.pdf - | head
This is a title
This is a subtitle
This is normal text
This is underlined text
This is a Heading
Here’s some nonascii stuff: öäüß§

Similar effects can be observed for the title page of
http://www.farmworkerjustice.org/sites/default/files/documents/7.2.a.6%20fwj.pdf

Looking at it more closely now, it appears that non-raw reading order
sometimes gives the better result, as with
http://win.niddk.nih.gov/publications/pdfs/teenblackwhite3.pdf

$ pdftotext 'pdfs/teenblackwhite3.pdf' - | head
A Guide for
Teenagers!

Take

C h a rg e
of

Your

$ pdftotext -raw 'pdfs/teenblackwhite3.pdf' - | head
TakeTake
Charge
o f
Your Health!
A Guide for
Teenagers!

A GuideT fe oen r
TakeTake
agers!
Charge

(To give a sense of the magnitude: the last two are from a random
sample of 100 PDFs my users threw at me; the Google Doc I wrote
myself as a test case. So it's not exactly a huge problem.)
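
For anyone who wants to triage a pile of PDFs the same way, here is a rough
sketch of how the two outputs could be compared automatically. It assumes
that a high share of single-character tokens signals mangled text, which is
only a heuristic: it cannot tell which output is right when both are
damaged. The function names are made up for illustration, and the sample
strings are copied from the Google Doc output above.

```python
def single_char_ratio(text: str) -> float:
    """Share of whitespace-separated tokens that are exactly one character long."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(1 for tok in tokens if len(tok) == 1) / len(tokens)


def less_fragmented(default_text: str, raw_text: str) -> str:
    """Return "raw" if the raw-order text looks less fragmented, else "default"."""
    if single_char_ratio(raw_text) < single_char_ratio(default_text):
        return "raw"
    return "default"


# Output of `pdftotext sample.pdf -` (default order), as quoted above.
DEFAULT_SAMPLE = (
    "This is a title\nThis is a subtitle\n\n"
    "T iin r l x\nh oma t t\nss\n\ne\n"
    "This is underlined text\n"
)

# First lines of `pdftotext -raw sample.pdf -`, as quoted above.
RAW_SAMPLE = (
    "This is a title\nThis is a subtitle\nThis is normal text\n"
    "This is underlined text\nThis is a Heading\n"
)

if __name__ == "__main__":
    print(less_fragmented(DEFAULT_SAMPLE, RAW_SAMPLE))
```

The score is best treated as a hint for which files to look at by hand, not
as a verdict.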

>> As far as I can see, I could either:
>>
>> * hack something so I can extract text in raw-order using the Glib-bindings
>>     (I'd prefer staying C-only, but I don't see how this would be possible,
>>      except by adding it to the bindings)
>>
>> * or re-implement poppler_page_get_text_attributes in C++, using poppler's
>>     private API (or take poppler's implementation)
>>
>> What do you think would be the best way to go about that?
> If you really need to get the text in raw order we can add new methods in
> the API for that. I'm thinking that maybe we could add a more generic
> text iteration API with options like area, order and even the break
> iterator (so that you can iter over characters, lines and words).
Being able to iterate over basically some kind of AST of the PDF
(say, chars+attributes) would be pretty nice indeed.
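
To make "chars+attributes" concrete, here is a small sketch of the kind of
iteration I have in mind. It does not call poppler: AttrRun is a
hypothetical stand-in loosely modelled on the fields of
PopplerTextAttributes, and I am assuming that the indices are character
offsets into the page text and that end_index is inclusive.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple


@dataclass
class AttrRun:
    """Hypothetical attribute run; not a poppler type."""
    start_index: int      # first character the run applies to
    end_index: int        # last character the run applies to (assumed inclusive)
    font_name: str
    font_size: float
    is_underlined: bool


def iter_chars_with_attrs(
    text: str, runs: List[AttrRun]
) -> Iterator[Tuple[str, Optional[AttrRun]]]:
    """Yield (character, run) pairs; run is None where no run covers the character."""
    ordered = sorted(runs, key=lambda r: r.start_index)
    pos = 0
    for i, ch in enumerate(text):
        # Skip runs that end before the current character.
        while pos < len(ordered) and ordered[pos].end_index < i:
            pos += 1
        covered = pos < len(ordered) and ordered[pos].start_index <= i
        yield ch, (ordered[pos] if covered else None)


if __name__ == "__main__":
    text = "plain under"
    runs = [
        AttrRun(0, 5, "Arial", 11.0, False),   # "plain "
        AttrRun(6, 10, "Arial", 11.0, True),   # "under"
    ]
    underlined = "".join(
        ch for ch, run in iter_chars_with_attrs(text, runs)
        if run is not None and run.is_underlined
    )
    print(underlined)
```

A generic iteration API with a selectable order (raw or layout) could hand
out pairs like these directly, so callers would not have to line up text
and attribute runs themselves.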

For myself, I've decided to go ahead with poppler-glib's
page_get_text_* for now. The failure rate is low enough for my
application. I was initially stumped that my simple google-doc test
case wouldn't parse correctly, but it doesn't seem to be such a big
problem with PDFs in the wild.

Thanks!

Richard