[poppler] text extraction in raw order + text attributes

Fri Dec 6 08:51:24 PST 2013

Hi!

I'm trying to use poppler to extract text from PDFs, and I've found
empirically that using the "raw order" option gives better results (I can
supply example files where non-raw order returns mangled text, if needed).

This option is only exposed by the C++ bindings, not the GLib ones.
I could use either binding, but I also need something like poppler-glib's
"poppler_page_get_text_attributes", which the C++ bindings don't seem to
offer.

As far as I can see, I could either:

* hack something so I can extract text in raw order using the GLib bindings
   (I'd prefer staying C-only, but I don't see how this would be possible,
    except by adding the option to the bindings)

* or re-implement poppler_page_get_text_attributes in C++, using poppler's
   private API (or take poppler's implementation)

Which of these do you think would be the best way to go about it?

Thanks!

Richard

PS:

My use case, in case there's an even better way to do this: I'm trying to
heuristically extract titles and authors from PDFs without usable metadata.
The backend has a bunch of rules like "the thing with the biggest font size
is probably the title". This works surprisingly well - except, obviously,
for said PDFs where poppler_page_get_text only returns garbage.
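
To make the "biggest font size" rule concrete, here is a minimal sketch in
plain C. It does not call poppler at all: TextSpan is a hypothetical record
whose fields are modelled on the font_size, start_index and end_index members
of poppler-glib's PopplerTextAttributes, and guess_title_span is a
hypothetical helper name. Breaking ties in favour of the earliest span is an
assumption of this sketch, not something the backend is known to do.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical span record: one run of text with a uniform font size.
 * The fields are modelled on PopplerTextAttributes, but this is not a
 * poppler type. */
typedef struct {
    double font_size;   /* font size of the run */
    int start_index;    /* where the run starts in the page text */
    int end_index;      /* where the run ends in the page text */
} TextSpan;

/* Return the index of the span with the largest font size, or -1 if
 * there are no spans. Ties go to the earliest span (an assumption:
 * the title tends to come before other text of the same size). */
int guess_title_span(const TextSpan *spans, size_t n)
{
    assert(spans != NULL || n == 0);

    int best = -1;
    for (size_t i = 0; i < n; i++) {
        /* Strict comparison keeps the first span among equal sizes. */
        if (best < 0 || spans[i].font_size > spans[best].font_size)
            best = (int)i;
    }
    return best;
}
```

With real data, the spans would be filled from the list returned by
poppler_page_get_text_attributes, and the winning span's start and end
indices would then be used to cut the candidate title out of the page text.
That last step only works if the page text itself is not mangled, which is
why the extraction order matters here.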