[poppler] text extraction in raw order + text attributes

Carlos Garcia Campos carlosgc at gnome.org
Sat Dec 7 03:43:24 PST 2013

Richard Wossal <richard at r-wos.org> writes:

> Hi!
> I'm trying to use poppler to extract text from PDFs, and I've found
> empirically that using the "raw order" option gives better results (I can
> supply example files where non-raw order returns mangled text, if needed).

Yes, please. It would help to see some of those examples.

> This option is only exposed for the C++ bindings, not the Glib ones.
> I could use either binding, but I also need something like poppler-glib's
> "poppler_page_get_text_attributes".

poppler_page_get_text, poppler_page_get_text_layout and
poppler_page_get_text_attributes return the text in reading order, using
heuristics to follow columns and tables. It's not perfect, of course,
since it's based on heuristics.

> As far as I can see, I could either:
> * hack something so I can extract text in raw-order using the Glib-bindings
>    (I'd prefer staying C-only, but I don't see how this would be possible,
>     except by adding it to the bindings)
> * or re-implement poppler_page_get_text_attributes in C++, using poppler's
>    private API (or take poppler's implementation)
> What do you think would be the best way to go about that?

If you really need to get the text in raw order, we can add new methods
to the API for that. I'm thinking that maybe we could add a more generic
text iteration API with options like area, order and even the break
iterator (so that you can iterate over characters, words and lines).
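
To make the break iterator idea a bit more concrete, here is a rough
sketch in plain C. Everything in it is hypothetical: TextIter, BreakKind
and the text_iter_* functions are made-up names, not existing or planned
poppler API, and it walks an in-memory ASCII string rather than a PDF
page. The area and order options are left out; it only shows how one
iterator could serve characters, words and lines.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical granularity option; not part of poppler. */
typedef enum { BREAK_CHAR, BREAK_WORD, BREAK_LINE } BreakKind;

/* Hypothetical iterator state over a NUL-terminated ASCII string. */
typedef struct {
    const char *text;  /* text being iterated, not owned */
    size_t pos;        /* offset of the next unread character */
    BreakKind kind;    /* how the text is split into segments */
} TextIter;

static void text_iter_init(TextIter *it, const char *text, BreakKind kind)
{
    it->text = text;
    it->pos = 0;
    it->kind = kind;
}

/* Fetch the next segment. Returns 1 and sets *start and *len on success,
 * or 0 when the text is exhausted. Words are split on spaces, tabs and
 * newlines; lines are split on newlines, and empty lines are skipped. */
static int text_iter_next(TextIter *it, const char **start, size_t *len)
{
    const char *p = it->text + it->pos;
    const char *delims = (it->kind == BREAK_LINE) ? "\n" : " \t\n";

    if (it->kind == BREAK_CHAR) {
        if (*p == '\0')
            return 0;
        *start = p;
        *len = 1;
        it->pos += 1;
        return 1;
    }

    p += strspn(p, delims);            /* skip leading separators */
    if (*p == '\0')
        return 0;
    *start = p;
    *len = strcspn(p, delims);         /* segment runs to the next separator */
    it->pos = (size_t)(p - it->text) + *len;
    return 1;
}
```

A caller would initialise the iterator once and then loop on
text_iter_next() until it returns 0, switching only the BreakKind value
to change granularity.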

> Thanks!
> Richard
> PS:
> My use case, in case there's an even better way to do that: I'm trying to
> heuristically extract titles and authors of PDFs without usable metadata.
> The backend has a bunch of rules like "the thing with the biggest font
> size is probably the title". This works surprisingly well - except for
> said PDFs where poppler_page_get_text only returns garbage, obviously.

What exactly do you mean by garbage?
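
For reference, the "biggest font size is probably the title" rule from
the PS can be sketched in a few lines of C. This is only an illustration
under some assumptions: TextRun is a hypothetical stand-in whose fields
echo PopplerTextAttributes (start_index, end_index, font_size), the
indices are treated as inclusive offsets into ASCII text, and none of
the function names come from poppler.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one attribute run: a range of characters in
 * the page text that shares a font size. Not a poppler type. */
typedef struct {
    int start_index;   /* first character of the run (inclusive) */
    int end_index;     /* last character of the run (assumed inclusive) */
    double font_size;  /* font size of the run */
} TextRun;

/* Return the index of the run with the largest font size, or -1 if
 * there are no runs. On a tie the earliest run wins. */
static int largest_font_run(const TextRun *runs, size_t n_runs)
{
    int best = -1;
    for (size_t i = 0; i < n_runs; i++) {
        if (best < 0 || runs[i].font_size > runs[best].font_size)
            best = (int)i;
    }
    return best;
}

/* Copy the text covered by a run into out, truncating if out is too
 * small. Assumes one byte per character (ASCII only). */
static void copy_run_text(const char *text, const TextRun *run,
                          char *out, size_t out_size)
{
    assert(out_size > 0);
    assert(run->end_index >= run->start_index);

    size_t len = (size_t)(run->end_index - run->start_index + 1);
    if (len >= out_size)
        len = out_size - 1;
    memcpy(out, text + run->start_index, len);
    out[len] = '\0';
}
```

With real data the runs would be filled from the attribute list of a
page, and non-ASCII text would need character offsets converted to byte
offsets before copying.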

Carlos Garcia Campos
PGP key: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x523E6462