[poppler] text extraction in raw order + text attributes
Carlos Garcia Campos
carlosgc at gnome.org
Sat Dec 7 03:43:24 PST 2013
Richard Wossal <richard at r-wos.org> writes:
> I'm trying to use poppler to extract text from PDFs, and I've found
> that using the "raw order" option gives better results (I can supply example
> files where non-raw order returns mangled text, if needed).
Yes, please. It would help to see some of those examples.
> This option is only exposed for the C++ bindings, not the Glib ones.
> I could use either binding, but I also need something like poppler-glib's
poppler_page_get_text, poppler_page_get_text_layout and
poppler_page_get_text_attributes return the text in reading order,
using heuristics to follow columns and tables. It's not perfect, of
course, since it's based on heuristics.
> As far as I can see, I could either:
> * hack something so I can extract text in raw-order using the Glib-bindings
> (I'd prefer staying C-only, but I don't see how this would be possible,
> except by adding it to the bindings)
> * or re-implement poppler_page_get_text_attributes in C++, using poppler's
> private API (or take poppler's implementation)
> What do you think would be the best way to go about that?
If you really need to get the text in raw order, we can add new methods
to the API for that. I'm thinking that maybe we could add a more generic
text iteration API with options like area, order and even the break
iterator (so that you can iterate over characters, words and lines).
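
To make the "break iterator" idea concrete, here is a minimal sketch in plain C of iterating over already extracted text by character, word or line. Nothing in it is poppler API: BreakKind, TextIter, text_iter_init and text_iter_next are hypothetical names invented for illustration, the text is assumed to be ASCII, and the area and order options mentioned above are left out entirely.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical granularity option; not part of any poppler API. */
typedef enum { BREAK_CHAR, BREAK_WORD, BREAK_LINE } BreakKind;

/* Hypothetical iterator over a NUL-terminated ASCII buffer. */
typedef struct {
    const char *text;
    size_t pos;       /* offset of the next unread byte */
    BreakKind kind;
} TextIter;

static void text_iter_init(TextIter *it, const char *text, BreakKind kind)
{
    assert(it != NULL && text != NULL);
    it->text = text;
    it->pos = 0;
    it->kind = kind;
}

/* Whether c separates units at the given granularity. In character mode
 * nothing is a separator, so whitespace is returned as a unit too. */
static int is_separator(BreakKind kind, char c)
{
    switch (kind) {
    case BREAK_WORD: return isspace((unsigned char)c) != 0;
    case BREAK_LINE: return c == '\n';
    default:         return 0;
    }
}

/* Stores the start offset and length of the next unit and returns 1,
 * or returns 0 once the text is exhausted. Empty words and empty lines
 * are skipped. */
static int text_iter_next(TextIter *it, size_t *start, size_t *len)
{
    assert(it != NULL && start != NULL && len != NULL);
    const char *t = it->text;

    while (t[it->pos] != '\0' && is_separator(it->kind, t[it->pos]))
        it->pos++;
    if (t[it->pos] == '\0')
        return 0;

    *start = it->pos;
    if (it->kind == BREAK_CHAR) {
        it->pos++;
    } else {
        while (t[it->pos] != '\0' && !is_separator(it->kind, t[it->pos]))
            it->pos++;
    }
    *len = it->pos - *start;
    return 1;
}
```

A caller would initialise the iterator once with the desired granularity and call text_iter_next in a loop until it returns 0. A real API would also have to deal with UTF-8 and with the page area and ordering options, which this sketch does not attempt.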
> My use case, in case there's an even better way to do that: I'm trying to
> heuristically extract titles and authors of PDFs without usable metadata.
> The backend has a bunch of rules like "the thing with the biggest font
> size is probably the title". This works surprisingly well - except for
> said PDFs where poppler_page_get_text only returns garbage, obviously.
What exactly is garbage?
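
For reference, the "biggest font size is probably the title" rule quoted above can be sketched in plain C as follows. This is only an illustration under stated assumptions: TextSpan is a hypothetical stand-in for an attribute span (a font size applied to an inclusive range of the extracted text), the helper names are invented, and offsets are treated as byte offsets, which only holds for ASCII text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical attribute span: font_size applies to the inclusive range
 * [start_index, end_index] of the extracted text. Byte offsets are
 * assumed, so this sketch is only valid for ASCII text. */
typedef struct {
    double font_size;
    int start_index;
    int end_index;
} TextSpan;

/* Returns the index of the span with the largest font size, or -1 if
 * there are no spans. On ties the earliest span wins. */
static int largest_font_span(const TextSpan *spans, size_t n_spans)
{
    int best = -1;
    for (size_t i = 0; i < n_spans; i++) {
        if (best < 0 || spans[i].font_size > spans[best].font_size)
            best = (int)i;
    }
    return best;
}

/* Copies the text covered by span into out, NUL-terminated and truncated
 * to fit out_size. Returns the number of bytes copied, or 0 if the span
 * does not lie within the text. */
static size_t copy_span_text(const char *text, const TextSpan *span,
                             char *out, size_t out_size)
{
    assert(text != NULL && span != NULL && out != NULL && out_size > 0);
    size_t text_len = strlen(text);

    out[0] = '\0';
    if (span->start_index < 0 || span->end_index < span->start_index)
        return 0;

    size_t start = (size_t)span->start_index;
    size_t end = (size_t)span->end_index + 1;   /* exclusive end */
    if (start >= text_len)
        return 0;
    if (end > text_len)
        end = text_len;

    size_t len = end - start;
    if (len > out_size - 1)
        len = out_size - 1;
    memcpy(out, text + start, len);
    out[len] = '\0';
    return len;
}
```

The candidate title is then the text of the span returned by largest_font_span. If the extracted text itself is mangled, the span still selects the right range but the copied string is wrong, which is why the extraction order matters for this heuristic.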
Carlos Garcia Campos
PGP key: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x523E6462