[poppler] PDF tables and parsing errors

Adrián Pérez de Castro aperez at igalia.com
Fri Feb 26 11:54:43 UTC 2016


Hi there Jeroen,

Quoting Jeroen Ooms (2016-02-26 12:40:14)
> We are using poppler for parsing and indexing scientific articles. For
> this purpose I wrote some bindings to poppler-cpp for the R
> programming language. A few questions:
> 
>  - Many of our pdf files give parsing errors, such as "Failed to get
> object num from hint tables" or "Expected the optional content group
> list, but wasn't able to find it" or "insufficient arguments for
> Marked Content". Examples of problematic pdf files are here:
> https://github.com/sckott/pdftoolspdfs. Are all of these pdf files
> corrupted or are these limitations in poppler? Each of these files
> seems to open just fine in any pdf reader.
> 
> - Is there any sensible way to extract tabular data from pdf documents
> in a machine readable form (such as xml or csv or html)? I noticed
> that pdftotext with the -layout option does a really nice job
> positioning the table contents so I suppose poppler must have picked
> up on the table internally?

Unfortunately, it's not that easy. Tables in PDFs are just streams of commands
that paint lines and text at certain positions, as is the case for most of the
content in PDFs. Nothing in those commands says that a given piece of text is
a table cell.
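
To make that concrete, here is a minimal Python sketch of the kind of guessing
a tool has to do when all it sees is positioned text. It is not Poppler code:
the fragment coordinates, the group_rows helper and the tolerance value are all
made up for illustration, and a real document (merged cells, wrapped text,
rotated pages) would defeat a heuristic this simple.

```python
# Hypothetical input: (x, y, text) fragments, roughly what is left once the
# drawing commands of a page have been interpreted. PDF user space has its
# origin at the bottom left, so a larger y means higher up on the page.
fragments = [
    (72.0, 700.0, "Sample"), (200.0, 700.0, "Mass"),
    (72.0, 685.2, "A"), (200.0, 685.0, "1.5"),
    (72.0, 670.0, "B"), (200.0, 670.1, "2.0"),
]


def group_rows(fragments, y_tolerance=2.0):
    """Cluster fragments with nearly equal baselines into rows (a heuristic)."""
    rows = []  # list of (reference_y, [(x, text), ...])
    # Walk the fragments from the top of the page to the bottom.
    for x, y, text in sorted(fragments, key=lambda f: (-f[1], f[0])):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Within each row, order the cells from left to right.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


table = group_rows(fragments)
for row in table:
    print(row)
```

Note that the result is only a guess based on geometry; the tolerance has to be
tuned per document, and nothing tells us which row is a header.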

Your best chance of getting actual information about the structure of tables is
to use Tagged PDFs, which include additional semantic information about the
contents of the pages. We have support in Poppler for reading the Tagged-PDF
bits, but none of the “pdfto*” conversion tools uses it. For a rough example of
how to do this, you can check the code for “pdfstructtohtml” [1] which,
unfortunately, is not included in official releases.
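
To give an idea of why the tagged structure helps, here is a small Python
sketch that walks a structure tree and writes its tables as CSV. It does not
use Poppler's API and is not how “pdfstructtohtml” works: the nested-tuple
tree and the helper names are invented for illustration. Only the element
names “Table”, “TR”, “TH” and “TD” are real, being standard structure types
for Tagged PDF.

```python
import csv
import io

# Hypothetical in-memory structure tree: each node is (type, content), where
# content is either a list of child nodes or the text of a leaf element.
tree = ("Document", [
    ("P", "Results are summarised below."),
    ("Table", [
        ("TR", [("TH", "Sample"), ("TH", "Mass")]),
        ("TR", [("TD", "A"), ("TD", "1.5")]),
    ]),
])


def find_tables(node):
    """Yield every Table element in the tree, depth first."""
    kind, content = node
    if kind == "Table":
        yield node
    elif isinstance(content, list):
        for child in content:
            yield from find_tables(child)


def table_rows(table):
    """Return the cell texts of a Table element, one list per TR child."""
    return [[text for kind, text in cells if kind in ("TH", "TD")]
            for row_kind, cells in table[1] if row_kind == "TR"]


def to_csv(rows):
    """Serialise a list of rows as CSV text."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


csv_text = "".join(to_csv(table_rows(t)) for t in find_tables(tree))
print(csv_text, end="")
```

The sketch only handles rows placed directly under the table; tagged documents
may also group rows under “THead”, “TBody” or “TFoot” elements, and cells can
span several rows or columns, so real code needs to handle those cases too.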

I hope that helps!

--
 ⌨ Adrian

---
[1] https://github.com/aperezdc/poppler/blob/tagged-pdf-utils/utils/pdfstructtohtml.cc