[poppler] PDF tables and parsing errors

Jeroen Ooms jeroen.ooms at stat.ucla.edu
Fri Feb 26 10:40:14 UTC 2016


We are using poppler for parsing and indexing scientific articles. For
this purpose I wrote some bindings to poppler-cpp for the R
programming language. A few questions:

- Many of our PDF files produce parsing errors, such as "Failed to get
object num from hint tables", "Expected the optional content group
list, but wasn't able to find it", or "insufficient arguments for
Marked Content". Examples of problematic PDF files are here:
https://github.com/sckott/pdftoolspdfs. Are all of these files
corrupted, or are these limitations in poppler? Each of these files
seems to open just fine in any PDF reader.

- Is there any sensible way to extract tabular data from PDF documents
in a machine-readable form (such as XML, CSV, or HTML)? I noticed
that pdftotext with the -layout option does a really nice job of
positioning the table contents, so I suppose poppler must have picked
up on the table internally?
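
To illustrate the kind of output I am after, here is a minimal Python
sketch that post-processes text saved from "pdftotext -layout". It is
only a heuristic and not something poppler provides: it assumes that
table columns are separated by runs of two or more spaces. The function
names and the sample table are hypothetical, made up for this example.

```python
import csv
import io
import re

# Assumption: in layout-preserved text, columns are separated by runs of
# two or more spaces, while single spaces stay inside a cell.
CELL_GAP = re.compile(r" {2,}")


def layout_text_to_rows(text):
    """Split layout-preserved text into rows of cells (heuristic)."""
    rows = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines between table rows
        rows.append(CELL_GAP.split(stripped))
    return rows


def rows_to_csv(rows):
    """Serialize rows of cells as CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Hypothetical sample resembling layout-preserved table text.
sample = (
    "Species        Count    Mean mass (g)\n"
    "\n"
    "Parus major       12             17.5\n"
    "Sitta europaea     4             22.0\n"
)

if __name__ == "__main__":
    print(rows_to_csv(layout_text_to_rows(sample)), end="")
```

This breaks down as soon as a cell is empty or a column gap shrinks to
a single space, which is why I am hoping poppler can expose something
more structured.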

