[poppler] PDF tables and parsing errors

Sat Feb 27 10:48:32 UTC 2016

On Friday 26 February 2016, at 11:40:14, Jeroen Ooms wrote:
> We are using poppler for parsing and indexing scientific articles. For
> this purpose I wrote some bindings to poppler-cpp for the R
> programming language. A few questions:
> 
>  - Many of our pdf files give parsing errors, such as "Failed to get
> object num from hint tables" or "Expected the optional content group
> list, but wasn't able to find it" or "insufficient arguments for
> Marked Content". Examples of problematic pdf files are here:
> https://github.com/sckott/pdftoolspdfs. Are all of these pdf files
> corrupted or are these limitations in poppler? Each of these files
> seems to open just fine in any pdf reader.

Do they render incorrectly in poppler-based readers?

Cheers,
  Albert

> 
> - Is there any sensible way to extract tabular data from pdf documents
> in a machine readable form (such as xml or csv or html)? I noticed
> that pdftotext with the -layout option does a really nice job
> positioning the table contents so I suppose poppler must have picked
> up on the table internally?
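
For what it's worth, the pdftotext man page describes -layout as maintaining
the original physical layout of the text, which is not the same thing as
recognizing a table. One rough heuristic is to post-process that output
yourself and treat runs of two or more spaces as column separators. The
sketch below is only an illustration under that assumption: the function
names and the sample text are made up, it is not anything poppler provides,
and it will break on empty cells or on cell text that itself contains
double spaces.

```python
import csv
import io
import re


def layout_text_to_rows(text):
    """Split layout-preserved text into rows of cells.

    Assumption (hypothetical heuristic, not poppler behaviour): columns are
    separated by at least two spaces, and single spaces belong to a cell.
    """
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines between table rows
        rows.append(re.split(r"\s{2,}", line.strip()))
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with Unix line endings."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Made-up sample standing in for text produced with the -layout option.
sample = (
    "Sample     Mass (g)    Yield\n"
    "A1         12.5        0.81\n"
    "B2          9.0        0.77\n"
)

if __name__ == "__main__":
    print(rows_to_csv(layout_text_to_rows(sample)), end="")
```

Splitting on whitespace width is the simplest thing that could work. A more
robust approach would probably need the per-word bounding boxes rather than
the flattened text.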