[poppler] alternatives to pdftohtml to extract text with formatting

Martin Schröder martin at oneiros.de
Thu Apr 19 23:32:29 PDT 2012


2012/4/20 Ihar `Philips` Filipau <thephilips at gmail.com>:
> What that means - "properly tagged"?

Conforming to PDF/A-1a or PDF/UA.
See Section 14.8 ("Tagged PDF") of ISO 32000-1:2008.
https://en.wikipedia.org/wiki/PDF#Logical_structure_and_accessibility
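
As a rough first check, a tagged PDF should carry a /StructTreeRoot entry
and a mark information dictionary with /Marked true in its catalog. Here is
a minimal Python sketch that scans the raw bytes for both. The function name
looks_tagged is made up for illustration, and this is only a heuristic: it
misses files that keep the catalog in a compressed object stream, and finding
both markers does not prove conformance to PDF/A-1a or PDF/UA.

```python
import re


def looks_tagged(pdf_bytes: bytes) -> bool:
    """Heuristic check for the two markers of a tagged PDF.

    Assumption: the document catalog is stored uncompressed, so the
    names appear literally in the file's bytes.
    """
    # The structure tree root is referenced from the document catalog.
    has_struct_tree = re.search(rb"/StructTreeRoot\b", pdf_bytes) is not None
    # The mark information dictionary must set /Marked to true.
    is_marked = re.search(rb"/Marked\s+true\b", pdf_bytes) is not None
    return has_struct_tree and is_marked
```

Use it as looks_tagged(open("file.pdf", "rb").read()). For a real answer
you need a validator for the standard in question.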

> Or probably the other way around: which producers create "properly tagged" PDFs?

AFAIK LibreOffice, Word, and ConTeXt can do that.

Best
   Martin

