[poppler] alternatives to pdftohtml to extract text with formatting

Ihar `Philips` Filipau thephilips at gmail.com
Fri Apr 20 04:33:18 PDT 2012


On 4/20/12, Martin Schröder <martin at oneiros.de> wrote:
> 2012/4/20 Ihar `Philips` Filipau <thephilips at gmail.com>:
>> What does "properly tagged" mean?
>
> Conforming to PDF/A-1a or PDF/UA.
> See Section 14.8 of ISO 32000-1:2008.
> https://en.wikipedia.org/wiki/PDF#Logical_structure_and_accessibility

That stuff is too new to be widely available. Anyway, I'm stuck with
PDFs created in the late 1990s and early 2000s.

>> Or perhaps the other way around: which producers create "properly tagged"
>> PDFs?
>
> AFAIK LibreOffice, Word, ConTeXt can do that.
>

Just tested with LibreOffice 3.5.2 and Okular 0.13.3 on Linux: no
effect. Bold and italics are still lost on copy-paste.
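
A test like this is easier to interpret if one first checks whether the
file is marked as tagged at all. A minimal sketch, assuming poppler's
pdfinfo is on the PATH and prints a "Tagged:" line (the helper names
parse_tagged and is_tagged are made up for illustration). It only
reports whether the PDF declares itself tagged; it says nothing about
whether a viewer uses the structure when copying text.

```python
import subprocess
from typing import Optional


def parse_tagged(pdfinfo_output: str) -> Optional[bool]:
    """Read the 'Tagged:' field from pdfinfo's text output.

    Returns True or False for 'yes'/'no', or None if the field is absent.
    """
    for line in pdfinfo_output.splitlines():
        # Split on the first colon only, so values containing ':' are safe.
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Tagged":
            return value.strip().lower().startswith("yes")
    return None


def is_tagged(path: str) -> Optional[bool]:
    """Run pdfinfo on a file (assumes pdfinfo is installed) and parse the result."""
    result = subprocess.run(
        ["pdfinfo", path],
        capture_output=True,
        text=True,
        errors="replace",  # metadata may contain bytes that are not valid UTF-8
        check=True,
    )
    return parse_tagged(result.stdout)
```

Calling is_tagged("test.pdf") on the exported file (the file name is a
placeholder) would show whether the lost formatting is a producer-side or
a viewer-side issue.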

wbr.

