[poppler] alternatives to pdftohtml to extract text with formatting

Martin Schröder martin at oneiros.de
Fri Apr 20 04:39:46 PDT 2012


2012/4/20 Ihar `Philips` Filipau <thephilips at gmail.com>:
> That stuff is too new to be broadly available. Anyway, I'm stuck with
> PDFs created in the late 90s and early 2000s.

Then you can only do some kind of OCR. :-)

> Just tested with LibreOffice 3.5.2 & Okular 0.13.3 on Linux - no
> effect: bold and italics are lost during copy-paste.

I didn't say that Okular can handle Tagged PDF.
Anyway, styles like "bold" and "italic" are outside the scope of Tagged PDF.

Best
   Martin

