[poppler] alternatives to pdftohtml to extract text with formatting

Thu Apr 19 16:26:17 PDT 2012

On 4/20/12, Martin Schröder <martin at oneiros.de> wrote:
> 2012/4/20 Ihar `Philips` Filipau <thephilips at gmail.com>:
>> N.B. That's by the way the reasons for the question earlier: can I get
>> somehow formatted text from Okular via Copy/Paste or not? I'd love to
>> be able to open Okular/etc, press "Select All", "Copy", switch to OO
>> Writer and press "Paste". But that simply doesn't work.
>
> It would work if the PDF would be properly tagged and Okular would
> handle tagged content. Everything else is just some kind of OCR. :-)
>

What does "properly tagged" mean?

Or, the other way around: which producers create "properly tagged" PDFs?

I have tried a number of PDFs, produced mainly by Adobe tools
(Distiller, PDF Writer, PScriptNN.dll, ADOBEPSn.DRV) but also by
something called "FineReader" and "5D PDF Creator", and the text of
none of them is copied to the clipboard with formatting, either by
Okular on Linux or by Foxit Reader on Windows. LibreOffice's "Paste
Special" (regardless of OS) shows only plain text as available in the
clipboard. I tested mainly with italic text, but also came across a
few words in bold: all were copied as plain text.

What am I doing wrong?