[poppler] alternatives to pdftohtml to extract text with formatting
Ihar `Philips` Filipau
thephilips at gmail.com
Thu Apr 19 17:32:17 PDT 2012
On 4/20/12, Ihar `Philips` Filipau <thephilips at gmail.com> wrote:
> On 4/20/12, Martin Schröder <martin at oneiros.de> wrote:
>> 2012/4/20 Ihar `Philips` Filipau <thephilips at gmail.com>:
>>> N.B. That is, by the way, the reason for my earlier question: can I
>>> somehow get formatted text from Okular via Copy/Paste or not? I'd love
>>> to be able to open Okular/etc, press "Select All", "Copy", switch to OO
>>> Writer and press "Paste". But that simply doesn't work.
>> It would work if the PDF were properly tagged and Okular handled
>> tagged content. Everything else is just some kind of OCR. :-)
> What does "properly tagged" mean?
> Or perhaps the other way around: which producers create "properly
> tagged" PDFs?
> I have tried a number of PDFs, produced mainly by Adobe tools
> (Distiller, PDF Writer, PScriptNN.dll, ADOBEPSn.DRV) but also by
> something called "FineReader" and "5D PDF Creator" - and the text of
> none of them is copied to the clipboard with formatting, either by
> Okular (on Linux) or by FoxItReader (on Windows). LibreOffice's
> "Paste Special" (regardless of OS) shows only plain text as available
> in the clipboard. I have tested mainly on italic text, but also
> stumbled upon a few words in bold: all were copied as plain text.
> What am I doing wrong?
I have installed Adobe Reader X on one of my Windows VMs, and this one
does extract formatting: in LO's "Paste Special" dialog, "Formatted Text
(RTF)" appears as an option.
There is nothing like that with Okular (version 0.13.3, KDE 4.7.4) on Linux.
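
For anyone who wants to check whether a given file even claims to be
tagged, here is a rough sketch in Python. It is only a heuristic under
stated assumptions: the function name looks_tagged is made up for
illustration (it is not part of poppler or Okular), and it simply scans
the raw bytes for a /StructTreeRoot entry and a "/Marked true" flag,
which is how a tagged PDF normally announces itself in its catalog.

```python
import re


def looks_tagged(pdf_bytes: bytes) -> bool:
    """Crude heuristic, hypothetical helper (not a poppler/Okular API).

    Returns True if the raw file contains both a /StructTreeRoot entry
    and a "/Marked true" flag. It does not parse the PDF, so a catalog
    stored inside a compressed object stream will not be seen.
    """
    has_struct_tree = b"/StructTreeRoot" in pdf_bytes
    marked_true = re.search(rb"/Marked\s+true", pdf_bytes) is not None
    return has_struct_tree and marked_true


if __name__ == "__main__":
    import sys

    # Usage: python looks_tagged.py file.pdf [more.pdf ...]
    for path in sys.argv[1:]:
        with open(path, "rb") as fh:
            verdict = "maybe tagged" if looks_tagged(fh.read()) else "no tags seen"
        print(f"{path}: {verdict}")
```

A False result is not conclusive, since newer PDFs may keep the catalog
in a compressed object stream where a byte scan cannot see it, and a
True result says nothing about whether the tags are any good. As far as
I know, pdfinfo from poppler-utils also prints a "Tagged:" line, which
is probably the more reliable check.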