[poppler] pdftohtml lets you run random shell commands

Thu Apr 19 15:14:36 PDT 2012

On 4/19/12, Albert Astals Cid <aacid at kde.org> wrote:
> El Dijous, 19 d'abril de 2012, a les 14:43:56, William Bader va escriure:
>> I don't understand why converting pdf to html requires gs to rasterize to
>> png, especially when pdftoppm can generate png.
>
> Me neither. We inherited pdftohtml from some other dead project via krh
> forcing it on us when he was the maintainer. If it had been my decision,
> pdftohtml would not be part of poppler; its code quality is worse than
> the rest of poppler.
>
> I'm all for removing it, but that might bring some unwanted death threats
>

I will not send death threats, I promise ;)

Look at it from another side: this is the only version of pdftohtml
that is still somewhat maintained.

AND pdf2html/pdf2xml is currently the only way to get (most of) the
text formatting out of PDFs.

I personally would love to have an alternative, but there is none.

N.B. That, by the way, is the reason for my earlier question: can I
somehow get formatted text out of Okular via copy/paste or not? I'd
love to be able to open Okular/etc., press "Select All", "Copy", switch
to OO Writer and press "Paste". But that simply doesn't work.

But I guess 99% of the crowd here is interested solely in how a PDF
looks on the screen - not in how to reverse-engineer the information
back out of it. So the lack of interest isn't surprising to me.

If there are any particular ideas on how to improve the code quality,
I'm open to suggestions at the moment.

But frankly, the code has to be "forked" first: the integration of
Splash added another use case, and untangling all the dependencies
would be a hell of a lot of work. There are two HTML modes, a Splash
mode and an XML mode.

If I were to start cleaning up the code, I would first copy
pdftohtml/HtmlOutputDev into three separate applications (pdftohtml,
pdftosplash, pdftoxml), then start removing the redundant stuff and
looking for what can be reused. The better part of the code is
copy-pasted from (an old version of) TextOutputDev anyway.

But the problem here is the way the code is organized, and I mean the
code of poppler itself as well. For example, I can't reuse the code
from TextOutputDev in HtmlOutputDev, because there is no stable
in-memory representation of the PDF; every OutputDev has to invent its
own. Thus all the logic and algorithms they already implement are not
reusable.

Everything said above is the result of a rather cursory code review,
and I'd love to be proven wrong, e.g. by someone giving me at least
rough instructions on how one could implement pdftoxml using poppler's
cpp interface. Last time I looked, I stopped at poppler::page::text(),
the only method I found that provides access to a page's text - but it
returns the text without any formatting, font information or text
coordinates, and thus is useless for this purpose. I found no DOM-like
object giving me access to the PDF's innards. So one is back to PDFDoc,
back to reinventing yet another OutputDev...
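
To make the "stable in-memory representation" idea a bit more concrete,
here is a rough sketch of what I have in mind. Everything in it is
hypothetical: TextSpan, PageModel, escapeMarkup, toXml and toHtml are
names made up for illustration, and none of them exist in poppler. The
point is only that once the text, font and coordinates live in one
shared structure, the XML and HTML writers become small and independent
of each other.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical shared model; these types are NOT part of poppler.
struct TextSpan {
    std::string text;      // the characters of the run
    std::string fontName;  // font used for the run
    double fontSize;       // size in points
    double x, y;           // position on the page, in points
    bool bold;             // minimal formatting flag
};

struct PageModel {
    int number;            // 1-based page number
    double width, height;  // page size in points
    std::vector<TextSpan> spans;
};

// Escape the characters that are special in both XML and HTML.
std::string escapeMarkup(const std::string &s) {
    std::string out;
    for (char c : s) {
        switch (c) {
        case '&': out += "&amp;"; break;
        case '<': out += "&lt;"; break;
        case '>': out += "&gt;"; break;
        case '"': out += "&quot;"; break;
        default:  out += c; break;
        }
    }
    return out;
}

// XML writer: one <text> element per span, with coordinates and font.
std::string toXml(const PageModel &page) {
    assert(page.number >= 1);
    std::ostringstream out;
    out << "<page number=\"" << page.number << "\" width=\"" << page.width
        << "\" height=\"" << page.height << "\">\n";
    for (const TextSpan &s : page.spans) {
        out << "  <text x=\"" << s.x << "\" y=\"" << s.y << "\" font=\""
            << escapeMarkup(s.fontName) << "\" size=\"" << s.fontSize
            << "\">" << escapeMarkup(s.text) << "</text>\n";
    }
    out << "</page>\n";
    return out.str();
}

// HTML writer: reads the same model, emits absolutely positioned spans.
std::string toHtml(const PageModel &page) {
    std::ostringstream out;
    out << "<div class=\"page\" style=\"position:relative;width:"
        << page.width << "pt;height:" << page.height << "pt\">\n";
    for (const TextSpan &s : page.spans) {
        out << "  <span style=\"position:absolute;left:" << s.x
            << "pt;top:" << s.y << "pt;font-size:" << s.fontSize << "pt\">";
        if (s.bold) out << "<b>";
        out << escapeMarkup(s.text);
        if (s.bold) out << "</b>";
        out << "</span>\n";
    }
    out << "</div>\n";
    return out.str();
}
```

With something like this, a single OutputDev would only have to fill a
PageModel, and pdftoxml and pdftohtml would just be different writers
on top of it. How to fill the model from a PDF is exactly the part
that is missing today, so the sketch deliberately leaves it out.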