[poppler] pdftotext feature request: user-specified toUnicode-like tables

Tue Jun 11 16:45:38 PDT 2013

On 6/12/13, Albert Astals Cid <aacid at kde.org> wrote:
> On Wednesday, 12 June 2013, at 00:34:26, Ihar `Philips` Filipau
> wrote:
>> On 6/11/13, Jeff Lerman <jclerman at jefflerman.net> wrote:
>> > Yes, indicating words is an advantage - but failing to indicate that a
>> > given character in a word is in a given font is a bug.
>>
>> This is about the right time to tell you the thing: the focus of
>> poppler is the on-screen representation of the PDF, not helping to
>> extract information from PDFs. Otherwise, a year ago I would have
>> flooded the place with patches. :D
>
> Well, that's a self-fulfilling prophecy: you think that area is not
> important, so you never send the patches, that area never gets love,
> and the circle never ends.
>

It's not like that. Was probably a bad choice of words on my part.

The crux of the problem is that PDFs which are easy to convert do not
require any special attention, and even pdftohtml does a very decent
job on them. But if a PDF has a conversion problem, then there is no
generic way to work around it.

Otherwise, yes, I had some ideas (and even a sketch) for a generic
API that would expose various bits of raw information from a PDF as a
DOM-like structure. But there are several problems with the approach:

1. A DOM doesn't fit the paginated nature of PDF well. The generic
nature of such an API would also sacrifice quite a lot of efficiency,
both CPU-wise and RAM-wise.

2. It is an API, and as such is useless to end users. (And even for
most developers, raw bits of PDF would be way too low-level to be
immediately useful.)

3. For many use cases it is redundant, thanks to the plethora of
pdf2xml tools lying around on the web. Reading XML programmatically
is easy and already gives an application a form of in-memory DOM of
the PDF.

4. Needless to say, I simply lack sufficient PDF and poppler
knowledge to actually implement it to some level of usefulness.
(That's why I tried asking some technical questions on the list. I
got no responses.)

I took the sketch as far as I could, expecting to hit #4. But instead
it was the combination of #2 and #3 which persuaded me to abandon it:
the people asking related questions on the list are all users, not
developers, and a DOM in some fashion is already available through
the assortment of pdf2xml tools.
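
To illustrate #3, here is a minimal Python sketch of reading such XML
into memory and pairing each text run with its font. The sample
document is hand-written and only loosely modelled on what
`pdftohtml -xml` emits; the element and attribute names (page,
fontspec, text, font, family) are assumptions and should be checked
against the real output of whatever pdf2xml tool is used. The helper
name text_runs_with_fonts is made up for this example.

```python
import xml.etree.ElementTree as ET

# Hand-written sample, loosely modelled on `pdftohtml -xml` output.
# Element and attribute names are assumptions, not a documented schema.
SAMPLE = """<pdf2xml>
  <page number="1" width="612" height="792">
    <fontspec id="0" size="12" family="Times" color="#000000"/>
    <fontspec id="1" size="12" family="Symbol" color="#000000"/>
    <text top="100" left="72" width="40" height="14" font="0">alpha = </text>
    <text top="100" left="112" width="8" height="14" font="1">a</text>
  </page>
</pdf2xml>"""


def text_runs_with_fonts(xml_source):
    """Return (page number, font family, text) for every text run."""
    root = ET.fromstring(xml_source)
    # Collect font declarations from the whole document, so a font
    # declared on one page can be resolved from any other page.
    fonts = {f.get("id"): f.get("family") for f in root.iter("fontspec")}
    runs = []
    for page in root.iter("page"):
        page_number = int(page.get("number"))
        for text in page.iter("text"):
            # itertext() also picks up text nested in child elements.
            content = "".join(text.itertext())
            runs.append((page_number, fonts.get(text.get("font")), content))
    return runs


if __name__ == "__main__":
    for run in text_runs_with_fonts(SAMPLE):
        print(run)
```

The parsed tree is the "in-memory DOM" in question: the application
walks pages and text runs without touching poppler at all, which is
why a separate DOM-like API looked redundant for these use cases.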

wbr.