[poppler] pdftotext feature request: user-specified toUnicode-like tables

Tue Jun 11 15:34:26 PDT 2013

On 6/11/13, Jeff Lerman <jclerman at jefflerman.net> wrote:
> Yes, indicating words is an advantage - but failing to indicate that a
> given character in a word is in a given font is a bug.

This is about the right time to tell you: the focus of poppler is
the on-screen representation of the PDF, not helping to extract
information from PDFs. Otherwise, I would have flooded the place
with patches a year ago. :D

To paint a character, one doesn't need to know its Unicode value - the
raw character code is an index into the font's glyphs/etc for that
character. The Unicode value of a character is only needed for
copy-pasting. (Some PDF software intentionally strips the Unicode
mapping tables to make copy-paste/text extraction unusable.)
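
To make the distinction concrete, here is a toy sketch in Python. It is
not poppler code: ToyFont, paint and extract are hypothetical names
invented for this illustration, and the glyph "outlines" are plain
strings. It only shows that painting works from the raw code alone,
while extraction depends on a mapping table that may be missing.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyFont:
    """Hypothetical stand-in for an embedded PDF font (not a poppler type)."""
    name: str
    glyphs: Dict[int, str]  # raw character code -> glyph "outline"
    to_unicode: Dict[int, str] = field(default_factory=dict)  # may be stripped


def paint(font: ToyFont, codes: List[int]) -> List[str]:
    # Painting only needs the raw code to pick a glyph; Unicode is never consulted.
    return [font.glyphs[c] for c in codes]


def extract(font: ToyFont, codes: List[int]) -> List[Optional[str]]:
    # Copy-paste/extraction needs the code -> Unicode table; None marks a gap.
    return [font.to_unicode.get(c) for c in codes]


full = ToyFont("Demo", {1: "<outline of H>", 2: "<outline of i>"}, {1: "H", 2: "i"})
stripped = ToyFont("Demo", full.glyphs)  # same glyphs, mapping table removed

print(paint(stripped, [1, 2]))    # still paints both glyphs
print(extract(full, [1, 2]))      # ['H', 'i']
print(extract(stripped, [1, 2]))  # [None, None]
```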

Otherwise, it is worth your while to google "pdf2htmlEX" and/or
"pdftohtmlEX". Search for the precise terms. There are several projects
on the net (one of them is definitely based on poppler) focusing on
extracting text/etc from PDFs into HTML with a high level of fidelity.
That would probably be more helpful to you than forcing poppler to do
something it is not designed to do.

> Now, when I use pdftohtml (I'll include the actual command below too), I
> get a file that includes:
>  .....
> as you can see, the font "MathematicalPi-One" is not noted as being the
> correct one for that numeral "1".  There is no way to find out the
> actual fonts being used, on a per-character basis, for the text in the
> PDF file.

That is what I meant by saying that pdftohtml erroneously merges some
fonts. But this is not per se a bug. Conversion of a PDF into HTML is
at best an approximate process, primarily optimized to display an
average PDF in a readable fashion.

> Of course, pdftotext -bbox provides no font info at all.

That is what my patch amends. But see below.

> So, that's what I mean about pdftohtml being buggy - it provides an
> unreliable indication of which font was used for each character.

It's not buggy - it is not designed for that purpose.

In fact, modulo the "Tagged PDF" feature, PDF is not designed to
represent text per se. The most common PDF is just a container of
vector graphics. Some of those graphics draw text. Extraction of text
is literally based on intercepting a text drawing operation and,
instead of drawing the text, dumping it into a file/etc.
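
A minimal sketch of that interception idea, in Python. All names here
(PaintDevice, TextDumpDevice, interpret) and the toy "content stream"
are hypothetical and made up for this illustration; they are not
poppler's classes, and real PDF content streams are far richer.

```python
from typing import List, Tuple

# A toy "content stream": (operator, operand) pairs.
Op = Tuple[str, str]


class PaintDevice:
    """Hypothetical device that 'draws' everything it is given."""

    def __init__(self) -> None:
        self.canvas: List[str] = []

    def draw_path(self, path: str) -> None:
        self.canvas.append(f"stroke {path}")

    def draw_text(self, text: str) -> None:
        self.canvas.append(f"glyphs for {text!r}")


class TextDumpDevice(PaintDevice):
    """Same interface, but text operations are captured instead of drawn."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def draw_path(self, path: str) -> None:
        pass  # graphics are irrelevant for extraction

    def draw_text(self, text: str) -> None:
        self.chunks.append(text)  # intercept: record rather than paint


def interpret(stream: List[Op], device: PaintDevice) -> None:
    # The interpreter does not know which kind of device it is driving.
    for operator, operand in stream:
        if operator == "path":
            device.draw_path(operand)
        elif operator == "text":
            device.draw_text(operand)


page = [("path", "rect 0 0 10 10"), ("text", "Hello"), ("text", "world")]
dump = TextDumpDevice()
interpret(page, dump)
print(dump.chunks)  # ['Hello', 'world']
```

Note that the dump device only sees text in drawing order, which is one
reason why the extracted result is approximate.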

@Leonard, please don't hit me. /me *cowers*. :D

> Hmm, OK.  I'm a little concerned, looking at the code, that assumptions
> about how to map a character from a given font are made on a whole-font
> basis, not per-character.
> I'm not sure if there is support for fallback
> mechanisms in the algorithms that convert a PDF character to Unicode for
> pdftotext.  For example, if a document has font X and I know that
> character A in that font should be remapped to Z, but I have no
> information on some other character B, I want to be able to specify the
> A->Z remapping without affecting whatever default is used to show the B
> character.  I'm not sure if the code simply looks for the existence of a
> certain kind of translation table for each font and then assumes that
> the table is always complete - that would be sub-optimal for my
> use-case.  Can someone shed light on that question?

The toUnicode table is per-font. But, for example, the normal, bold,
italic and bold+italic variants are 4 different fonts. That is why the
merge is needed for HTML.

There should already be a place to hook in a Unicode mapping table,
because there is already a place in the code (I've seen it once) which
extracts the font-specific Unicode mapping table from the PDF.
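
For what it's worth, the per-character fallback you describe could look
roughly like the Python sketch below. This is an assumption about how
such a hook might behave, not existing poppler code: the to_unicode
function, both table layouts and the sample data are hypothetical.

```python
from typing import Dict, Optional

# font name -> {raw character code -> Unicode string}
Table = Dict[str, Dict[int, str]]


def to_unicode(font: str, code: int, user: Table, embedded: Table) -> Optional[str]:
    """Resolve one character: user override first, then the font's own table."""
    override = user.get(font, {}).get(code)
    if override is not None:
        return override
    # No override for this code: fall through to whatever the PDF provides.
    return embedded.get(font, {}).get(code)


# Assumed sample data: font "X" maps code 0x41 to "A" and 0x42 to "B".
embedded = {"X": {0x41: "A", 0x42: "B"}}
# The user only knows that code 0x41 in font "X" should really be "Z".
user = {"X": {0x41: "Z"}}

print(to_unicode("X", 0x41, user, embedded))  # Z
print(to_unicode("X", 0x42, user, embedded))  # B
print(to_unicode("X", 0x43, user, embedded))  # None
```

The user table is consulted per character, so an incomplete table only
affects the codes it actually lists.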

But that requires coding, and that coding is not relevant (at least
not at the moment) to the poppler project.

> Thanks!  But: Does your hack show the font name etc for each character
> in each word, or just a value for the word?  The former is what I'd need...

I don't remember, sorry. Probably not, because those "words" are used
to implement the search functions in the PDF viewers and as such have
different priorities (main prio: recognizing a sequence of letters as
a word). IOW, the presence of the font information there is purely
accidental, but it was helpful to me.

>
> Best,
> --Jeff

Regards.

>>
>> N.B. PDFs might have attachments. In the past, I once came across a
>> PDF without the font encodings - but with the source WinWord document
>> attached. Worth checking.
>>

-- 
Don't walk behind me, I may not lead.
Don't walk in front of me, I may not follow.
Just walk beside me and be my friend.
    -- Albert Camus (attributed to)