[poppler] pdftotext feature request: user-specified toUnicode-like tables

Jeff Lerman jclerman at jefflerman.net
Tue Jun 11 16:40:43 PDT 2013


On 6/11/2013 3:34 PM, Ihar `Philips` Filipau wrote:
> On 6/11/13, Jeff Lerman <jclerman at jefflerman.net> wrote:
>> Yes, indicating words is an advantage - but failing to indicate that a
>> given character in a word is in a given font is a bug.
> This is about the right time to tell you: the focus of poppler is
> the on-screen representation of PDFs, not extraction of
> information from them. Otherwise, a year ago I would have flooded
> the place with patches. :D
On the one hand, fair point - on the other hand, pdftotext is included 
in poppler, and as Albert has pointed out, things get better when they 
get worked on.
> To paint a character, one doesn't need to know its Unicode - the raw
> code point is an index of the font's glyphs/etc for the character. The
> Unicode of a character is only needed for copy-pasting. (Some PDF
> software intentionally strips the Unicode mapping tables to make
> copy-paste/text extraction unusable.)
>
> Otherwise, it is worth googling "pdf2htmlEX" and/or
> "pdftohtmlEX" (search for the precise terms). There are several projects
> on the net (one of them is definitely based on poppler) focused on
> extracting text/etc. from PDFs into HTML with a high level of fidelity.
> That would probably be more helpful to you than forcing poppler to do
> something it is not designed to do.
I'm looking up those packages now, and have checked out pdf2htmlEX. 
However, HTML is an unnecessary intermediate for our purposes. Also, I 
don't see evidence (happy to be corrected here) that pdf2htmlEX contains 
a table of mappings for obscure fonts, or a way for users to specify 
their own mapping tables.  Without that, it doesn't help me.
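
For reference, the per-font mapping a well-behaved PDF already carries is a
ToUnicode CMap attached to the font. A minimal example mapping character code
0x31 to U+0031 looks roughly like this (standard CMap syntax, abbreviated):

```
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
1 begincodespacerange
<00> <FF>
endcodespacerange
1 beginbfchar
<31> <0031>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end end
```

What I'm asking for is a way to supply this kind of per-font mapping
externally, for the cases where the PDF omits it or gets it wrong.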
>> Now, when I use pdftohtml (I'll include the actual command below too), I
>> get a file that includes:
>>   .....
>> as you can see, the font "MathematicalPi-One" is not noted as being the
>> correct one for that numeral "1".  There is no way to find out the
>> actual fonts being used, on a per-character basis, for the text in the
>> PDF file.
> That is what I meant by saying that pdftohtml erroneously merges some fonts.
> But this is not per se a bug. Conversion of a PDF into HTML is at
> best an approximate process, primarily optimized to display an average
> PDF in a readable fashion.
I would say that "erroneously merges" is indeed a bug.  (Unless the 
documentation specifies that behavior - in which case it might be a 
feature, but not a great one, I'd suggest).  If the mis-assignment of 
fonts is done in pdftotext as well, and if it is done before conversion 
to Unicode, then my request (even if I choose to try coding it myself) 
will be blocked, since the per-character font information is lost 
(discarded) by pdftotext.  Does anyone have any information on whether 
that is the case?
> ... In fact, modulo the "Tagged PDF" feature, PDF is not designed to
> represent text per se. The most common PDF is just a container of
> vector graphics, some of which happens to be drawings of text. Text
> extraction literally works by intercepting a text-drawing operation
> and, instead of drawing the text, dumping it into a file/etc.
> @Leonard, please don't hit me. /me *cowers*. :D
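
For anyone following along, the interception Ihar describes can be sketched as
follows. The class names here are hypothetical, not the actual poppler
OutputDev API - just an illustration of the pattern:

```cpp
// Sketch of "text extraction by interception": a rendering device whose
// text-drawing hook dumps characters instead of painting glyphs.
// These class names are hypothetical, not poppler's real OutputDev API.
#include <string>

struct OutputDevice {
  virtual ~OutputDevice() = default;
  // Called once per character the page would paint, with the Unicode
  // value the font's ToUnicode table produced (or a guess).
  virtual void drawChar(char32_t uni) = 0;
};

struct TextDumpDevice : OutputDevice {
  std::string text;
  void drawChar(char32_t uni) override {
    // Instead of rasterizing a glyph, append the character (ASCII-only
    // here to keep the sketch short; real code would emit UTF-8).
    if (uni < 0x80)
      text += static_cast<char>(uni);
  }
};
```

The same page traversal then yields either pixels or text, depending purely on
which device is plugged in - which is why a bad ToUnicode table breaks
extraction without affecting display.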
Yes, I know that PDF was not originally designed to represent 
machine-readable text, and instead is essentially optimized for 
human-readability only.  However, the fact is that in today's world some 
of us must extract text from PDFs anyway.

I stipulate that I am aware of the problem and that I am forced to face 
it anyway.
>> Hmm, OK.  I'm a little concerned, looking at the code, that assumptions
>> about how to map a character from a given font are made on a whole-font
>> basis, not per-character.
>> I'm not sure if there is support for fallback
>> mechanisms in the algorithms that convert a PDF character to Unicode for
>> pdftotext.  For example, if a document has font X and I know that
>> character A in that font should be remapped to Z, but I have no
>> information on some other character B, I want to be able to specify the
>> A->Z remapping without affecting whatever default is used to show the B
>> character.  I'm not sure if the code simply looks for the existence of a
>> certain kind of translation table for each font and then assumes that
>> the table is always complete - that would be sub-optimal for my
>> use-case.  Can someone shed light on that question?
> The ToUnicode table is per-font. But note that, for example, normal,
> bold, italic, and bold+italic are four different fonts; that is why
> the merge is needed for HTML.
>
> There should already be a place to hook in such a Unicode mapping table,
> because there is already a place in the code (I've seen it once) that
> extracts the font-specific Unicode mapping table from the PDF.
>

OK, here I will begin to delve into the code itself, at least to scope 
the problem.  This is important to our team and we're willing to devote 
resources to solving the problem and contributing code if people are 
willing to answer questions about aspects of the existing Poppler code.  
Happy to take some questions offline if there are particular folks who 
have the necessary expertise.

I think my feature request could be broken down like this:

1. Pass a file containing a custom set of font-specific "ToUnicode" 
mappings to pdftotext.
2. Ensure that pdftotext is correctly parsing and preserving, for each 
PDF character, the name of the font used (in the PDF) to represent it, 
as well as the character number.
3. Edit the appropriate code (in GfxFont.cc?) to "patch" the 
character-to-Unicode mapping using the table supplied in step 1.  The 
mapping should probably be treated as if it came from a "ToUnicode" 
table supplied in the PDF.
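
To make step 1 concrete, here is a minimal sketch of the kind of mapping file
I have in mind, and a parser for it. The file format and every name below are
hypothetical - none of this is existing poppler code:

```cpp
// Hypothetical user-supplied mapping file, one entry per line:
//   <font name> <hex char code> <hex Unicode code point>
// e.g.:
//   MathematicalPi-One 0x31 0x0031
// Lines starting with '#' are comments.
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <utility>

// Map (font name, character code) -> Unicode code point.
using FontCharKey = std::pair<std::string, unsigned int>;
using UserToUnicodeMap = std::map<FontCharKey, unsigned int>;

UserToUnicodeMap parseUserMap(std::istream &in) {
  UserToUnicodeMap table;
  std::string line;
  while (std::getline(in, line)) {
    if (line.empty() || line[0] == '#')
      continue;
    std::istringstream fields(line);
    std::string font;
    unsigned int code = 0, uni = 0;
    fields >> font >> std::hex >> code >> uni;
    if (!fields.fail())
      table[{font, code}] = uni;
  }
  return table;
}

// Per-character lookup with fallback: use the user-supplied mapping if
// one exists for this (font, code) pair; otherwise keep whatever default
// the existing code would have produced. This is exactly the A->Z
// semantics described earlier - unlisted characters are untouched.
unsigned int mapChar(const UserToUnicodeMap &table, const std::string &font,
                     unsigned int code, unsigned int defaultUni) {
  auto it = table.find({font, code});
  return it != table.end() ? it->second : defaultUni;
}
```

The key design point is that the table is sparse and consulted per character,
so a partial table never degrades characters it doesn't mention.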

At line 1182 in GfxFont.cc, I see the comment "// merge differences into 
encoding" (preceding a code block) which seems to be where a "patch" 
table such as the one I'm proposing should be utilized. However, I have 
questions:

a. It looks like that code block might only be used for certain fonts 
(it's in Gfx8BitFont::Gfx8BitFont).  Is that true?  If so, is there an 
analogous block for each of the other possible font types?
b. Is there any danger of the "patches" being ignored for certain fonts 
based on the "choose a cmap" logic outlined later?  In my copy there is 
a block of comments that begins:

   // To match up with the Adobe-defined behaviour, we choose a cmap
   // like this:
   // 1. If the PDF font has an encoding:
...

Any takers for any of the above steps 1-3?

Finally: I think allowing users to specify this kind of 
character-mapping table manually at runtime would contribute enormously 
to the usefulness of pdftotext.  At the moment, the character-mapping 
issue is the most prominent problem with using an otherwise very useful 
program (THANK YOU to all who have been building/maintaining this 
suite!), and it's frustrating to have the correct mappings already 
in-hand, but no way to tell pdftotext to incorporate them.  If 
pdftotext *could* accept such a table, there would be a strong 
incentive to crowdsource additions to it, since the mappings are of 
broad interest.  Collecting the data for such a table is quite 
labor-intensive, so making the results easy to put to use would go a 
long way toward motivating that work.

Thanks very much,
--Jeff

> Regards.
>
>>> N.B. PDFs might have attachments. In the past, I once came across a
>>> PDF without the font encodings - but with the source WinWord document
>>> attached. Worth checking.
>>>
>



