[poppler] pdftotext feature request: user-specified toUnicode-like tables
Jeff Lerman
jclerman at jefflerman.net
Tue Jun 11 14:50:46 PDT 2013
On 6/11/2013 1:43 PM, Ihar `Philips` Filipau wrote:
> On 6/11/13, Jeff Lerman <jclerman at jefflerman.net> wrote:
>> Regarding #2 (pdftohtml solution): Currently, pdftohtml (version 0.22.4)
>> does a poor job of indicating which character in a PDF is in which
>> font.
> I've been using pdftohtml almost exclusively with the "-xml" option.
> There you have the font id. (But an additional hack was required,
> since the font comparison is pretty lax and erroneously merges
> similar fonts, even if one has a toUnicode table and the other
> doesn't.)
>
>> The font seems to be indicated on a per-word basis. Some pdftohtml
>> cleanup is definitely needed there.
> In my case that was an advantage, since the output was later fed to
> another program to recover some structure from the documents. Text
> reassembly is easy in comparison and was required anyway to reverse
> the effects of, for example, text justification.
Yes, indicating words is an advantage - but failing to indicate which
font a given character in a word uses is a bug. An example might
help. To extract the true information, I am using a trial version of
PDFlib TET. I ask it for an XML representation of my PDF, using
--tetml wordplus - which shows each word AND per-character
information within each word. Here is a snippet of what I get:
<Word>
<Text>3849110KbC</Text>
<Box llx="482.25" lly="438.00" urx="533.75" ury="447.00">
<Glyph font="F0" size="9" x="482.25" y="438.00" width="4.50">3</Glyph>
<Glyph font="F0" size="9" x="486.75" y="438.00" width="4.50">8</Glyph>
<Glyph font="F0" size="9" x="491.25" y="438.00" width="4.50">4</Glyph>
<Glyph font="F0" size="9" x="495.75" y="438.00" width="4.50">9</Glyph>
<Glyph font="F3" size="9" x="500.25" y="438.00" width="7.50">1</Glyph>
<Glyph font="F0" size="9" x="507.75" y="438.00" width="4.50">1</Glyph>
<Glyph font="F0" size="9" x="512.25" y="438.00" width="4.50">0</Glyph>
<Glyph font="F0" size="9" x="516.75" y="438.00" width="6.50">K</Glyph>
<Glyph font="F0" size="9" x="523.25" y="438.00" width="4.50">b</Glyph>
<Glyph font="F0" size="9" x="527.75" y="438.00" width="6.00">C</Glyph>
</Box>
</Word>
Note that the first numeral "1" is in font "F3" but the other characters
in that word are in font "F0". In this case, "F3" is the font
"MathematicalPi-One". In that font, the character encoded as "1"
actually has a glyph that looks like a plus sign. (Yuck - but I digress.)
The actual PDF, displayed by Acrobat Reader, shows this word as
"3849+10KbC".
Now, when I run pdftohtml with the following command:

pdftohtml -xml -fontfullname -s -i MYFILE.pdf

I get a file that includes:
<fontspec id="7" size="11" family="Times-Bold" color="#000000"/>
<fontspec id="8" size="11" family="Times-Roman" color="#000000"/>
.
.
.
<text top="532" left="68" width="756" height="12" font="8">males,
40 females, mean age 40616 years), 45 suffering from idiopathic chronic
pancreatitis and 54 from acute recurrent pancreatitis.</text>
<text top="549" left="81" width="742" height="12"
font="7"><b>Methods. </b>Each subject was screened for the 18 CFTR
mutations: DF508, DI507, R1162X, 2183AA.G, 21303K, 3849110KbC.T,</text>
<text top="565" left="67" width="756" height="12" font="8">G542X,
1717-1G.A, R553X, Q552X, G85E, 71115G.A, 3132delTG, 278915G.A, W1282X,
R117H, R347P, R352Q), which cover</text>
As you can see, the font "MathematicalPi-One" is not noted as the
correct one for that numeral "1". There is no way to find out the
actual fonts used, on a per-character basis, for the text in the PDF
file. Of course, pdftotext -bbox provides no font info at all.
So, that's what I mean about pdftohtml being buggy - it gives an
unreliable indication of which font was used for each character.
>> Unfortunately, I am not really a C++ programmer, so minor code edits and
>> rebuilds are within my skillset, but significant enhancements/rewrites
>> are not.
> Bummer. Anyway, it seems I'm not able to find the precise branch I
> was using for that work. I don't think I kept it all inside git.
>
> Looking at the code, I do recall that I was detecting custom
> encodings by checking the GfxFont::getFontEncoding() property. But I
> definitely remember there was more to it; I was also tweaking
> poppler for certain documents.
Hmm, OK. Looking at the code, I'm a little concerned that assumptions
about how to map a character from a given font are made on a
whole-font basis, not per-character. I'm not sure whether the
algorithms that convert a PDF character to Unicode for pdftotext
support any fallback mechanism. For example, if a document has font X
and I know that character A in that font should be remapped to Z, but
I have no information on some other character B, I want to be able to
specify the A->Z remapping without affecting whatever default is used
to show the B character. If the code simply looks for the existence
of a certain kind of translation table for each font and then assumes
that the table is always complete, that would be sub-optimal for my
use case. Can someone shed light on that question?
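To make the fallback behavior I'm after concrete, here is a toy
sketch in Python (hypothetical names and a plain dict - not poppler's
actual API or data structures):

# Consult the user-supplied table only for (font, code) pairs it
# actually contains; otherwise leave today's behavior untouched.
def map_char(font_name, code, custom_table, default_map):
    entry = custom_table.get((font_name, code))
    if entry is not None:
        return entry
    # fall back to whatever pdftotext does today for this character
    return default_map(font_name, code)

table = {("MathematicalPi-One", 0x31): "+"}
print(map_char("MathematicalPi-One", 0x31, table, lambda f, c: chr(c)))  # "+"
print(map_char("MathematicalPi-One", 0x32, table, lambda f, c: chr(c)))  # "2"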
>> If you have PDF examples where a single glyph is represented using
>> multiple character codes, that would be interesting to see - but would
>> not be a problem for a remapping algorithm (and I can imagine cases
>> where it would happen; in fact it essentially does happen already in
>> Unicode). Many-to-one is easy. One-to-many would obviously be
>> problematic - are you saying you've seen that too? I thought that would
>> be impossible, assuming a font-aware algorithm.
> Many-to-one exclusively. The PDFs were deleted and forgotten as soon
> as I was done with them. In several cases I simply gave up, given
> what a huge waste of time the activity was. At first I also
> suspected some UTF-8, but IIRC in one of the documents the French
> 'ç' (the word "façade" was used often) was represented with
> something like 3 characters. But it is 2 bytes in UTF-8. Believe me,
> I have
Note that accented characters can be represented several valid ways
in Unicode - sometimes as a single precomposed character, other times
as separate code points for the base character and the accent. It
might be that you were seeing 1 byte for "c" and 2 bytes for the
combining cedilla. Not the most concise representation, but totally
valid.
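A quick way to see this in Python - the decomposed form is exactly 3
UTF-8 bytes, which would match what you saw:

import unicodedata

nfc = "\u00e7"                           # "ç", precomposed
nfd = unicodedata.normalize("NFD", nfc)  # "c" + U+0327 combining cedilla
print(len(nfc.encode("utf-8")))          # 2 bytes
print(len(nfd.encode("utf-8")))          # 3 bytes: 1 for "c", 2 for the accent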
> tried back then many, many ways, trying to reverse-engineer the
> encoding of the font. (I just found that I still have the
> "tesseract" OCR installed - along with my scripts that tried to use
> OCR to rebuild the Unicode map. And I see that it was fruitless: the
> OCR was failing on fancy modern punctuation and on bold/italic
> recognition.)
>
>
> BTW, I found something else: pdftotext has the "-bbox" option.
> Though, similarly to "pdftohtml -xml", it also requires "manual"
> reassembly of the text later. With a very simple hack, one can also
> add the font name, size and style to the "-bbox" output - they are
> stored in the TextWord properties. I have that change (attached),
> though it is for a very old version of poppler. It might help.
Thanks! But: does your hack show the font name etc. for each
character in each word, or just one value per word? The former is
what I'd need...
>
> And good luck with your PDFs. You definitely need it.
>
>
> P.S. Just for the sake of experiment: open a few of the PDFs without
> encodings in a recent Acrobat Reader, "Select All", "Copy", switch
> to a word processor and "Paste". If the text in the word processor
> looks as expected, not garbled, then you have a rare "tagged PDF" on
> your hands. Very unlikely, but worth a try. The "tags" allow extra
> information, such as formatted text, to be stored in the PDF, which
> Acrobat can extract.
Right, thanks; I've done that. No go. Even if some of my PDFs are
tagged, the vast majority are not - they come from a wide range of
publishers and vintages.
Best,
--Jeff
>
> N.B. PDFs might have attachments. I once came across a PDF without
> the font encodings - but with the source WinWord document attached.
> Worth checking.
>
>
>> On 6/11/2013 10:06 AM, Ihar `Philips` Filipau wrote:
>>> Hi!
>>>
>>> #1.
>>> You can't make a global per-font table as you envision it. The
>>> embedded fonts often include only the required symbols, meaning
>>> that embedded versions of the same font can and do differ from
>>> document to document - and consequently the character codes differ
>>> too.
>>>
>>> #2.
>>> I worked on something similar a long time ago. What I did was
>>> modify pdftohtml to print the characters of fonts without a
>>> Unicode mapping as raw codes, in the XML/HTML notation &#<code>;
>>> (I can't remember right now what trick I used to differentiate the
>>> fonts.) Finally, I semi-manually replaced the codes with real
>>> characters.
>>>
>>>
>>>> If there is a stopgap method by which I could add such info to
>>>> Poppler source somewhere and then recompile (hard-coding the table),
>>>> please let me know - I'm fine with that for short-term use though I
>>>> think a runtime table would be much much more flexible and useful.
>>> I will try to locate my sources.
>>> That would at least give you hints about where to plug in the
>>> tables. But due to #1, you shouldn't trust such automated
>>> conversions too much.
>>>
>>> P.S. I have also seen the effect where a single character was, for
>>> whatever reason, represented with *multiple* character codes. IOW,
>>> with some documents a character code -> Unicode translation isn't
>>> possible, as it would leave some garbage in the document.
>>>
>>> On 6/11/13, Jeff Lerman <jclerman at jefflerman.net> wrote:
>>>> Hi,
>>>>
>>>> This is my first post to the list, and I apologize in advance for any
>>>> naivete revealed by my question. However:
>>>>
>>>> BACKGROUND:
>>>> I have a project for which my team is extracting text from a large
>>>> number (~100K) of PDF files from scientific publications. These PDFs
>>>> come from a wide variety of sources. They often use obscure-sounding
>>>> fonts for symbols, and those fonts do not seem to include toUnicode data
>>>> in the PDFs themselves. The mapping in these fonts is not obvious and
>>>> needs to be determined on a case-by-case (often character-by-character
>>>> when the font info is unavailable online) basis.
>>>>
>>>> I have been accumulating my own table of character mappings for those
>>>> fonts, focusing on characters of most interest to our team (certain
>>>> symbols). I would like to be able to apply that table during
>>>> text-extraction by pdftotext, but I don't see any way to do that
>>>> currently. Since complaints about obscure non-documented font/character
>>>> mappings are common online, application of such a table seems like
>>>> something that would be of potentially broad interest.
>>>>
>>>> REQUEST:
>>>> Ideally, I'd like to be able to take a 3-column table (see below) that I
>>>> have built and supply it to pdftotext at runtime. The table would be
>>>> applied in cases where a given character from a given font appears in a
>>>> PDF, no toUnicode table is supplied in the PDF, and the character does
>>>> appear in the supplied table (characters missing from the table would
>>>> continue to be extracted the way pdftotext does it today - i.e.,
>>>> characters missing from the table should have no effect).
>>>>
>>>> The table would simply be a tab-delimited 3-column file with:
>>>> 1. fontname, e.g. AdvP4C4E74 or AdvPi1 or YMath-Pack-Four, but NOT
>>>> things like NJIBIE+YMath-Pack-Four
>>>> 2. font character (could supply an actual character, or a hexadecimal
>>>> codepoint)
>>>> 3. desired Unicode mapping (again - could be an actual character or a
>>>> codepoint)
>>>>
>>>> Exact table format isn't a big deal, but the above info is all that
>>>> should be needed.
>>>>
>>>> If there is *already* a way to do this in pdftotext, please let me
>>>> know. If there is a stopgap method by which I could add such info to
>>>> Poppler source somewhere and then recompile (hard-coding the table),
>>>> please let me know - I'm fine with that for short-term use though I
>>>> think a runtime table would be much much more flexible and useful.
>>>>
>>>> Thanks!
>>>> --Jeff Lerman
>>>>