[poppler] pdftotext feature request: user-specified toUnicode-like tables
Jeff Lerman
jclerman at jefflerman.net
Tue Jun 11 14:50:46 PDT 2013
On 6/11/2013 1:43 PM, Ihar `Philips` Filipau wrote:
> On 6/11/13, Jeff Lerman <jclerman at jefflerman.net> wrote:
>> Regarding #2 (pdftohtml solution): Currently, pdftohtml (version 0.22.4)
>> does a poor job of indicating which character in a PDF is in which
>> font.
> I've been using pdftohtml almost exclusively with the "-xml" option.
> There you have the font id. (But an additional hack was required,
> since the font comparison is pretty lax and erroneously merges
> similar fonts, even if one has a toUnicode table and the other
> doesn't.)
>
>> The font seems to be indicated on a per-word basis. Some pdftohtml
>> cleanup is definitely needed there.
> In my case that was an advantage, since the output was later fed to
> another program to recover some structure from the documents. Text
> reassembly is easy in comparison and was required anyway to reverse
> the effects of, for example, text justification.
Yes, indicating words is an advantage - but failing to indicate which
font a given character in a word uses is a bug. An example might
help. To extract the true information, I am using a trial version of
PDFlib TET. I ask it for an XML representation of my PDF, using
--tetml wordplus - which shows each word AND per-character
information within each word. Here is a snippet of what I get:
<Word>
<Text>3849110KbC</Text>
<Box llx="482.25" lly="438.00" urx="533.75" ury="447.00">
<Glyph font="F0" size="9" x="482.25" y="438.00" width="4.50">3</Glyph>
<Glyph font="F0" size="9" x="486.75" y="438.00" width="4.50">8</Glyph>
<Glyph font="F0" size="9" x="491.25" y="438.00" width="4.50">4</Glyph>
<Glyph font="F0" size="9" x="495.75" y="438.00" width="4.50">9</Glyph>
<Glyph font="F3" size="9" x="500.25" y="438.00" width="7.50">1</Glyph>
<Glyph font="F0" size="9" x="507.75" y="438.00" width="4.50">1</Glyph>
<Glyph font="F0" size="9" x="512.25" y="438.00" width="4.50">0</Glyph>
<Glyph font="F0" size="9" x="516.75" y="438.00" width="6.50">K</Glyph>
<Glyph font="F0" size="9" x="523.25" y="438.00" width="4.50">b</Glyph>
<Glyph font="F0" size="9" x="527.75" y="438.00" width="6.00">C</Glyph>
</Box>
</Word>
Note that the first numeral "1" is in font "F3" but the other characters
in that word are in font "F0". In this case, "F3" is the font
"MathematicalPi-One". In that font, the character encoded as "1"
actually has a glyph that looks like a plus sign. (Yuck - but I digress.)
The actual PDF, displayed by Acrobat Reader, shows this word as
"3849+10KbC".
Now, when I run pdftohtml with the following command:

pdftohtml -xml -fontfullname -s -i MYFILE.pdf

I get a file that includes:
<fontspec id="7" size="11" family="Times-Bold" color="#000000"/>
<fontspec id="8" size="11" family="Times-Roman" color="#000000"/>
.
.
.
<text top="532" left="68" width="756" height="12" font="8">males,
40 females, mean age 40616 years), 45 suffering from idiopathic chronic
pancreatitis and 54 from acute recurrent pancreatitis.</text>
<text top="549" left="81" width="742" height="12"
font="7"><b>Methods. </b>Each subject was screened for the 18 CFTR
mutations: DF508, DI507, R1162X, 2183AA.G, 21303K, 3849110KbC.T,</text>
<text top="565" left="67" width="756" height="12" font="8">G542X,
1717-1G.A, R553X, Q552X, G85E, 71115G.A, 3132delTG, 278915G.A, W1282X,
R117H, R347P, R352Q), which cover</text>
As you can see, the font "MathematicalPi-One" is not noted as the
correct one for that numeral "1". There is no way to find out the
actual fonts used, on a per-character basis, for the text in the PDF
file. Of course, pdftotext -bbox provides no font info at all.
So, that's what I mean about pdftohtml being buggy - it gives an
unreliable indication of which font was used for each character.
>> Unfortunately, I am not really a C++ programmer, so minor code edits and
>> rebuilds are within my skillset, but significant enhancements/rewrites
>> are not.
> Bummer. Anyway, it seems I'm not able to find the precise branch I
> was using for that work. I don't think I kept it all inside git.
>
> Looking at the code, I do recall that I was detecting custom
> encodings by checking the GfxFont::getFontEncoding() property. But I
> definitely remember there was more to it; I was also tweaking
> poppler for certain documents.
Hmm, OK. Looking at the code, I'm a little concerned that assumptions
about how to map a character from a given font are made on a
whole-font basis, not per-character. I'm not sure whether the
algorithms that convert a PDF character to Unicode for pdftotext
support any fallback mechanism. For example, if a document has font X
and I know that character A in that font should be remapped to Z, but
I have no information on some other character B, I want to be able to
specify the A->Z remapping without affecting whatever default is used
to show the B character. If the code simply looks for the existence
of a certain kind of translation table for each font and then assumes
that the table is always complete, that would be sub-optimal for my
use case. Can someone shed light on that question?
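To make the fallback behavior I'm after concrete, here is a toy
sketch in Python (hypothetical names and a plain dict - not poppler's
actual API or data structures):

# Consult the user-supplied table only for (font, code) pairs it
# actually contains; otherwise leave today's behavior untouched.
def map_char(font_name, code, custom_table, default_map):
    entry = custom_table.get((font_name, code))
    if entry is not None:
        return entry
    # fall back to whatever pdftotext does today for this character
    return default_map(font_name, code)

table = {("MathematicalPi-One", 0x31): "+"}
print(map_char("MathematicalPi-One", 0x31, table, lambda f, c: chr(c)))  # "+"
print(map_char("MathematicalPi-One", 0x32, table, lambda f, c: chr(c)))  # "2"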
>> If you have PDF examples where a single glyph is represented using
>> multiple character codes, that would be interesting to see - but would
>> not be a problem for a remapping algorithm (and I can imagine cases
>> where it would happen; in fact it essentially does happen already in
>> Unicode). Many-to-one is easy. One-to-many would obviously be
>> problematic - are you saying you've seen that too? I thought that would
>> be impossible, assuming a font-aware algorithm.
> Many-to-one exclusively. The PDFs were deleted and forgotten as soon
> as I was done with them. In several cases I simply gave up, given
> what a huge waste of time the activity was. At first I also
> suspected some UTF-8, but IIRC in one of the documents the French
> 'ç' (the word "façade" was used often) was represented with
> something like 3 characters. But it is 2 bytes in UTF-8. Believe me,
> I have
Note that accented characters can be represented several valid ways
in Unicode - sometimes as a single precomposed character, other times
as separate code points for the base character and the accent. It
might be that you were seeing 1 byte for "c" and 2 bytes for the
combining cedilla. Not the most concise representation, but totally
valid.
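A quick way to see this in Python - the decomposed form is exactly 3
UTF-8 bytes, which would match what you saw:

import unicodedata

nfc = "\u00e7"                           # "ç", precomposed
nfd = unicodedata.normalize("NFD", nfc)  # "c" + U+0327 combining cedilla
print(len(nfc.encode("utf-8")))          # 2 bytes
print(len(nfd.encode("utf-8")))          # 3 bytes: 1 for "c", 2 for the accent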
> tried back then many, many ways, trying to reverse-engineer the
> encoding of the font. (I just found that I still have the
> "tesseract" OCR installed - along with my scripts that tried to use
> OCR to rebuild the Unicode map. And I see that it was fruitless: the
> OCR was failing on fancy modern punctuation and on bold/italic
> recognition.)
>
>
> BTW, I found something else: pdftotext has the "-bbox" option.
> Though, similarly to "pdftohtml -xml", it also requires "manual"
> reassembly of the text later. With a very simple hack, one can also
> add the font name, size and style to the "-bbox" output - they are
> stored in the TextWord properties. I have that change (attached),
> though it is for a very old version of poppler. It might help.
Thanks! But: does your hack show the font name etc. for each
character in each word, or just one value per word? The former is
what I'd need...
>
> And good luck with your PDFs. You definitely need it.
>
>
> P.S. Just for the sake of experiment: open a few of the PDFs without
> encodings in a recent Acrobat Reader, "Select All", "Copy", switch
> to a word processor and "Paste". If the text in the word processor
> looks as expected, not garbled, then you have a rare "tagged PDF" on
> your hands. Very unlikely, but worth a try. The "tags" allow extra
> information, such as formatted text, to be stored in the PDF, which
> Acrobat can extract.
Right, thanks; I've done that. No go. Even if some of my PDFs are
tagged, the vast majority are not - they come from a wide range of
publishers and vintages.
Best,
--Jeff
>
> N.B. PDFs might have attachments. I once came across a PDF without
> the font encodings - but with the source WinWord document attached.
> Worth checking.
>
>
>> On 6/11/2013 10:06 AM, Ihar `Philips` Filipau wrote:
>>> Hi!
>>>
>>> #1.
>>> You can't make a global per-font table as you envision it. The
>>> embedded fonts often include only the required symbols, meaning
>>> that embedded versions of the same font can and do differ from
>>> document to document - and consequently the character codes differ
>>> too.
>>>
>>> #2.
>>> I worked on something similar a long time ago. What I did was
>>> modify pdftohtml to print the characters of fonts without a
>>> Unicode mapping as raw codes, in the XML/HTML notation &#<code>;
>>> (I can't remember right now what trick I used to differentiate the
>>> fonts.) Finally, I semi-manually replaced the codes with real
>>> characters.
>>>
>>>
>>>> If there is a stopgap method by which I could add such info to
>>>> Poppler source somewhere and then recompile (hard-coding the table),
>>>> please let me know - I'm fine with that for short-term use though I
>>>> think a runtime table would be much much more flexible and useful.
>>> I will try to locate my sources.
>>> That would at least give you hints about where to plug in the
>>> tables. But due to #1, you shouldn't trust such automated
>>> conversions too much.
>>>
>>> P.S. I have also seen the effect where a single character was, for
>>> whatever reason, represented with *multiple* character codes. IOW,
>>> with some documents a character code -> Unicode translation isn't
>>> possible, as it would leave some garbage in the document.
>>>
>>> On 6/11/13, Jeff Lerman <jclerman at jefflerman.net> wrote:
>>>> Hi,
>>>>
>>>> This is my first post to the list, and I apologize in advance for any
>>>> naivete revealed by my question. However:
>>>>
>>>> BACKGROUND:
>>>> I have a project for which my team is extracting text from a large
>>>> number (~100K) of PDF files from scientific publications. These PDFs
>>>> come from a wide variety of sources. They often use obscure-sounding
>>>> fonts for symbols, and those fonts do not seem to include toUnicode data
>>>> in the PDFs themselves. The mapping in these fonts is not obvious and
>>>> needs to be determined on a case-by-case (often character-by-character
>>>> when the font info is unavailable online) basis.
>>>>
>>>> I have been accumulating my own table of character mappings for those
>>>> fonts, focusing on characters of most interest to our team (certain
>>>> symbols). I would like to be able to apply that table during
>>>> text-extraction by pdftotext, but I don't see any way to do that
>>>> currently. Since complaints about obscure non-documented font/character
>>>> mappings are common online, application of such a table seems like
>>>> something that would be of potentially broad interest.
>>>>
>>>> REQUEST:
>>>> Ideally, I'd like to be able to take a 3-column table (see below) that I
>>>> have built and supply it to pdftotext at runtime. The table would be
>>>> applied in cases where a given character from a given font appears in a
>>>> PDF, no toUnicode table is supplied in the PDF, and the character does
>>>> appear in the supplied table (characters missing from the table would
>>>> continue to be extracted the way pdftotext does it today - i.e.,
>>>> characters missing from the table should have no effect).
>>>>
>>>> The table would simply be a tab-delimited 3-column file with:
>>>> 1. fontname, e.g. AdvP4C4E74 or AdvPi1 or YMath-Pack-Four, but NOT
>>>> things like NJIBIE+YMath-Pack-Four
>>>> 2. font character (could supply an actual character, or a hexadecimal
>>>> codepoint)
>>>> 3. desired Unicode mapping (again - could be an actual character or a
>>>> codepoint)
>>>>
>>>> Exact table format isn't a big deal, but the above info is all that
>>>> should be needed.
>>>>
>>>> If there is *already* a way to do this in pdftotext, please let me
>>>> know. If there is a stopgap method by which I could add such info to
>>>> Poppler source somewhere and then recompile (hard-coding the table),
>>>> please let me know - I'm fine with that for short-term use though I
>>>> think a runtime table would be much much more flexible and useful.
>>>>
>>>> Thanks!
>>>> --Jeff Lerman
>>>>