[poppler] garbled text from pdftohtml/pdftotext

Ihar `Philips` Filipau thephilips at gmail.com
Fri Apr 13 03:15:35 PDT 2012


A few findings, for anybody interested in recovering information from
PDFs whose embedded fonts have wrong or missing mappings to Unicode.

I have stumbled on another such PDF and decided to take a closer look
at the insane idea of OCRing the texts.

#1. As it turned out, the idea is not so insane. Probably just crazy.
If I'm reading the forums correctly, the situation has changed in the
last year: libtesseract got pretty stable and can even recognize
italic formatting. Tesseract is a pretty old, but rather good, OCR
tool developed by HP. Some years ago it was open-sourced, and one of
the fruits of open-sourcing it was repackaging it as a library. (Home
page: http://code.google.com/p/tesseract-ocr/ . On Debian systems:
`apt-get install libtesseract-dev` - but I haven't checked it in depth.)

With libtesseract in mind, it should be feasible to attempt the text
conversion even if neither a mapping to Unicode is supplied nor the
embedded font contains a mapping table: use FreeType to render the
text with the embedded font into an in-memory bitmap, feed the bitmap
to the OCR library, and match the OCR'ed text against the input string
to build a mapping table. The mapping table is needed because OCR is
pretty slow, so one should avoid calling it whenever the text can be
converted using the mapping table alone. The OCR itself should be
pretty reliable, since the input image would be properly aligned and
clean of the usual post-scanner garbage.
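The caching scheme above can be sketched in a few lines of Python. The
`render_glyph()` and `ocr_bitmap()` functions here are hypothetical
stand-ins for FreeType rasterization and a libtesseract call (the toy
lookup table inside `ocr_bitmap()` is purely illustrative, as is the
font name "F1"); only the cache logic is the point:

```python
def render_glyph(font, code):
    # Hypothetical stand-in: rasterize glyph `code` of `font` to a bitmap.
    return (font, code)

def ocr_bitmap(bitmap):
    # Hypothetical stand-in for the slow OCR call; a toy table
    # plays the role of the recognizer here.
    toy = {("F1", 0x8A): "f", ("F1", 0x8D): "i"}
    return toy.get(bitmap, "?")

class GlyphMapper:
    """Map font-specific character codes to Unicode, calling OCR
    only once per (font, code) pair and caching the result."""

    def __init__(self):
        self.table = {}      # (font, code) -> recognized text
        self.ocr_calls = 0   # how many times we actually hit OCR

    def map_char(self, font, code):
        key = (font, code)
        if key not in self.table:    # cache miss: render + OCR once
            self.ocr_calls += 1
            self.table[key] = ocr_bitmap(render_glyph(font, code))
        return self.table[key]

    def map_string(self, font, codes):
        return "".join(self.map_char(font, c) for c in codes)
```

Repeated codes then cost nothing: mapping the codes `[0x8A, 0x8D,
0x8A]` triggers only two OCR calls, since the second `0x8A` is served
from the table.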

#2. The mapping table inside the font was actually my second finding,
when I checked the FreeType library interface yesterday: a font can
contain its own charset mapping table(s). And I haven't found any
trace in poppler (but neither am I a specialist in its innards) of any
attempt to access the font's own mapping tables. But I might be
totally off here, since I have no experience with font rendering and
only a surface understanding of the purpose of the mapping tables
inside fonts.
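If the font's own cmap is available, using it amounts to inverting it:
the cmap maps codepoints to glyph indices, and extraction needs the
reverse direction. A minimal sketch, assuming the cmap has already
been read out (e.g. via FreeType's FT_Get_First_Char/FT_Get_Next_Char
iteration) into a plain dict of codepoint -> glyph index:

```python
def invert_cmap(cmap):
    """Build glyph index -> codepoint from a codepoint -> glyph dict.
    If several codepoints share one glyph, keep the lowest codepoint."""
    rev = {}
    for codepoint, glyph in sorted(cmap.items()):
        rev.setdefault(glyph, codepoint)
    return rev

def decode_via_font(glyph_ids, rev, fallback="?"):
    """Map extracted glyph indices back to text using the inverted cmap;
    glyphs the font does not name fall back to a placeholder."""
    return "".join(chr(rev[g]) if g in rev else fallback
                   for g in glyph_ids)
```

With such a reverse table, OCR would only be needed for the glyphs the
font itself leaves unnamed.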

#3. The handling of fonts inside pdftohtml (and in some parts of
poppler too) isn't very clean: font comparison disregards encoding.
IOW, two fonts are considered equivalent even if they have different
encodings. That means that in some scenarios, when text is extracted
and merged into lines, the information about the font's encoding is
lost if text using a font with a custom encoding is surrounded by text
using a font with a known encoding (and probably vice versa). I made a
fix for this in my private repo, and it has the (desired) side effect
of producing more line breaks. I have also made it so that characters
with a custom encoding are extracted as hex literals (`&#x..;`-style
character references). I can post the patch for pdftohtml if anybody's
interested.
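The two fixes described above can be illustrated with a toy model. The
`FontInfo` fields here are hypothetical simplifications (a real
poppler font object carries many more attributes); the point is that
the comparison includes the encoding, and that custom-encoded codes
are emitted as hex character references instead of bogus text:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FontInfo:
    name: str
    size: float
    encoding: str    # the attribute the original comparison ignored

def fonts_equal(a, b):
    # Fixed comparison: fonts with different encodings are NOT merged,
    # so line merging preserves the encoding boundary (and hence
    # produces the extra line breaks mentioned above).
    return (a.name, a.size, a.encoding) == (b.name, b.size, b.encoding)

def emit_char(code, known_encoding):
    # Custom-encoding characters come out as hex character references
    # rather than being passed through as if they were Latin text.
    return chr(code) if known_encoding else "&#x%02X;" % code
```

So two otherwise identical fonts with encodings "WinAnsi" and "Custom"
no longer compare equal, and a custom-encoded byte such as 0x8A is
emitted as `&#x8A;`.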


On 3/24/12, Ihar `Philips` Filipau <thephilips at gmail.com> wrote:
> On 3/24/12, suzuki toshiya <mpsuzuki at hiroshima-u.ac.jp> wrote:
>> I think so. If we restrict our scope to Latin script, there might
>> be some heuristic technology to restore the original text (I think
>> the number of unique glyphs in the document is less than 255 x 3).
>> If there are so many Latin script (or small charset) documents
>> that the texts cannot be extracted, some experts may be interested
>> in the solution for this issue. I think it's interesting theme for
>> some engineers (including me), but unfortunately, I don't have
>> sufficient sparetime to do it now, and I'm a CJK people :-).
> Sort of solution exists already: "print" to PNG and OCR. Because
> that's what it really is: guess which symbol of the font maps to which
> character. Provided that we have only the image of the character, that
> is the job of OCR to do it.
> It seems I had a luck and can guess meaning of those few symbols which
> are still garbled, but it seems that other different things are going
> on too in the document, e.g. capital letter C is "C\x8a\x8dX" and
> question mark is "C at PQX".
