[poppler] pdftotext feature request: user-specified toUnicode-like tables

Tue Jun 11 13:43:42 PDT 2013

On 6/11/13, Jeff Lerman <jclerman at jefflerman.net> wrote:
>
> Regarding #2 (pdftohtml solution): Currently, pdftohtml (version 0.22.4)
> does a poor job of indicating which character in a PDF is in which
> font.

I've been using pdftohtml almost exclusively with the "-xml" option.
There you have the font id. (But an additional hack was required, since
the font comparison is pretty lax and erroneously merges similar
fonts, even if one has a toUnicode table and the other doesn't.)

> The font indicated seems to be more on a per-word basis.  Some
> pdftohtml cleanup is definitely needed there.

In my case that was an advantage, since the output was later fed to
another program to recover some structure from the documents. Text
reassembly is easy in comparison, and was required anyway to reverse
the effects of, for example, text justification.
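
For illustration only, here is a minimal Python sketch of that kind of
post-processing. It is not the program mentioned above. The SAMPLE
string is hand-written and only modelled on "pdftohtml -xml" output
(fontspec elements carrying an id, text elements referring to it via a
"font" attribute); the exact attributes may differ between poppler
versions, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hand-written sample modelled on "pdftohtml -xml" output (an assumption,
# not captured from a real run).
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="0" size="12" family="Times" color="#000000"/>
<fontspec id="1" size="12" family="AdvPi1" color="#000000"/>
<text top="100" left="50" width="40" height="12" font="0">The angle</text>
<text top="100" left="95" width="8" height="12" font="1">a</text>
<text top="100" left="108" width="60" height="12" font="0">is small.</text>
<text top="120" left="50" width="80" height="12" font="0">Next line.</text>
</page>
</pdf2xml>"""


def fonts_by_id(root):
    """Map each fontspec id to its font family name."""
    return {f.get("id"): f.get("family") for f in root.iter("fontspec")}


def lines_with_fonts(root):
    """Group text fragments into lines by (page, top), ordered by left.

    Returns a list of lines; each line is a list of (font family, text).
    """
    fonts = fonts_by_id(root)
    rows = defaultdict(list)
    for page in root.iter("page"):
        for t in page.iter("text"):
            key = (int(page.get("number")), int(t.get("top")))
            # itertext() also collects text nested in <b>/<i> children.
            text = "".join(t.itertext())
            rows[key].append((int(t.get("left")), fonts.get(t.get("font")), text))
    result = []
    for _, frags in sorted(rows.items()):
        frags.sort(key=lambda frag: frag[0])
        result.append([(font, text) for _, font, text in frags])
    return result


if __name__ == "__main__":
    for line in lines_with_fonts(ET.fromstring(SAMPLE)):
        print(line)
```

Grouping on an exact "top" value is the simplest possible choice; real
documents would likely need a tolerance, since justified or
superscripted text rarely shares identical coordinates.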

> Unfortunately, I am not really a C++ programmer, so minor code edits and
> rebuilds are within my skillset, but significant enhancements/rewrites
> are not.

Bummer. Anyway, it seems that I'm not able to find the precise branch
which I was using for the work. I do not think I have kept it all
inside git.

Looking at the code, I do recall that I was detecting custom encodings
by checking the GfxFont::getFontEncoding() property. But I definitely
remember that there was more to it, and that I was tweaking poppler for
certain documents.

> If you have PDF examples where a single glyph is represented using
> multiple character codes, that would be interesting to see - but would
> not be a problem for a remapping algorithm (and I can imagine cases
> where it would happen; in fact it essentially does happen already in
> Unicode).  Many-to-one is easy.  One-to-many would obviously be
> problematic - are you saying you've seen that too?  I thought that would
> be impossible, assuming a font-aware algorithm.

Many-to-one exclusively. The PDFs were deleted and forgotten as soon
as I was done with them. In several cases I simply gave up, because
the activity was such a huge waste of time. At first I also suspected
some form of UTF-8, but IIRC in one of the documents the French 'ç'
(the word "façade" was used often) was represented with something
like 3 characters, whereas it is 2 bytes in UTF-8. Believe me, back
then I tried many, many ways to reverse-engineer the encoding of the
font. (I have just found that I still have the "tesseract" OCR
installed, along with my scripts which tried to use the OCR to rebuild
the Unicode map. And I see that it was fruitless: OCR was failing on
fancy modern punctuation and on bold/italic recognition.)
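
To make the many-to-one case concrete, here is a small Python sketch of
a remapping pass that lets several raw character codes stand for one
output character. The code values in REMAP are made up for the example
and do not come from any real font; the greedy longest-match strategy
and the function name are likewise assumptions. Codes that are not in
the table are left in the &#<code>; notation mentioned in the quoted
message below.

```python
# Hypothetical per-document table: sequences of raw character codes
# mapped to the text they should become. All values are invented.
REMAP = {
    (0x66,): "f",
    (0x61,): "a",
    (0x01, 0x02, 0x03): "ç",  # three codes standing for one character
    (0x64,): "d",
    (0x65,): "e",
}


def remap(codes, table):
    """Translate a list of raw codes using greedy longest-match lookup."""
    max_len = max(len(key) for key in table)
    out = []
    i = 0
    while i < len(codes):
        # Try the longest candidate sequence first, then shorter ones.
        for n in range(min(max_len, len(codes) - i), 0, -1):
            key = tuple(codes[i:i + n])
            if key in table:
                out.append(table[key])
                i += n
                break
        else:
            # Unknown code: keep it visible as a numeric reference.
            out.append("&#%d;" % codes[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(remap([0x66, 0x61, 0x01, 0x02, 0x03, 0x61, 0x64, 0x65], REMAP))
    # For comparison: 'ç' really is two bytes in UTF-8.
    print(len("ç".encode("utf-8")))
```

Such a table would have to be kept per document, or at least per
embedded font, because subsetted fonts may assign different codes to
the same glyph.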

BTW, I found something else: pdftotext has the "-bbox" option. However,
like "pdftohtml -xml", it also requires "manual" reassembly of the text
afterwards. With a very simple hack, one can also add the font name,
size and style to the "-bbox" output; they are stored in the TextWord
properties. I still have that change (attached), though it is for a
very old version of poppler. It would probably help.
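
As a rough idea of what the "manual" reassembly could look like, here
is a Python sketch that reads "pdftotext -bbox" style output and
rebuilds lines from word boxes. The SAMPLE is hand-written. The
xMin/yMin/xMax/yMax attributes follow the stock "-bbox" output, while
"fontName" and "fontSize" are purely hypothetical stand-ins for
whatever a patched build might emit; they are not taken from the
attached diff. The tolerance value is also an arbitrary assumption.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"
BOX_KEYS = ("xMin", "yMin", "xMax", "yMax")

# Hand-written sample; fontName/fontSize are hypothetical extra attributes.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<body><doc>
<page width="595.000000" height="842.000000">
<word xMin="50.0" yMin="100.0" xMax="80.0" yMax="112.0" fontName="Times" fontSize="12">The</word>
<word xMin="85.0" yMin="100.5" xMax="130.0" yMax="112.5" fontName="Times" fontSize="12">façade</word>
<word xMin="50.0" yMin="120.0" xMax="90.0" yMax="132.0" fontName="Times" fontSize="12">Next</word>
</page>
</doc></body></html>"""


def words(root):
    """Yield (box, extra attributes, text) for every word element."""
    for w in root.iter(XHTML + "word"):
        box = tuple(float(w.get(k)) for k in BOX_KEYS)
        extra = {k: v for k, v in w.attrib.items() if k not in BOX_KEYS}
        yield box, extra, w.text or ""


def lines(root, tol=2.0):
    """Group words whose yMin lies within tol of the row's first word."""
    rows = []  # each row: [reference yMin, [(xMin, text), ...]]
    for box, _extra, text in sorted(words(root), key=lambda w: w[0][1]):
        if rows and abs(box[1] - rows[-1][0]) <= tol:
            rows[-1][1].append((box[0], text))
        else:
            rows.append([box[1], [(box[0], text)]])
    return [" ".join(t for _, t in sorted(row[1], key=lambda p: p[0]))
            for row in rows]


if __name__ == "__main__":
    print(lines(ET.fromstring(SAMPLE)))
```

Keeping the extra attributes in a separate dictionary means the same
reader works with both stock and patched output.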

And good luck with your PDFs. You definitely need it.

P.S. Just for the sake of experiment: open a few of the PDFs without
encodings in a recent Acrobat Reader, "Select All", "Copy", switch to a
word processor and "Paste". If the text in the word processor looks
as expected, not garbled, then you have a rare "tagged PDF" on your
hands. Very unlikely, but worth a try. The "tags" allow extra
information, such as formatted text, to be stored in the PDF, which
Acrobat can then extract.

N.B. PDFs might have attachments. In the past, I once came across a
PDF without the font encodings - but with the source WinWord document
attached. Worth checking.

> On 6/11/2013 10:06 AM, Ihar `Philips` Filipau wrote:
>> Hi!
>>
>> #1.
>> You can't make a global per-font table, as you envision it. The
>> embedded fonts often include only the required symbols, meaning that
>> embedded versions of the same font might and do differ from document
>> to document - and consequently the character codes differ too.
>>
>> #2.
>> I worked on something similar a long time ago. What I did was to modify
>> pdftohtml to print the characters of fonts without a Unicode mapping
>> as raw codes, in the XML/HTML notation: &#<code>; (I can't remember
>> right now what trick I used to differentiate the fonts.) Finally, I
>> replaced the codes with real characters semi-manually.
>>
>>
>>> If there is a stopgap method by which I could add such info to
>>> Poppler source somewhere and then recompile (hard-coding the table),
>>> please let me know - I'm fine with that for short-term use though I
>>> think a runtime table would be much much more flexible and useful.
>> I will try to locate my sources.
>> That would at least give you hints where to plug the tables.
>> But due to #1, you shouldn't trust too much such automated conversions.
>>
>> P.S. I have also seen the effect where a single character was, for
>> whatever reason, represented with *multiple* character codes. IOW, with
>> some documents a character code -> Unicode translation isn't possible,
>> as it would leave some garbage in the document.
>>
>> On 6/11/13, Jeff Lerman <jclerman at jefflerman.net> wrote:
>>> Hi,
>>>
>>> This is my first post to the list, and I apologize in advance for any
>>> naivete revealed by my question.  However:
>>>
>>> BACKGROUND:
>>> I have a project for which my team is extracting text from a large
>>> number (~100K) of PDF files from scientific publications.  These PDFs
>>> come from a wide variety of sources.  They often use obscure-sounding
>>> fonts for symbols, and those fonts do not seem to include toUnicode data
>>> in the PDFs themselves.  The mapping in these fonts is not obvious and
>>> needs to be determined on a case-by-case (often character-by-character
>>> when the font info is unavailable online) basis.
>>>
>>> I have been accumulating my own table of character mappings for those
>>> fonts, focusing on characters of most interest to our team (certain
>>> symbols).  I would like to be able to apply that table during
>>> text-extraction by pdftotext, but I don't see any way to do that
>>> currently.  Since complaints about obscure non-documented font/character
>>> mappings are common online, application of such a table seems like
>>> something that would be of potentially broad interest.
>>>
>>> REQUEST:
>>> Ideally, I'd like to be able to take a 3-column table (see below) that I
>>> have built and supply it to pdftotext at runtime.  The table would be
>>> applied in cases where a given character from a given font appears in a
>>> PDF, no toUnicode table is supplied in the PDF, and the character does
>>> appear in the supplied table (characters missing from the table would
>>> continue to be extracted the way pdftotext does it today - i.e.,
>>> characters missing from the table should have no effect).
>>>
>>> The table would simply be a tab-delimited 3-column file with:
>>> 1. fontname, e.g. AdvP4C4E74 or AdvPi1 or YMath-Pack-Four, but NOT
>>> things like NJIBIE+YMath-Pack-Four
>>> 2. font character (could supply an actual character, or a hexadecimal
>>> codepoint)
>>> 3. desired Unicode mapping (again - could be an actual character or a
>>> codepoint)
>>>
>>> Exact table format isn't a big deal, but the above info is all that
>>> should be needed.
>>>
>>> If there is *already* a way to do this in pdftotext, please let me
>>> know.  If there is a stopgap method by which I could add such info to
>>> Poppler source somewhere and then recompile (hard-coding the table),
>>> please let me know - I'm fine with that for short-term use though I
>>> think a runtime table would be much much more flexible and useful.
>>>
>>> Thanks!
>>> --Jeff Lerman

-- 
Don't walk behind me, I may not lead.
Don't walk in front of me, I may not follow.
Just walk beside me and be my friend.
    -- Albert Camus (attributed to)
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pdftotext-bbox.diff
Type: application/octet-stream
Size: 1483 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/poppler/attachments/20130611/7da76423/attachment.obj>