[poppler] pdftotext feature request: user-specified toUnicode-like tables

Jeff Lerman jclerman at jefflerman.net
Tue Jun 11 09:42:46 PDT 2013


Hi,

This is my first post to the list, and I apologize in advance for any 
naivete revealed by my question.  However:

BACKGROUND:
I have a project for which my team is extracting text from a large 
number (~100K) of PDF files from scientific publications.  These PDFs 
come from a wide variety of sources.  They often use obscure-sounding 
fonts for symbols, and those fonts do not seem to include ToUnicode data 
in the PDFs themselves.  The mapping in these fonts is not obvious and 
needs to be determined case by case (often character by character, 
when the font info is unavailable online).

I have been accumulating my own table of character mappings for those 
fonts, focusing on characters of most interest to our team (certain 
symbols).  I would like to be able to apply that table during 
text-extraction by pdftotext, but I don't see any way to do that 
currently.  Since complaints about obscure, undocumented font/character 
mappings are common online, support for such a table seems likely to be 
of broad interest.

REQUEST:
Ideally, I'd like to be able to take a 3-column table (see below) that I 
have built and supply it to pdftotext at runtime.  The table would be 
applied only when all of the following hold: a given character from a 
given font appears in a PDF, no ToUnicode table is supplied in the PDF, 
and the character appears in the supplied table.  Characters missing 
from the table would continue to be extracted the way pdftotext does it 
today - i.e., the table should have no effect on them.
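
As a rough illustration of the lookup order I have in mind, here is a 
small Python sketch.  It is not Poppler code: the names resolve and 
base_font_name, and the dictionary layout, are made up for this example, 
and "default" simply stands for whatever pdftotext would emit today.

```python
from typing import Dict, Tuple

# Hypothetical in-memory form of the user table:
# (base font name, character code in that font) -> Unicode string
UserTable = Dict[Tuple[str, int], str]


def base_font_name(name: str) -> str:
    """Drop a subset tag such as 'NJIBIE+' (six uppercase letters, then '+')."""
    tag, sep, rest = name.partition("+")
    if sep and len(tag) == 6 and tag.isalpha() and tag.isupper():
        return rest
    return name


def resolve(font: str, code: int, has_to_unicode: bool,
            table: UserTable, default: str) -> str:
    """Return the text to emit for one character.

    `default` is what the extractor would produce without the table.
    """
    # A ToUnicode map embedded in the PDF always wins; the table is ignored.
    if has_to_unicode:
        return default
    # Otherwise use the table entry if there is one, else today's behavior.
    return table.get((base_font_name(font), code), default)
```

With a table containing only ("YMath-Pack-Four", 0x41), a call for 
"NJIBIE+YMath-Pack-Four" and code 0x41 would return the table entry, 
while any other font or code would fall through to the default.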

The table would simply be a tab-delimited 3-column file with:
1. fontname, e.g. AdvP4C4E74 or AdvPi1 or YMath-Pack-Four, but NOT 
things like NJIBIE+YMath-Pack-Four
2. font character (could supply an actual character, or a hexadecimal 
codepoint)
3. desired Unicode mapping (again - could be an actual character or a 
codepoint)

Exact table format isn't a big deal, but the above info is all that 
should be needed.
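
To make the format concrete, here is a sketch of how such a file might 
be read, again in Python and again hypothetical: load_table and 
parse_code are invented names, and the "0x"/"U+" prefixes for 
hexadecimal codepoints and the "#" comment lines are assumptions of 
mine, not an existing pdftotext convention.

```python
from typing import Dict, Iterable, Tuple


def parse_code(cell: str) -> int:
    """Accept a hex codepoint ('0x41', 'U+2260') or one literal character."""
    prefix = cell[:2].upper()
    if len(cell) > 2 and prefix in ("0X", "U+"):
        return int(cell[2:], 16)
    return ord(cell)  # raises TypeError unless the cell is exactly one character


def load_table(lines: Iterable[str]) -> Dict[Tuple[str, int], str]:
    """Read tab-delimited rows of: fontname, font character, Unicode mapping."""
    table: Dict[Tuple[str, int], str] = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        # Skip comments, blank lines, and rows without exactly 3 columns.
        if len(fields) != 3 or fields[0].startswith("#"):
            continue
        font, source, target = fields
        table[(font.strip(), parse_code(source))] = chr(parse_code(target))
    return table
```

An open file can be passed directly as "lines", since iterating over a 
text file yields one line at a time.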

If there is *already* a way to do this in pdftotext, please let me 
know.  If there is a stopgap method by which I could add such info to 
Poppler source somewhere and then recompile (hard-coding the table), 
please let me know - I'm fine with that for short-term use, though I 
think a runtime table would be much more flexible and useful.

Thanks!
--Jeff Lerman


