[poppler] pdftotext feature request: user-specified toUnicode-like tables

Tue Jun 11 10:49:25 PDT 2013

Thanks!

Regarding your #1: Yes, the embedded fonts I'm seeing often only include 
a few required symbols.  However, so far I have not seen any cases where 
a particular character in a particular named font has a different 
mapping from one PDF to the next.  I am prepared to believe that such a 
thing might happen (for a while I thought it happened often) but so far 
it seems to be rare.  The vast majority of the problems I am seeing 
could be addressed by reference to a user-constructed table like I am 
proposing - and such a table would allow a user to fix problems quickly 
for a set of PDFs that use some obscure font.  Note also that I specify 
that characters missing from my table should be handled by whatever 
default path pdftotext already uses; the table would have no effect on 
them.
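
To make the proposal concrete, here is a minimal sketch of the lookup in 
Python (not C++, and not tied to any existing Poppler API).  Every name 
in it is hypothetical, the "0x" prefix for hex codepoints is just one 
possible convention, and the table entries are invented for 
illustration, not real mappings for those fonts.  It strips the subset 
tag (six uppercase letters plus "+") from the font name, consults the 
table only when the PDF supplies no toUnicode data, and otherwise 
returns whatever the default path produced.

```python
import re

# Hypothetical helpers; nothing here is part of Poppler.
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")

def base_font_name(name):
    """Drop a subset tag: 'NJIBIE+YMath-Pack-Four' -> 'YMath-Pack-Four'."""
    return SUBSET_TAG.sub("", name)

def parse_value(field):
    """Accept a literal character or a hex codepoint such as '0x2212'.
    The '0x' prefix is an assumed convention, not an agreed format."""
    if field.lower().startswith("0x"):
        return chr(int(field, 16))
    return field

def load_table(lines):
    """Read tab-delimited rows: fontname, font character, Unicode mapping."""
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        font, src, dst = line.split("\t")
        table[(font, parse_value(src))] = parse_value(dst)
    return table

def remap(table, font, char, default, has_to_unicode=False):
    """Return the table's mapping, or `default` (the text the normal
    extraction path produced) when the table should not apply."""
    if has_to_unicode:
        return default  # the PDF's own toUnicode data takes precedence
    return table.get((base_font_name(font), char), default)

# Invented example rows, for illustration only.
table = load_table([
    "AdvPi1\t0x31\t0x2212",
    "YMath-Pack-Four\tA\t0x03B1",
])
print(remap(table, "NJIBIE+YMath-Pack-Four", "A", default="A"))
print(remap(table, "AdvPi1", "2", default="2"))  # not in table: unchanged
```

Keying on (fontname, character) is what makes the remapping font-aware: 
the same character code in a different font is simply a different key.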

Regarding #2 (pdftohtml solution): Currently, pdftohtml (version 0.22.4) 
does a poor job of indicating which font each character in a PDF 
belongs to.  The font it reports seems to be assigned per word rather 
than per character, so some pdftohtml cleanup is definitely needed 
there.  In the meantime, what I really want is almost exactly what 
pdftotext already provides, plus font-aware remapping of characters.  
I'd much prefer a solution that lets pdftotext "do the right thing" for 
these fonts, since I already have the mapping info for the cases I care 
most about.

Unfortunately, I am not really a C++ programmer, so minor code edits and 
rebuilds are within my skillset, but significant enhancements/rewrites 
are not.

If you have PDF examples where a single glyph is represented using 
multiple character codes, that would be interesting to see - but would 
not be a problem for a remapping algorithm (and I can imagine cases 
where it would happen; in fact it essentially does happen already in 
Unicode).  Many-to-one (several codes for one character) is easy.  
One-to-many (one code in one font standing for different characters) 
would obviously be problematic - are you saying you've seen that too?  
I thought that would be impossible, assuming a font-aware algorithm.
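
The distinction can be checked mechanically on a mapping table.  The 
following Python sketch (hypothetical names and invented rows, not 
Poppler code) accepts any number of codes mapping to the same character, 
and reports only the problematic case: one (font, code) pair that has 
been assigned more than one target.

```python
from collections import defaultdict

def find_conflicts(rows):
    """rows: iterable of (font, code, target) tuples.
    Return the (font, code) keys that have more than one distinct target."""
    targets = defaultdict(set)
    for font, code, target in rows:
        targets[(font, code)].add(target)
    return {key: sorted(vals) for key, vals in targets.items() if len(vals) > 1}

# Invented rows, for illustration only.
rows = [
    ("F1", "a", "\u03b1"),  # many-to-one: two codes in F1 map to alpha
    ("F1", "b", "\u03b1"),
    ("F2", "a", "\u03b2"),  # same code, different font: a different key
    ("F1", "c", "x"),       # one-to-many: same font and code, two targets
    ("F1", "c", "y"),
]
print(find_conflicts(rows))
```

Only the last pair of rows is reported, which is the case that a 
font-aware table cannot resolve by itself.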

Thanks,
--Jeff

On 6/11/2013 10:06 AM, Ihar `Philips` Filipau wrote:
> Hi!
>
> #1.
> You can't make a global per-font table as you envision it. The
> embedded fonts often include only the required symbols, meaning that
> embedded versions of the same font might and do differ from document
> to document - and consequently the character codes differ too.
>
> #2.
> I worked on something similar a long time ago. What I did was modify
> pdftohtml to print the characters of fonts without a Unicode mapping
> as raw codes, in the XML/HTML notation: &#<code>; (I can't remember
> right now what trick I used to differentiate the fonts.) Finally, I
> replaced the codes with real characters semi-manually.
>
>
>> If there is a stopgap method by which I could add such info to
>> Poppler source somewhere and then recompile (hard-coding the table),
>> please let me know - I'm fine with that for short-term use though I
>> think a runtime table would be much much more flexible and useful.
> I will try to locate my sources.
> That would at least give you hints about where to plug in the tables.
> But due to #1, you shouldn't trust such automated conversions too much.
>
> P.S. I have also seen cases where a single character was, for whatever
> reason, represented with *multiple* character codes. IOW, with some
> documents character code -> Unicode translation isn't possible, as it
> would leave some garbage in the document.
>
> On 6/11/13, Jeff Lerman <jclerman at jefflerman.net> wrote:
>> Hi,
>>
>> This is my first post to the list, and I apologize in advance for any
>> naivete revealed by my question.  However:
>>
>> BACKGROUND:
>> I have a project for which my team is extracting text from a large
>> number (~100K) of PDF files from scientific publications.  These PDFs
>> come from a wide variety of sources.  They often use obscure-sounding
>> fonts for symbols, and those fonts do not seem to include toUnicode data
>> in the PDFs themselves.  The mapping in these fonts is not obvious and
>> needs to be determined case by case (often character by character
>> when the font info is unavailable online).
>>
>> I have been accumulating my own table of character mappings for those
>> fonts, focusing on characters of most interest to our team (certain
>> symbols).  I would like to be able to apply that table during
>> text-extraction by pdftotext, but I don't see any way to do that
>> currently.  Since complaints about obscure undocumented font/character
>> mappings are common online, application of such a table seems like
>> something that would be of potentially broad interest.
>>
>> REQUEST:
>> Ideally, I'd like to be able to take a 3-column table (see below) that I
>> have built and supply it to pdftotext at runtime.  The table would be
>> applied in cases where a given character from a given font appears in a
>> PDF, no toUnicode table is supplied in the PDF, and the character does
>> appear in the supplied table (characters missing from the table would
>> continue to be extracted the way pdftotext does it today - i.e.,
>> characters missing from the table should have no effect).
>>
>> The table would simply be a tab-delimited 3-column file with:
>> 1. fontname, e.g. AdvP4C4E74 or AdvPi1 or YMath-Pack-Four, but NOT
>> things like NJIBIE+YMath-Pack-Four
>> 2. font character (could supply an actual character, or a hexadecimal
>> codepoint)
>> 3. desired Unicode mapping (again - could be an actual character or a
>> codepoint)
>>
>> Exact table format isn't a big deal, but the above info is all that
>> should be needed.
>>
>> If there is *already* a way to do this in pdftotext, please let me
>> know.  If there is a stopgap method by which I could add such info to
>> Poppler source somewhere and then recompile (hard-coding the table),
>> please let me know - I'm fine with that for short-term use though I
>> think a runtime table would be much much more flexible and useful.
>>
>> Thanks!
>> --Jeff Lerman