[poppler] [PATCH] Fixup LaTeX composed characters
ross.moore at mq.edu.au
Fri Mar 25 14:37:08 PDT 2011
Hi Albert and Tim,
>>>> Yes, this is an issue with pdflatex but there are 100,000s of
>>>> PDFs for which we don't have source for ...
>>> Hmmm, is it supposed to just kill the diacritic mark?
>>> R. L¨wen and B. Polster
>>> gets converted to
>>> R. Lowen and B. Polster
>>> shouldn't it be
>>> R. Löwen and B. Polster
>> It should do - can you send me this PDF?
>> I get this from TeX:
>> R. L\"owen and B. Polster => R. Löwen and B. Polster
Note that this example has a customized CMap for each font, so it is not typical of older TeX-produced PDFs. I'm therefore not surprised that Tim's method does not work with it.
This should just mean that there are further patterns in the output that could be recognised and replaced, either by the proper precomposed Unicode character or by a base character plus combining character pair.
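To illustrate the kind of replacement I mean, here is a rough sketch in Python. It is not Tim's patch and not poppler code. It assumes the extracted text contains a spacing accent immediately followed by its base letter (as TeX typesets \"o), and the names SPACING_TO_COMBINING and fixup_composed are made up for this example. The accent table is only a small illustrative subset.

```python
import re
import unicodedata

# Hypothetical, illustrative subset: spacing accents as they may appear in
# extracted text, mapped to the corresponding Unicode combining characters.
SPACING_TO_COMBINING = {
    "\u00a8": "\u0308",  # DIAERESIS -> COMBINING DIAERESIS
    "\u00b4": "\u0301",  # ACUTE ACCENT -> COMBINING ACUTE ACCENT
    "\u02c6": "\u0302",  # MODIFIER LETTER CIRCUMFLEX ACCENT -> COMBINING CIRCUMFLEX ACCENT
    "\u02dc": "\u0303",  # SMALL TILDE -> COMBINING TILDE
}

# Match one spacing accent followed directly by an ASCII letter.
_ACCENT_THEN_BASE = re.compile(
    "([" + "".join(re.escape(c) for c in SPACING_TO_COMBINING) + "])([A-Za-z])"
)


def fixup_composed(text):
    """Replace 'accent + base' pairs with a composed or combining form."""

    def repl(match):
        accent, base = match.group(1), match.group(2)
        # Reorder to base + combining mark, then let NFC normalisation pick
        # the precomposed character when Unicode defines one. Otherwise the
        # base + combining pair is kept as it is.
        return unicodedata.normalize("NFC", base + SPACING_TO_COMBINING[accent])

    return _ACCENT_THEN_BASE.sub(repl, text)


if __name__ == "__main__":
    print(fixup_composed("R. L\u00a8owen and B. Polster"))
```

A PDF with customized CMaps may produce different patterns from the one assumed here, so the regular expression would need to be adapted to whatever the extractor actually emits.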
>> NB I just tried extracting from a Word-generated PDF and TextOutputDev
>> didn't see the line with the diacritic at all.
> And are you sure it's not a Word fault?
Hope this helps,