[poppler] [PATCH] Fixup LaTeX composed characters

Jonathan Kew jfkthame at googlemail.com
Mon May 9 05:30:11 PDT 2011

On 9 May 2011, at 11:22, Tim Brody wrote:

> On Sat, 2011-05-07 at 19:02 +0100, Jonathan Kew wrote:
>> On 7 May 2011, at 17:43, Albert Astals Cid wrote:
>>> On Friday, April 01, 2011, Albert Astals Cid wrote:
>>>> On Friday, 1 April 2011, Tim Brody wrote:
>>>>> On Thu, 31 Mar 2011 23:28:02 +0100, Albert Astals Cid <aacid at kde.org>
>>>>> wrote:
>>>>>> On Wednesday, 30 March 2011, you wrote:
>>>>>>> On Tue, 2011-03-29 at 22:45 +0100, Albert Astals Cid wrote:
>>>>>>>>>> I still get
>>>>>>>>>> -R. L¨wen and B. Polster
>>>>>>>>>> -o
>>>>>>>>>> +R. Lowen and B. Polster
>>>>>>>>>> Maybe you sent an old version of the patch? Can anyone confirm if
>>>>>> My bad, somehow vi/diff/less are showing me o but if i open it in kate
>>>>>> i see
>>>>>> an ö
>>>>> That will be because it's separate characters (X + combining char). You
>>>>> could normalise with unicodeNormalizeNFKC but I thought it probably
>>>>> better to leave text - as far as possible - unchanged from the PDF
>>>>> source.
>>>> Hmmmmmm, since we are already changing the "real" representation of the
>>>> text (i.e. transforming it from broken to not broken), I think I prefer one
>>>> that is easy to use (i.e. shows ö in most of the tools). What do others
>>>> think?
>>> Since the others are not there, please do what i want and output a real ö
>> If you're going to apply a Unicode normalization process, please use
>> NFC rather than NFKC. This will deal with creating precomposed
>> letter+accent combinations, but avoids introducing "compatibility"
>> changes that may lose significant distinctions in the text.
> For reference:
> NFC = pre-composed
> NFKC = pre-composed plus simplified ligatures ('ﬁ' => 'f'+'i')

NFKC will do much more than that; for example, mapping super- and subscripted letters to their "unstyled" counterparts. Even the ™ symbol is mapped to "TM". I don't think this would be desirable here.
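
To make the difference concrete, here is a minimal sketch using Python's standard unicodedata module (not poppler code; the helper name normalize_forms is just for illustration). It composes the "o + combining diaeresis" case from this thread and shows a few characters that NFC leaves alone but NFKC rewrites.

```python
import unicodedata

def normalize_forms(text):
    """Return the (NFC, NFKC) forms of text as a tuple."""
    return (unicodedata.normalize("NFC", text),
            unicodedata.normalize("NFKC", text))

# 'o' followed by U+0308 COMBINING DIAERESIS, as in the extracted text.
decomposed = "Lo\u0308wen"
nfc, nfkc = normalize_forms(decomposed)
# Both forms compose the pair into the single code point U+00F6.
print(nfc == "L\u00f6wen", nfkc == "L\u00f6wen")

# Characters with compatibility mappings: NFC keeps them, NFKC does not.
for ch in ["\u2122",   # TRADE MARK SIGN
           "\ufb01",   # LATIN SMALL LIGATURE FI
           "\u00b2"]:  # SUPERSCRIPT TWO
    c, k = normalize_forms(ch)
    print(f"U+{ord(ch):04X}: NFC={c!r} NFKC={k!r}")
```

Either form would fix the ö case; only NFKC also changes the trademark sign, the ligature and the superscript.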

(See http://minaret.info/test/normalize.msp to experiment with normalization forms.)

> I agree but there isn't an NFC in poppler. It seems a waste of time to
> be writing one from scratch in Poppler or is there really no Unicode
> library that provides normalisations?

The obvious example is ICU, but I doubt poppler wants to pull in a dependency on that. Though if it's not for core poppler but just a particular (poppler-based) tool, perhaps it's not such a bad idea.

glib also supports it, but may not be readily available everywhere that people want to use poppler.

A much smaller lib that includes an NFC function is TECkit <http://scripts.sil.org/teckit> (disclaimer: that was a project of mine), though it is not actively maintained these days, and could do with an update for Unicode 6.0. But the code is there, and updating the Unicode tables would be simple.

If poppler already has support for NFKC, I would expect it to be easy to support NFC as well - essentially, it just means using a subset of the decomposition tables (the canonical mappings only, leaving out the compatibility ones). But I haven't actually looked at the code to see how this would work in practice.
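
As a rough illustration of the "subset of the tables" idea, the sketch below uses the decomposition data exposed by Python's unicodedata module, where compatibility mappings carry a tag such as <compat> or <super> and canonical mappings carry none. This is an assumption-laden toy, not poppler's implementation: the function name decompose is made up, and it skips Hangul, canonical reordering of combining marks, and the recomposition step that NFC and NFKC perform afterwards.

```python
import unicodedata

def decompose(ch, compat):
    """Recursively decompose one character.

    With compat=False only canonical mappings are followed (as in NFD);
    with compat=True tagged compatibility mappings are followed too
    (as in NFKD).
    """
    entry = unicodedata.decomposition(ch)
    if not entry:
        return ch                      # no mapping for this character
    fields = entry.split()
    if fields[0].startswith("<"):      # tagged, so a compatibility mapping
        if not compat:
            return ch                  # the canonical-only pass ignores it
        fields = fields[1:]
    return "".join(decompose(chr(int(f, 16)), compat) for f in fields)

for ch in ["\u00f6", "\ufb01", "\u2122"]:
    print(f"U+{ord(ch):04X}:",
          repr(decompose(ch, compat=False)),
          repr(decompose(ch, compat=True)))
```

The canonical-only pass touches ö but leaves the ligature and the trademark sign alone, which is the behaviour wanted here.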

