[HarfBuzz] Normalization
Behdad Esfahbod
behdad at behdad.org
Sat Jul 23 20:37:49 PDT 2011
Hi,
If you paid any attention you may have noticed that I was hacking on getting
composition / decomposition working. And I'm glad to announce that it's all
done and working and in master already.
Here's the relevant source file:
http://cgit.freedesktop.org/harfbuzz/tree/src/hb-ot-shape-normalize.cc
These rely on the two new Unicode callbacks compose() and decompose(). I've
added similar APIs to glib (which will be shipped in 2.30), and the hb-glib
glue layer uses those if available, or falls back to using g_utf8_normalize(),
which is much much slower. The hb-icu layer implements these using
unorm_normalize() which has the same slowness problem. If someone wants to
look into adding compose()/decompose() API to ICU, that would be really cool.
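To make the shape of the two callbacks concrete, here is a minimal sketch in Python using only the standard unicodedata module. The function names and return conventions are assumptions for illustration; they are not the HarfBuzz, glib, or ICU signatures. decompose() does a single table lookup, while compose_slow() mimics the fallback idea of running a full normalizer over the pair and checking whether one character comes out.

```python
import unicodedata


def decompose(ab):
    """One step of canonical decomposition for a single character.

    Returns (a, b) for a 1:2 mapping, (a, None) for a singleton mapping,
    or None when the character has no canonical decomposition.
    Note: unicodedata.decomposition() does not report the algorithmic
    Hangul syllable decompositions, so this sketch ignores them.
    """
    mapping = unicodedata.decomposition(ab)
    if not mapping or mapping.startswith("<"):
        return None  # nothing to do, or a compatibility-only mapping
    parts = [chr(int(cp, 16)) for cp in mapping.split()]
    return parts[0], (parts[1] if len(parts) > 1 else None)


def compose_slow(a, b):
    """Pairwise composition done the slow way: normalize the two-character
    string to NFC and accept the result only if it is a single character."""
    composed = unicodedata.normalize("NFC", a + b)
    return composed if len(composed) == 1 else None
```

The slow path pays for a full normalization pass on every pair, which is why dedicated pairwise APIs are preferable when the library offers them.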
Here are a few lines about the design, copied from comments in the source:
/*
* HIGHLEVEL DESIGN:
*
* This file exports one main function: _hb_ot_shape_normalize().
*
* This function closely reflects the Unicode Normalization Algorithm,
* yet it's different. The shaper can either prefer decomposed (NFD) or
* composed (NFC).
*
* In general what happens is that: each grapheme is decomposed in a chain
* of 1:2 decompositions, marks reordered, and then recomposed if desired,
* so far it's like Unicode Normalization. However, the decomposition and
* recomposition only happens if the font supports the resulting characters.
*
* The goals are:
*
* - Try to render all canonically equivalent strings similarly. To really
* achieve this we have to always do the full decomposition and then
* selectively recompose from there. It's kinda too expensive though, so
* we skip some cases. For example, if composed is desired, we simply
* don't touch 1-character clusters that are supported by the font, even
* though their NFC may be different.
*
* - When a font has a precomposed character for a sequence but the 'ccmp'
* feature in the font is not adequate, use the precomposed character
* which typically has better mark positioning.
*
* - When a font does not support a character but supports its
* decomposition, well, use the decomposition.
*
* - The Indic shaper requests decomposed output. This will handle
* splitting matra for the Indic shaper.
*/
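As a small illustration of what "canonically equivalent" means in the first goal, the following Python sketch compares strings by their NFD form using the standard unicodedata module. The helper name canonically_equivalent is hypothetical and is not part of HarfBuzz.

```python
import unicodedata


def canonically_equivalent(s, t):
    """Two strings are canonically equivalent when their full canonical
    decompositions (NFD) are identical."""
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)


# Precomposed e-acute versus "e" followed by COMBINING ACUTE ACCENT.
same_letter = canonically_equivalent("\u00e9", "e\u0301")

# Marks of different combining classes (acute above, dot below) may be
# typed in either order; canonical reordering makes them compare equal.
same_marks = canonically_equivalent("a\u0301\u0323", "a\u0323\u0301")

# Marks of the same combining class keep their order, so these differ.
different = canonically_equivalent("a\u0301\u0300", "a\u0300\u0301")
```

Strings that compare equal this way are the ones the shaper tries to render alike, whichever form the application happened to pass in.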
/* We do a fairly straightforward yet custom normalization process in three
* separate rounds: decompose, reorder, recompose (if desired). Currently
* this makes two buffer swaps. We can make it faster by moving the last
* two rounds into the inner loop for the first round, but it's more
* readable this way. */
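The three rounds can be sketched roughly as follows. This is a simplified Python illustration, not the code in hb-ot-shape-normalize.cc: the font is modeled as a plain set of supported characters (an assumption), every function name is hypothetical, each character is handled on its own rather than per cluster, and only combining marks are considered for recomposition.

```python
import unicodedata


def canonical_pair(ch):
    """(a, b) for a 1:2 canonical mapping, (a, None) for a singleton, else None."""
    mapping = unicodedata.decomposition(ch)
    if not mapping or mapping.startswith("<"):
        return None
    parts = [chr(int(cp, 16)) for cp in mapping.split()]
    return parts[0], (parts[1] if len(parts) > 1 else None)


def decompose(ch, supported):
    """Follow the chain of 1:2 decompositions as far as the font allows.
    Returns a list of characters, or None if no usable decomposition exists."""
    pair = canonical_pair(ch)
    if pair is None:
        return None
    a, b = pair
    if b is not None and b not in supported:
        return None
    tail = [] if b is None else [b]
    deeper = decompose(a, supported)
    if deeper is not None:
        return deeper + tail
    return [a] + tail if a in supported else None


def reorder_marks(chars):
    """Stable-sort each run of combining marks by canonical combining class."""
    out, i = [], 0
    while i < len(chars):
        j = i
        while j < len(chars) and unicodedata.combining(chars[j]) != 0:
            j += 1
        if j == i:  # a starter: copy it through unchanged
            out.append(chars[i])
            i += 1
        else:       # a run of marks: sorted() is stable, as required
            out.extend(sorted(chars[i:j], key=unicodedata.combining))
            i = j
    return out


def recompose(chars, supported):
    """Merge marks into the preceding starter when the font has the result."""
    out, starter, last_ccc = [], None, 0
    for ch in chars:
        ccc = unicodedata.combining(ch)
        # A mark is blocked if a kept mark of equal or higher class precedes it.
        if starter is not None and ccc != 0 and last_ccc < ccc:
            composed = unicodedata.normalize("NFC", out[starter] + ch)
            if len(composed) == 1 and composed in supported:
                out[starter] = composed
                continue
        out.append(ch)
        if ccc == 0:
            starter, last_ccc = len(out) - 1, 0
        else:
            last_ccc = ccc
    return out


def normalize_for_font(text, supported, prefer_composed=True):
    chars = []
    for ch in text:  # round 1: decompose
        pieces = None
        if not (prefer_composed and ch in supported):
            pieces = decompose(ch, supported)
        chars.extend(pieces if pieces is not None else [ch])
    chars = reorder_marks(chars)  # round 2: reorder
    if prefer_composed:           # round 3: recompose (if desired)
        chars = recompose(chars, supported)
    return "".join(chars)
```

For example, with a font that covers "e", U+0301 and U+00E9, the input "e" + U+0301 comes back as the single precomposed character, while a font that lacks U+00E9 but has the two pieces gets the decomposed sequence instead.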
Comments, feedback, and testing re functionality and performance are appreciated!
Cheers,
behdad