[HarfBuzz] Normalization

Sat Jul 23 20:37:49 PDT 2011

Hi,

If you paid any attention you may have noticed that I was hacking on getting
composition / decomposition working.  And I'm glad to announce that it's all
done and working and in master already.

Here's the relevant source file:

  http://cgit.freedesktop.org/harfbuzz/tree/src/hb-ot-shape-normalize.cc

The code relies on two new Unicode callbacks, compose() and decompose().  I've
added similar APIs to glib (to be shipped in 2.30), and the hb-glib glue layer
uses those if available, falling back to g_utf8_normalize() otherwise, which is
much, much slower.  The hb-icu layer implements them using unorm_normalize(),
which has the same slowness problem.  If someone wants to look into adding a
compose()/decompose() API to ICU, that would be really cool.
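
To make the pairwise contract of the two callbacks concrete, here is a
minimal sketch in plain C.  It is not the HarfBuzz, glib or ICU code: the
names toy_pairs, toy_decompose() and toy_compose() are made up for
illustration, and the three-entry table stands in for real Unicode data.
The point is only that decompose() performs a single 1:2 step and
compose() is its inverse.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy canonical pairs: ab <-> (a, b).  A real callback would consult the
 * Unicode Character Database; this tiny table is an assumption. */
static const uint32_t toy_pairs[][3] = {
  /* ab      a       b */
  {0x00E9, 0x0065, 0x0301}, /* e-acute      = e + COMBINING ACUTE ACCENT */
  {0x00EA, 0x0065, 0x0302}, /* e-circumflex = e + COMBINING CIRCUMFLEX ACCENT */
  {0x1EBF, 0x00EA, 0x0301}, /* U+1EBF       = e-circumflex + COMBINING ACUTE ACCENT */
};
static const size_t toy_n_pairs = sizeof toy_pairs / sizeof toy_pairs[0];

/* decompose(): split ab into exactly one (a, b) pair; one step only. */
static bool toy_decompose(uint32_t ab, uint32_t *a, uint32_t *b)
{
  assert(a != NULL && b != NULL);
  for (size_t i = 0; i < toy_n_pairs; i++)
    if (toy_pairs[i][0] == ab) {
      *a = toy_pairs[i][1];
      *b = toy_pairs[i][2];
      return true;
    }
  return false; /* ab has no decomposition in this table */
}

/* compose(): the inverse; merge the pair (a, b) into ab if one exists. */
static bool toy_compose(uint32_t a, uint32_t b, uint32_t *ab)
{
  assert(ab != NULL);
  for (size_t i = 0; i < toy_n_pairs; i++)
    if (toy_pairs[i][1] == a && toy_pairs[i][2] == b) {
      *ab = toy_pairs[i][0];
      return true;
    }
  return false; /* no precomposed form for this pair in this table */
}
```

Note that toy_decompose(0x1EBF, ...) yields U+00EA and U+0301, not the
fully decomposed three-character sequence; getting all the way down takes
repeated calls.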

Here are a few lines about the design, copied from comments in the source:

/*
 * HIGHLEVEL DESIGN:
 *
 * This file exports one main function: _hb_ot_shape_normalize().
 *
 * This function closely reflects the Unicode Normalization Algorithm,
 * yet it's different.  The shaper can either prefer decomposed (NFD) or
 * composed (NFC).
 *
 * In general what happens is that each grapheme is decomposed in a chain
 * of 1:2 decompositions, marks are reordered, and then recomposed if
 * desired.  So far it's like Unicode Normalization.  However, the
 * decomposition and recomposition only happen if the font supports the
 * resulting characters.
 *
 * The goals are:
 *
 *   - Try to render all canonically equivalent strings similarly.  To really
 *     achieve this we have to always do the full decomposition and then
 *     selectively recompose from there.  It's kinda too expensive though, so
 *     we skip some cases.  For example, if composed is desired, we simply
 *     don't touch 1-character clusters that are supported by the font, even
 *     though their NFC may be different.
 *
 *   - When a font has a precomposed character for a sequence but the 'ccmp'
 *     feature in the font is not adequate, use the precomposed character,
 *     which typically has better mark positioning.
 *
 *   - When a font does not support a character but supports its
 *     decomposition, well, use the decomposition.
 *
 *   - The Indic shaper requests decomposed output.  This will handle
 *     splitting matra for the Indic shaper.
 */
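
The "chain of 1:2 decompositions, only if the font supports the result"
idea can be sketched as below.  This is a simplified illustration under
stated assumptions, not the code in hb-ot-shape-normalize.cc: toy_font_t,
font_has(), toy_decompose() and decompose_for_font() are hypothetical
names, the font is modelled as a plain list of supported code points, and
the decomposition table has three entries.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical font model: just the list of code points it can render. */
typedef struct { const uint32_t *cps; size_t len; } toy_font_t;

static bool font_has(const toy_font_t *font, uint32_t cp)
{
  for (size_t i = 0; i < font->len; i++)
    if (font->cps[i] == cp) return true;
  return false;
}

/* Toy single-step 1:2 decomposition (stand-in for the Unicode callback). */
static bool toy_decompose(uint32_t ab, uint32_t *a, uint32_t *b)
{
  static const uint32_t t[][3] = {
    {0x00E9, 0x0065, 0x0301}, {0x00EA, 0x0065, 0x0302},
    {0x1EBF, 0x00EA, 0x0301},
  };
  for (size_t i = 0; i < sizeof t / sizeof t[0]; i++)
    if (t[i][0] == ab) { *a = t[i][1]; *b = t[i][2]; return true; }
  return false;
}

/* Append the deepest decomposition of ab that the font can render to
 * out[*len...].  Returns false and leaves *len untouched if no supported
 * decomposition exists; the caller then keeps ab as it is. */
static bool decompose_for_font(const toy_font_t *font, uint32_t ab,
                               uint32_t *out, size_t cap, size_t *len)
{
  uint32_t a, b;
  if (!toy_decompose(ab, &a, &b) || !font_has(font, b))
    return false;
  /* Try to go one level deeper on the left-hand part first. */
  if (!decompose_for_font(font, a, out, cap, len)) {
    if (!font_has(font, a))
      return false;            /* neither a nor its parts are usable */
    assert(*len < cap);
    out[(*len)++] = a;
  }
  assert(*len < cap);
  out[(*len)++] = b;
  return true;
}
```

With a font that has only e, acute and circumflex, U+1EBF comes out as
three code points; with a font that has e-circumflex and acute but no bare
circumflex, the chain stops one level up and yields two.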

  /* We do a fairly straightforward yet custom normalization process in three
   * separate rounds: decompose, reorder, recompose (if desired).  Currently
   * this makes two buffer swaps.  We can make it faster by moving the last
   * two rounds into the inner loop for the first round, but it's more
   * readable this way. */
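
For readers who want to see the last two rounds in isolation, here is a
rough sketch of "reorder" and "recompose" working in place on an already
decomposed array.  Again this is an illustration, not the HarfBuzz
implementation: toy_ccc(), toy_compose(), reorder_marks() and recompose()
are hypothetical names, the tables cover only a handful of characters, and
the font-support check on the composed result is left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Canonical combining classes for the few marks used in this sketch. */
static unsigned toy_ccc(uint32_t cp)
{
  switch (cp) {
    case 0x0301: return 230; /* COMBINING ACUTE ACCENT */
    case 0x0302: return 230; /* COMBINING CIRCUMFLEX ACCENT */
    case 0x0323: return 220; /* COMBINING DOT BELOW */
    default:     return 0;   /* treated as a starter */
  }
}

/* Toy 2:1 composition (stand-in for the Unicode compose() callback). */
static bool toy_compose(uint32_t a, uint32_t b, uint32_t *ab)
{
  static const uint32_t t[][3] = {
    {0x0065, 0x0301, 0x00E9}, /* e + acute     */
    {0x0065, 0x0323, 0x1EB9}, /* e + dot below */
  };
  for (size_t i = 0; i < sizeof t / sizeof t[0]; i++)
    if (t[i][0] == a && t[i][1] == b) { *ab = t[i][2]; return true; }
  return false;
}

/* Round 2: stable insertion sort of each run of marks by combining class.
 * Marks never move across a starter (class 0). */
static void reorder_marks(uint32_t *buf, size_t len)
{
  assert(buf != NULL || len == 0);
  for (size_t i = 1; i < len; i++) {
    uint32_t cp = buf[i];
    unsigned cc = toy_ccc(cp);
    if (cc == 0) continue;
    size_t j = i;
    while (j > 0 && toy_ccc(buf[j - 1]) > cc) { buf[j] = buf[j - 1]; j--; }
    buf[j] = cp;
  }
}

/* Round 3: fold marks into the preceding starter where a precomposed form
 * exists and the mark is not blocked.  Returns the new length. */
static size_t recompose(uint32_t *buf, size_t len)
{
  assert(buf != NULL || len == 0);
  if (len == 0) return 0;
  size_t starter = 0, out = 1;
  for (size_t i = 1; i < len; i++) {
    uint32_t composed;
    unsigned cc = toy_ccc(buf[i]);
    bool unblocked = (starter == out - 1) || toy_ccc(buf[out - 1]) < cc;
    if (cc != 0 && unblocked && toy_compose(buf[starter], buf[i], &composed)) {
      buf[starter] = composed; /* the mark is absorbed into the starter */
      continue;
    }
    if (cc == 0) starter = out;
    buf[out++] = buf[i];
  }
  return out;
}
```

Feeding it e, acute, dot below: the reorder round moves the dot below
(class 220) ahead of the acute (class 230), and the recompose round then
merges e with the dot below, leaving the acute as a separate mark because
this toy table has no precomposed form for that pair.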

Comments, feedback, and testing re functionality and performance are appreciated!

Cheers,
behdad