[HarfBuzz] Normalization

Mon Aug 1 09:20:30 PDT 2011

Hi Behdad,

Don't you plan to add some property to switch normalization off?

Thanks

On Sun, Jul 24, 2011 at 5:37 AM, Behdad Esfahbod <behdad at behdad.org> wrote:
> Hi,
>
> If you paid any attention you may have noticed that I was hacking on getting
> composition / decomposition working.  And I'm glad to announce that it's all
> done and working and in master already.
>
> Here's the relevant source file:
>
>  http://cgit.freedesktop.org/harfbuzz/tree/src/hb-ot-shape-normalize.cc
>
> These rely on the two new Unicode callbacks compose() and decompose().  I've
> added similar APIs to glib (which will be shipped in 2.30), and the hb-glib
> glue layer uses those if available, or falls back to using g_utf8_normalize(),
> which is much much slower.  The hb-icu layer implements these using
> unorm_normalize() which has the same slowness problem.  If someone wants to
> look into adding compose()/decompose() API to ICU, that would be really cool.
>
> Here are a few lines about the design, copied from comments in the source:
>
> /*
>  * HIGHLEVEL DESIGN:
>  *
>  * This file exports one main function: _hb_ot_shape_normalize().
>  *
>  * This function closely reflects the Unicode Normalization Algorithm,
>  * yet it's different.  The shaper can either prefer decomposed (NFD) or
>  * composed (NFC).
>  *
>  * In general what happens is that: each grapheme is decomposed in a chain
>  * of 1:2 decompositions, marks reordered, and then recomposed if desired,
>  * so far it's like Unicode Normalization.  However, the decomposition and
>  * recomposition only happens if the font supports the resulting characters.
>  *
>  * The goals are:
>  *
>  *   - Try to render all canonically equivalent strings similarly.  To really
>  *     achieve this we have to always do the full decomposition and then
>  *     selectively recompose from there.  It's kinda too expensive though, so
>  *     we skip some cases.  For example, if composed is desired, we simply
>  *     don't touch 1-character clusters that are supported by the font, even
>  *     though their NFC may be different.
>  *
>  *   - When a font has a precomposed character for a sequence but the 'ccmp'
>  *     feature in the font is not adequate, use the precomposed character
>  *     which typically has better mark positioning.
>  *
>  *   - When a font does not support a character but supports its
>  *     decomposition, well, use the decomposition.
>  *
>  *   - The Indic shaper requests decomposed output.  This will handle
>  *     splitting matra for the Indic shaper.
>  */
>
>  /* We do a fairly straightforward yet custom normalization process in three
>   * separate rounds: decompose, reorder, recompose (if desired).  Currently
>   * this makes two buffer swaps.  We can make it faster by moving the last
>   * two rounds into the inner loop for the first round, but it's more
>   * readable this way. */
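
For readers who want to experiment with the idea, the three rounds described
above can be sketched roughly as follows. This is an illustrative Python
sketch, not the code in hb-ot-shape-normalize.cc. All function names here
(decompose, compose, has_glyph, normalize, and the per-round helpers) are
hypothetical, the font is modelled as a plain "has glyph" predicate, and
compose() is emulated with a full NFC pass, which is the slow fallback style
mentioned earlier rather than a real pairwise API. Hangul and the
single-character-cluster shortcut are handled only approximately.

```python
import unicodedata


def decompose(ch):
    """One canonical decomposition step, or None if there is none."""
    d = unicodedata.decomposition(ch)
    if not d or d.startswith("<"):  # empty, or a compatibility mapping
        return None
    return [chr(int(cp, 16)) for cp in d.split()]


def compose(a, b):
    """Pairwise composition, emulated via NFC (assumption: slow fallback)."""
    s = unicodedata.normalize("NFC", a + b)
    return s if len(s) == 1 else None


def full_decompose(ch):
    """Follow the chain of decompositions until nothing decomposes."""
    parts = decompose(ch)
    if parts is None:
        return [ch]
    return [c for p in parts for c in full_decompose(p)]


def decompose_round(text, has_glyph, prefer_decomposed):
    out = []
    for ch in text:
        if not prefer_decomposed and has_glyph(ch):
            out.append(ch)  # supported as is: leave it untouched
            continue
        parts = full_decompose(ch)
        # Decompose only if the font supports every resulting character.
        out.extend(parts if all(has_glyph(c) for c in parts) else [ch])
    return out


def reorder_round(chars):
    out = list(chars)
    i = 0
    while i < len(out):
        if unicodedata.combining(out[i]) == 0:
            i += 1
            continue
        j = i
        while j < len(out) and unicodedata.combining(out[j]) != 0:
            j += 1
        # Stable sort of a run of marks by canonical combining class.
        out[i:j] = sorted(out[i:j], key=unicodedata.combining)
        i = j
    return out


def recompose_round(chars, has_glyph):
    out = []
    starter = None  # index in `out` of the most recent starter
    for ch in chars:
        ccc = unicodedata.combining(ch)
        if starter is not None:
            adjacent = starter == len(out) - 1
            prev_ccc = unicodedata.combining(out[-1])
            # A mark is blocked by an intervening mark of equal/higher class.
            if adjacent or (prev_ccc != 0 and prev_ccc < ccc):
                composed = compose(out[starter], ch)
                # Recompose only if the font has the precomposed character.
                if composed is not None and has_glyph(composed):
                    out[starter] = composed
                    continue
        out.append(ch)
        if ccc == 0:
            starter = len(out) - 1
    return out


def normalize(text, has_glyph, prefer_decomposed=False):
    chars = decompose_round(text, has_glyph, prefer_decomposed)
    chars = reorder_round(chars)
    if not prefer_decomposed:
        chars = recompose_round(chars, has_glyph)
    return "".join(chars)
```

With a font modelled as the set {"e", U+0301}, normalize("\u00e9", font.__contains__)
yields "e" followed by U+0301, and with a font that has U+00E9 but no
standalone U+0301, the sequence "e" + U+0301 comes back as the precomposed
character. Keeping the rounds as three separate passes mirrors the
readability trade-off mentioned in the comment above.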
>
>
> Comments, feedback, and testing of functionality and performance are appreciated!
>
> Cheers,
> behdad