[HarfBuzz] Normalization

Mon Aug 1 11:41:01 PDT 2011

On 08/01/11 12:20, Butrus Damaskus wrote:
> Hi Behdad,
> 
> don't you plan to add some property to switch normalization off?

No.  Why?

behdad

> Thanks
> 
> On Sun, Jul 24, 2011 at 5:37 AM, Behdad Esfahbod <behdad at behdad.org> wrote:
>> Hi,
>>
>> If you paid any attention you may have noticed that I was hacking on getting
>> composition / decomposition working.  And I'm glad to announce that it's all
>> done and working and in master already.
>>
>> Here's the relevant source file:
>>
>>  http://cgit.freedesktop.org/harfbuzz/tree/src/hb-ot-shape-normalize.cc
>>
>> These rely on the two new Unicode callbacks compose() and decompose().  I've
>> added similar APIs to glib (which will be shipped in 2.30), and the hb-glib
>> glue layer uses those if available, or falls back to using g_utf8_normalize(),
>> which is much much slower.  The hb-icu layer implements these using
>> unorm_normalize() which has the same slowness problem.  If someone wants to
>> look into adding compose()/decompose() API to ICU, that would be really cool.
>>
>> Here's a few lines about the design, copying from comments in the source:
>>
>> /*
>>  * HIGHLEVEL DESIGN:
>>  *
>>  * This file exports one main function: _hb_ot_shape_normalize().
>>  *
>>  * This function closely reflects the Unicode Normalization Algorithm,
>>  * yet it's different.  The shaper can either prefer decomposed (NFD) or
>>  * composed (NFC).
>>  *
>>  * In general what happens is that: each grapheme is decomposed in a chain
>>  * of 1:2 decompositions, marks reordered, and then recomposed if desired.
>>  * So far it's like Unicode Normalization.  However, the decomposition and
>>  * recomposition only happen if the font supports the resulting characters.
>>  *
>>  * The goals are:
>>  *
>>  *   - Try to render all canonically equivalent strings similarly.  To really
>>  *     achieve this we have to always do the full decomposition and then
>>  *     selectively recompose from there.  It's kinda too expensive though, so
>>  *     we skip some cases.  For example, if composed is desired, we simply
>>  *     don't touch 1-character clusters that are supported by the font, even
>>  *     though their NFC may be different.
>>  *
>>  *   - When a font has a precomposed character for a sequence but the 'ccmp'
>>  *     feature in the font is not adequate, use the precomposed character
>>  *     which typically has better mark positioning.
>>  *
>>  *   - When a font does not support a character but supports its
>>  *     decomposition, well, use the decomposition.
>>  *
>>  *   - The Indic shaper requests decomposed output.  This will handle
>>  *     splitting matra for the Indic shaper.
>>  */
>>
>>  /* We do a fairly straightforward yet custom normalization process in three
>>   * separate rounds: decompose, reorder, recompose (if desired).  Currently
>>   * this makes two buffer swaps.  We can make it faster by moving the last
>>   * two rounds into the inner loop for the first round, but it's more
>>   * readable this way. */
>>
>>
>> Comments, feedback, and testing re functionality and performance are appreciated!
>>
>> Cheers,
>> behdad
>
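
To make the three rounds in the quoted design comment concrete, here is a
minimal sketch in Python.  It is not HarfBuzz's C++ and is not taken from
hb-ot-shape-normalize.cc.  All function names are hypothetical, the font is
modeled simply as a set of supported characters, and compose()/decompose()
are approximated with Python's unicodedata module.  It works on one cluster
at a time and only illustrates the "do it only if the font supports the
result" idea.

```python
import unicodedata


def canonical_step(ch):
    """One canonical decomposition step for ch as a list, or None."""
    d = unicodedata.decomposition(ch)
    if not d or d.startswith("<"):  # no mapping, or compatibility-only
        return None
    return [chr(int(cp, 16)) for cp in d.split()]


def compose(a, b):
    """Canonical composition of the pair (a, b), or None."""
    s = unicodedata.normalize("NFC", a + b)
    return s if len(s) == 1 else None


def decompose_supported(ch, font):
    """Chain of decomposition steps, kept only if the font has the pieces."""
    parts = canonical_step(ch)
    if parts is None:
        return None
    a, rest = parts[0], parts[1:]
    if any(c not in font for c in rest):
        return None
    deeper = decompose_supported(a, font)
    if deeper is not None:
        return deeper + rest
    return [a] + rest if a in font else None


def reorder_marks(chars):
    """Stable-sort each run of combining marks by combining class."""
    out, i = list(chars), 0
    while i < len(out):
        if unicodedata.combining(out[i]) == 0:
            i += 1
            continue
        j = i
        while j < len(out) and unicodedata.combining(out[j]) != 0:
            j += 1
        out[i:j] = sorted(out[i:j], key=unicodedata.combining)
        i = j
    return out


def recompose(chars, font):
    """Compose marks onto the last starter, only if the font has the result."""
    if not chars:
        return []
    out, starter = [chars[0]], 0
    for ch in chars[1:]:
        ccc = unicodedata.combining(ch)
        # Blocked if an uncomposed mark of equal or higher class intervenes.
        blocked = (starter != len(out) - 1
                   and unicodedata.combining(out[-1]) >= ccc)
        composed = None if blocked else compose(out[starter], ch)
        if composed is not None and composed in font:
            out[starter] = composed
        else:
            out.append(ch)
            if ccc == 0:
                starter = len(out) - 1
    return out


def normalize_cluster(cluster, font, prefer_composed=True):
    """Decompose, reorder, and optionally recompose one cluster."""
    # Shortcut from the design notes: leave supported 1-character clusters alone.
    if prefer_composed and len(cluster) == 1 and cluster in font:
        return cluster
    decomposed = []
    for ch in cluster:
        parts = decompose_supported(ch, font)
        decomposed.extend(parts if parts is not None else [ch])
    ordered = reorder_marks(decomposed)
    if prefer_composed:
        ordered = recompose(ordered, font)
    return "".join(ordered)
```

With a font that lacks U+00E9 but has "e" and U+0301, the call
normalize_cluster("\u00e9", {"e", "\u0301"}) falls back to the decomposed
pair.  With a font that has all three, the input "e\u0301" is recomposed to
U+00E9.  Passing prefer_composed=False skips the last round, which is the
mode the design notes describe for the Indic shaper.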