[HarfBuzz] Normalization
Butrus Damaskus
butrus.butrus at gmail.com
Mon Aug 1 09:20:30 PDT 2011
Hi Behdad,
do you plan to add a property to switch normalization off?
Thanks
On Sun, Jul 24, 2011 at 5:37 AM, Behdad Esfahbod <behdad at behdad.org> wrote:
> Hi,
>
> If you paid any attention you may have noticed that I was hacking on getting
> composition / decomposition working. And I'm glad to announce that it's all
> done and working and in master already.
>
> Here's the relevant source file:
>
> http://cgit.freedesktop.org/harfbuzz/tree/src/hb-ot-shape-normalize.cc
>
> These rely on the two new Unicode callbacks compose() and decompose(). I've
> added similar APIs to glib (which will be shipped in 2.30), and the hb-glib
> glue layer uses those if available, or falls back to using g_utf8_normalize(),
> which is much much slower. The hb-icu layer implements these using
> unorm_normalize() which has the same slowness problem. If someone wants to
> look into adding compose()/decompose() API to ICU, that would be really cool.
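
A rough sketch of how pairwise compose()/decompose() callbacks can be emulated on top of a full-string normalizer, which is the kind of slow fallback described above. This is only an illustration in Python using the standard unicodedata module, not the actual hb-glib or hb-icu code; the function names and return conventions are my own assumptions.

```python
import unicodedata
from typing import Optional, Tuple


def compose(a: str, b: str) -> Optional[str]:
    """Return the single character that a+b composes to, or None.

    Emulated by normalizing the two-character string to NFC and checking
    whether the result collapsed into one character.
    """
    s = unicodedata.normalize("NFC", a + b)
    return s if len(s) == 1 else None


def decompose(ab: str) -> Optional[Tuple[str, str]]:
    """Return a 1:2 split (a, b) of ab, or None if ab does not decompose.

    Emulated with full NFD: the last character becomes b, and the rest is
    recomposed with NFC to recover a. A singleton decomposition is
    reported as (a, "").
    """
    d = unicodedata.normalize("NFD", ab)
    if d == ab:
        return None
    if len(d) == 1:
        return (d, "")
    head = unicodedata.normalize("NFC", d[:-1])
    if len(head) != 1:
        # The leading part does not recompose to one character
        # (e.g. composition exclusions); give up in this sketch.
        return None
    return (head, d[-1])
```

Every call runs the full normalizer over a tiny string, which is why a direct pairwise API is so much cheaper.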
>
> Here's a few lines about the design, copying from comments in the source:
>
> /*
> * HIGHLEVEL DESIGN:
> *
> * This file exports one main function: _hb_ot_shape_normalize().
> *
> * This function closely reflects the Unicode Normalization Algorithm,
> * yet it's different. The shaper can either prefer decomposed (NFD) or
> * composed (NFC).
> *
> * In general what happens is that each grapheme is decomposed in a chain
> * of 1:2 decompositions, marks are reordered, and then recomposed if desired.
> * So far it's like Unicode Normalization. However, the decomposition and
> * recomposition only happens if the font supports the resulting characters.
> *
> * The goals are:
> *
> * - Try to render all canonically equivalent strings similarly. To really
> * achieve this we have to always do the full decomposition and then
> * selectively recompose from there. It's kinda too expensive though, so
> * we skip some cases. For example, if composed is desired, we simply
> * don't touch 1-character clusters that are supported by the font, even
> * though their NFC may be different.
> *
> * - When a font has a precomposed character for a sequence but the 'ccmp'
> * feature in the font is not adequate, use the precomposed character,
> * which typically has better mark positioning.
> *
> * - When a font does not support a character but supports its
> * decomposition, well, use the decomposition.
> *
> * - The Indic shaper requests decomposed output. This will handle
> * splitting matra for the Indic shaper.
> */
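
To check my understanding of the font-dependent decomposition in the goals above, here is a minimal sketch in Python. It is not the HarfBuzz implementation: `font_has` is a hypothetical callback standing in for a glyph lookup, the helper names are made up, and `unicodedata.decomposition()` does not cover algorithmic Hangul decomposition, so that case is ignored here.

```python
import unicodedata
from typing import Callable, List, Optional, Tuple


def one_step(ch: str) -> Optional[Tuple[str, Optional[str]]]:
    """One canonical decomposition step: (a, b), (a, None), or None."""
    d = unicodedata.decomposition(ch)
    if not d or d.startswith("<"):  # no mapping, or compatibility-only
        return None
    parts = [chr(int(cp, 16)) for cp in d.split()]
    return (parts[0], parts[1] if len(parts) > 1 else None)


def try_decompose(ch: str, font_has: Callable[[str], bool]) -> Optional[List[str]]:
    """Follow the chain of 1:2 steps, but only through supported characters."""
    pair = one_step(ch)
    if pair is None:
        return None
    a, b = pair
    if b is not None and not font_has(b):
        return None
    head = try_decompose(a, font_has)
    if head is None:
        if not font_has(a):
            return None
        head = [a]
    return head + ([b] if b is not None else [])


def decompose_text(text: str, font_has: Callable[[str], bool],
                   prefer_decomposed: bool) -> List[str]:
    out: List[str] = []
    for ch in text:
        # Composed mode: leave characters the font already supports alone.
        if not prefer_decomposed and font_has(ch):
            out.append(ch)
            continue
        pieces = try_decompose(ch, font_has)
        out.extend(pieces if pieces is not None else [ch])
    return out
```

With a font that has only "e" and U+0301, U+00E9 comes out as the two-character sequence; with a font that has all three, it is split only when decomposed output is requested.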
>
> /* We do a fairly straightforward yet custom normalization process in three
> * separate rounds: decompose, reorder, recompose (if desired). Currently
> * this makes two buffer swaps. We can make it faster by moving the last
> * two rounds into the inner loop for the first round, but it's more
> * readable this way. */
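
The last two rounds could look roughly like the following Python sketch. It is the textbook canonical reordering and composition from the Unicode normalization algorithm with a font check bolted on, not a transcription of hb-ot-shape-normalize.cc; `font_has` is again a hypothetical glyph-lookup callback, and composing through `unicodedata.normalize("NFC", ...)` is just a convenient stand-in for a real compose() callback.

```python
import unicodedata
from typing import Callable, List


def reorder_marks(chars: List[str]) -> List[str]:
    """Stable-sort each run of non-starters by canonical combining class."""
    out = list(chars)
    i = 0
    while i < len(out):
        if unicodedata.combining(out[i]) == 0:
            i += 1
            continue
        j = i
        while j < len(out) and unicodedata.combining(out[j]) != 0:
            j += 1
        out[i:j] = sorted(out[i:j], key=unicodedata.combining)  # sorted() is stable
        i = j
    return out


def recompose(chars: List[str], font_has: Callable[[str], bool]) -> List[str]:
    """Compose each character onto the last starter if allowed and supported."""
    out: List[str] = []
    starter = None  # index in out of the most recent starter
    last_ccc = 0    # combining class of the last character appended
    for ch in chars:
        ccc = unicodedata.combining(ch)
        if starter is not None:
            # Blocked if a mark with an equal or higher class sits in between.
            blocked = len(out) - 1 > starter and last_ccc >= ccc
            if not blocked:
                composed = unicodedata.normalize("NFC", out[starter] + ch)
                if len(composed) == 1 and font_has(composed):
                    out[starter] = composed
                    continue
        if ccc == 0:
            starter = len(out)
        out.append(ch)
        last_ccc = ccc
    return out
```

For example, with a font that has U+00E1 but not U+1EA1, the reordered sequence a, U+0323, U+0301 recomposes to U+00E1 followed by U+0323: the dot below stays a separate mark and the acute still composes, because a mark with a lower combining class does not block it.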
>
>
> Comments, feedback, and testing of functionality and performance are appreciated!
>
> Cheers,
> behdad