[HarfBuzz] Hangul Shaper (was Re: an issue regarding discrepancy between Korean and Unicode standards)

Tue Apr 9 13:29:24 PDT 2013

Hi,

Ok, what you describe sounds very close to the OpenType spec:

  http://www.microsoft.com/typography/otfntdev/hangulot/

and what the ICU Layout Hangul shaper does.

The one part I don't understand is the section "Compose Old Hangul Jamo
combinations" under:

  http://www.microsoft.com/typography/otfntdev/hangulot/shaping.htm

I can't make sense of that part, since Appendix B does not list what the jamos
compose to.

Please review those documents and share any insights you may have.  I'll go
ahead with implementing a shaper then.

behdad

On 13-04-06 01:32 PM, Dohyun Kim wrote:
> 2013/4/6 Behdad Esfahbod <behdad at behdad.org>:
>> On 13-04-05 06:45 AM, Dohyun Kim wrote:
>>> 2013/4/5 Dohyun Kim <nomosnomos at gmail.com>:
>>>> Sorry for the noise.
>>>> I booted a Windows machine and tested Uniscribe a bit.  My guess
>>>> at how Uniscribe handles Hangul is:
>>>>
>>>> 1. decompose hangul syllables to jamos
>>>>
>>>> 2. compose single jamos into composite jamos wherever possible
>>>>     e.g., U+1100 U+1100 => U+1101
>>>>     Note:  a mapping table for this composition is available at
>>>>       ftp://ktug.org/ktug/hcr-lvt/composejamotojamo.map
>>>>
>>>
>>> Well, after a bit more testing, it turned out that this second step
>>> is not what Uniscribe does.  Sorry for the wrong information.  I had
>>> guessed this on the basis of an old Unicode standard.  Unicode no
>>> longer recommends using multiple single jamos to get a composite
>>> jamo either.
>>>
>>> Instead, Uniscribe inserts fillers (U+115F U+1160) around a lone
>>> jamo that does not make up a syllable block.
>>
>> Interesting.  So, for a lone T jamo, both 115F and 1160 are inserted?
> 
> Yes, when fillers are inserted.  But Uniscribe does not actually seem
> to insert fillers.  Sorry for my premature conclusion.  Today I
> downloaded the HarfBuzz win32 binary and tested some jamo texts using
> hb-shape.  This utility gave me more accurate information than I could
> obtain with the naked eye.  Contrary to my expectation, the output of
> hb-shape did not have any trace of fillers.  So it seems evident
> that Uniscribe does not insert fillers.  It also seems evident
> that Uniscribe sets boundaries between syllable blocks, so that
> multiple single jamos cannot be concatenated into a composite jamo.
> 
> Let us suppose an input text <U+1100 U+AC00 U+11F0>.  My guess at what
> Uniscribe does:
> 
> 1. decompose syllables into jamos: we get <U+1100 U+1100 U+1161 U+11F0>
> 
> 2. demarcate each syllable block by setting boundaries in between: we
> get <U+1100 | U+1100 U+1161 U+11F0> where | means syllable boundary.
> This is probably related to the so-called "cluster."  Yesterday I
> mistook this boundary (maybe ZWNJ, but I am not sure) for a filler.
> BTW, according to the old standard, U+1100 U+1100 are concatenated to
> U+1101, so the result would be a single syllable block <U+1101 U+1161
> U+11F0>.  Nowadays we do not need this jamo-to-jamo composition,
> because all the jamos known to date have been registered since
> Unicode version 5.2.
> 
> 3. try to re-compose jamos into syllable characters.  But as our sample
> text matches the case of <L V OT>, nothing is converted.
> 
> 4. apply font features: we get <U+1100 | U+1100.s U+1161.s U+11F0.s>
> where ".s" means a substituted glyph.
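
Steps 1 and 2 of the guess above can be made concrete.  What follows is
a minimal Python sketch, not Uniscribe's or HarfBuzz's actual code.
`decompose` uses the arithmetic syllable decomposition from the Unicode
Standard.  `split_blocks` is a hypothetical helper that assumes a block
is exactly one L, one V and an optional T; that rule is only an
assumption chosen to reproduce the boundary in the sample.

```python
# Constants from the Unicode Standard's Hangul syllable decomposition.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
S_COUNT = 19 * V_COUNT * T_COUNT  # 11172 precomposed syllables


def decompose(cps):
    """Step 1: replace each precomposed syllable by its L V (T) jamos."""
    out = []
    for cp in cps:
        s_index = cp - S_BASE
        if 0 <= s_index < S_COUNT:
            out.append(L_BASE + s_index // (V_COUNT * T_COUNT))
            out.append(V_BASE + (s_index % (V_COUNT * T_COUNT)) // T_COUNT)
            t_index = s_index % T_COUNT
            if t_index:  # index 0 means "no trailing consonant"
                out.append(T_BASE + t_index)
        else:
            out.append(cp)
    return out


# Jamo classes, including the extended blocks added in Unicode 5.2.
def is_l(cp):
    return 0x1100 <= cp <= 0x115F or 0xA960 <= cp <= 0xA97C


def is_v(cp):
    return 0x1160 <= cp <= 0x11A7 or 0xD7B0 <= cp <= 0xD7C6


def is_t(cp):
    return 0x11A8 <= cp <= 0x11FF or 0xD7CB <= cp <= 0xD7FB


def split_blocks(jamos):
    """Step 2 (assumed rule): a block is L V [T]; anything else stands alone."""
    blocks = []
    i = 0
    while i < len(jamos):
        end = i + 1
        if is_l(jamos[i]) and end < len(jamos) and is_v(jamos[end]):
            end += 1
            if end < len(jamos) and is_t(jamos[end]):
                end += 1
        blocks.append(jamos[i:end])
        i = end
    return blocks


sample = [0x1100, 0xAC00, 0x11F0]
blocks = split_blocks(decompose(sample))
print(" | ".join(" ".join(f"U+{cp:04X}" for cp in b) for b in blocks))
```

For the sample input this prints `U+1100 | U+1100 U+1161 U+11F0`, which
is the segmentation described in step 2.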
> 
> As I said before, we Koreans do not input text like <U+AC00 U+11F0> in
> practice.  However, there remains some possibility that some
> applications or libraries pass Unicode-normalized text to HarfBuzz,
> in which case HarfBuzz would give an incorrect result.  So I
> changed my mind, and now I suggest implementing a Hangul shaper.
> It is not an urgent task, though;  HarfBuzz works quite well already.
> However, we want HarfBuzz to be as good as possible.
> 
> Regards,
> 
> 
>>>> 3. compose jamos into Hangul syllables wherever possible
>>>>    Note:  this process complies with KS X 1026-1.  In other words, the
>>>> jamo sequence <L V> in <L V OT> is *not* converted to LV, where L means
>>>> leading consonant, V means medial vowel, OT means *old* trailing
>>>> consonant (U+11C3..U+11FF U+D7CB..U+D7FB), and LV means the Hangul
>>>> syllable equivalent to L V.
>>>>
>>>> 4. apply opentype layout features
>>>>
>>>> It is somewhat complicated but gives a perfect result.  It satisfies
>>>> both the Korean and Unicode standards.  Nevertheless, what HarfBuzz
>>>> currently does is quite good as well and I am satisfied with it.  I
>>>> am reporting this just for reference.
>>>>
> 
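
The composition rule from step 3 of the first proposal quoted above
(compose jamos into syllables, but leave <L V> alone when an old
trailing consonant follows) might look roughly like the Python sketch
below.  The function name `compose` and its list-of-code-points
interface are assumptions made for illustration; the OT ranges are the
ones given in the quoted note, and the arithmetic is the standard
Unicode Hangul composition.

```python
# Constants from the Unicode Standard's Hangul syllable composition.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def is_modern_l(cp):
    return L_BASE <= cp < L_BASE + L_COUNT  # U+1100..U+1112


def is_modern_v(cp):
    return V_BASE <= cp < V_BASE + V_COUNT  # U+1161..U+1175


def is_modern_t(cp):
    return T_BASE < cp < T_BASE + T_COUNT   # U+11A8..U+11C2


def is_old_t(cp):
    # "OT" ranges as listed in the quoted note.
    return 0x11C3 <= cp <= 0x11FF or 0xD7CB <= cp <= 0xD7FB


def compose(cps):
    """Compose <L V> and <L V T>, except when <L V> is followed by an old T."""
    out = []
    i = 0
    n = len(cps)
    while i < n:
        if i + 1 < n and is_modern_l(cps[i]) and is_modern_v(cps[i + 1]):
            nxt = cps[i + 2] if i + 2 < n else None
            if nxt is not None and is_old_t(nxt):
                # <L V OT>: keep all three as conjoining jamos.
                out.extend(cps[i:i + 3])
                i += 3
                continue
            lv = S_BASE + ((cps[i] - L_BASE) * V_COUNT
                           + (cps[i + 1] - V_BASE)) * T_COUNT
            if nxt is not None and is_modern_t(nxt):
                out.append(lv + (nxt - T_BASE))  # <L V T> => LVT
                i += 3
            else:
                out.append(lv)                   # <L V> => LV
                i += 2
            continue
        out.append(cps[i])
        i += 1
    return out
```

With this sketch, `compose([0x1100, 0x1161])` gives `[0xAC00]`, while
`compose([0x1100, 0x1161, 0x11F0])` returns its input unchanged, which
is the <L V OT> case discussed in the thread.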

-- 
behdad
http://behdad.org/