[HarfBuzz] Hangul Shaper (was Re: an issue regarding discrepancy between Korean and Unicode standards)

Dohyun Kim nomosnomos at gmail.com
Tue Apr 9 21:15:33 PDT 2013


2013/4/10 Behdad Esfahbod <behdad at behdad.org>:
> Hi,
>
> Ok, what you describe sounds very close to the OpenType spec:
>
>   http://www.microsoft.com/typography/otfntdev/hangulot/
>
> and what the ICU Layout Hangul shaper does.
>
> The one part I don't understand is the section "Compose Old Hangul Jamo
> combinations" under:
>
>   http://www.microsoft.com/typography/otfntdev/hangulot/shaping.htm
>
> I can't make sense of that part, since Appendix B does not list what the jamos
> compose to.
>
> Please review those documents and share any insights you may have.  I'll go
> ahead with implementing a shaper then.
>

This Hangul OpenType spec from Microsoft is quite outdated.  It was
written in 2003, ten years ago.  In the meantime, KS X 1026-1 and
Unicode 5.2 were released, in 2007 and 2009 respectively.
Unicode 5.2 assigned code points to a number of new jamos:
U+115A..U+115E, U+11A3..U+11A7, U+11FA..U+11FF, U+A960..U+A97C,
U+D7B0..U+D7C6, and U+D7CB..U+D7FB.  Consequently, the items in
Appendix B that you pointed out now all have their own Unicode code
points.  For instance, <U+1102 U+1109> has become <U+115B>.
Before Unicode 5.2, Koreans had no choice but to write <U+1102
U+1109> to represent the composite jamo composed of Choseong
Nieun and Choseong Sios.  That is now a thing of the past.  Anyway,
you can find the full list of composite jamos, with the elements
composing them, at ftp://ktug.org/ktug/hcr-lvt/composejamotojamo.map,
which I showed before as a reference.
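
To make the ranges above easier to check, here is a minimal Python
sketch that tests whether a code point falls in one of the ranges
listed above as new in Unicode 5.2.  The names NEW_JAMO_RANGES,
is_new_in_5_2 and describe are my own, for illustration only; they
are not part of HarfBuzz or any other library.

```python
import unicodedata

# Ranges listed in this message as newly assigned in Unicode 5.2.
NEW_JAMO_RANGES = [
    (0x115A, 0x115E),
    (0x11A3, 0x11A7),
    (0x11FA, 0x11FF),
    (0xA960, 0xA97C),
    (0xD7B0, 0xD7C6),
    (0xD7CB, 0xD7FB),
]


def is_new_in_5_2(cp: int) -> bool:
    """Return True if cp lies in one of the ranges listed above."""
    return any(lo <= cp <= hi for lo, hi in NEW_JAMO_RANGES)


def describe(cp: int) -> str:
    """Format a code point with its Unicode character name, if any."""
    return f"U+{cp:04X} {unicodedata.name(chr(cp), '<unnamed>')}"


if __name__ == "__main__":
    # U+115B is the composite jamo formerly written as <U+1102 U+1109>;
    # U+1101 was already encoded before Unicode 5.2.
    for cp in (0x115B, 0x1101):
        print(describe(cp), "new in 5.2:", is_new_in_5_2(cp))
```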

Moreover, the Microsoft spec has incorrect information on several
points.  The section "Compose Old Hangul Jamo combinations" is one of
them.  This kind of jamo composition could not be done at the pre-OTLS
stage before Unicode 5.2 was introduced, as there were no code points
for the composed jamos at that time.  Jamo-to-jamo composition could
be done only at the stage of applying the "ccmp" font feature.  Now
that all composite jamos are registered in Unicode, a shaping engine
could do this composition before applying font features.  However,
this kind of composition is contrary to KS X 1026-1.  Section 5.3 of
that spec says that "two or more code positions of simple letters
cannot be concatenated to represent a single complex letter."
Certainly, this concatenation is allowed by the Unicode standard,
though it has not been recommended since the release of version 5.2.
Yes, we have just encountered another discrepancy between the local
and global standards.  But in practice, Koreans no longer input
decomposed jamos to represent a single composite jamo.  Above all, my
experiment on a Windows machine showed that a recent version of
Uniscribe does not compose jamo elements into a composite jamo, even
for those jamos which were not available before Unicode 5.2.  So I
think it is better for us to ignore the section "Compose Old Hangul
Jamo combinations" and its Appendix B altogether.

Instead, Uniscribe sets boundaries between syllable blocks, as I
mentioned before.  Since all the single and composite jamos now have
their own code points, the rule to identify syllable blocks is quite
simple:
    L V T? M?
where L is a leading consonant, including the Choseong filler; V is a
medial vowel, including the Jungseong filler; T is a trailing
consonant; M is a Hangul tone mark (U+302E or U+302F); and ? means
zero or one occurrence of the preceding character.  Uniscribe seems to
set boundaries before and after each such jamo sequence.  What is
important is that Uniscribe composes jamos into a syllable only when
the complete sequence <L V T?> matches a precomposed Hangul syllable.
In other words, <L V OT>, where OT is an old trailing consonant, is
not composed, and Uniscribe passes the sequence intact to the OTLS
process.
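
To illustrate the rule, here is a rough Python sketch of the three
pre-OTLS steps as I understand them: decompose, find syllable blocks
with L V T? M?, and recompose only when the whole <L V T?> sequence has
a precomposed syllable.  It is only a model of the observed behaviour,
not Uniscribe's or HarfBuzz's actual code.  The function names and the
exact code point ranges of the L, V and T classes are my assumptions
(taken from the Hangul Jamo, Extended-A and Extended-B blocks), and
keeping the tone mark after a composed syllable is also an assumption.

```python
import re

# Character classes (assumed ranges, see the note above).
L = "\u1100-\u115F\uA960-\uA97C"  # leading consonants, incl. filler U+115F
V = "\u1160-\u11A7\uD7B0-\uD7C6"  # medial vowels, incl. filler U+1160
T = "\u11A8-\u11FF\uD7CB-\uD7FB"  # trailing consonants, modern and old
M = "\u302E\u302F"                # Hangul tone marks

SYLLABLE_BLOCK = re.compile(f"[{L}][{V}][{T}]?[{M}]?")

# Constants of the standard Hangul syllable arithmetic.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def decompose(text: str) -> str:
    """Step 1: split every precomposed syllable into its jamos."""
    out = []
    for ch in text:
        s = ord(ch) - S_BASE
        if 0 <= s < L_COUNT * V_COUNT * T_COUNT:
            out.append(chr(L_BASE + s // (V_COUNT * T_COUNT)))
            out.append(chr(V_BASE + (s % (V_COUNT * T_COUNT)) // T_COUNT))
            if s % T_COUNT:
                out.append(chr(T_BASE + s % T_COUNT))
        else:
            out.append(ch)
    return "".join(out)


def segment(text: str) -> list[str]:
    """Step 2: cut the text into blocks matching L V T? M?.
    Anything that does not start such a block stands alone."""
    blocks, i = [], 0
    while i < len(text):
        m = SYLLABLE_BLOCK.match(text, i)
        end = m.end() if m else i + 1
        blocks.append(text[i:end])
        i = end
    return blocks


def compose_block(block: str) -> str:
    """Step 3: recompose only if the whole <L V T?> is a modern syllable."""
    mark = block[-1] if block and block[-1] in M else ""
    body = block[:-1] if mark else block
    if len(body) not in (2, 3):
        return block
    l = ord(body[0]) - L_BASE
    v = ord(body[1]) - V_BASE
    t = ord(body[2]) - T_BASE if len(body) == 3 else 0
    modern_t = len(body) == 2 or 1 <= t < T_COUNT
    if 0 <= l < L_COUNT and 0 <= v < V_COUNT and modern_t:
        return chr(S_BASE + (l * V_COUNT + v) * T_COUNT + t) + mark
    return block  # e.g. <L V OT>: passed on intact


if __name__ == "__main__":
    jamos = decompose("\u1100\uAC00\u11F0")
    blocks = [compose_block(b) for b in segment(jamos)]
    print(" | ".join(" ".join(f"U+{ord(c):04X}" for c in b) for b in blocks))
    # prints: U+1100 | U+1100 U+1161 U+11F0
```

With the sample input <U+1100 U+AC00 U+11F0>, the lone U+1100 becomes
its own block, and the second block stays decomposed because U+11F0 is
an old trailing consonant.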

Thanks a lot for your effort to support Hangul.
Best,

>
> On 13-04-06 01:32 PM, Dohyun Kim wrote:
>> 2013/4/6 Behdad Esfahbod <behdad at behdad.org>:
>>> On 13-04-05 06:45 AM, Dohyun Kim wrote:
>>>> 2013/4/5 Dohyun Kim <nomosnomos at gmail.com>:
>>>>> Sorry for the noise.
>>>>> I have booted a Windows machine and tested Uniscribe a bit.  My guess
>>>>> on how Uniscribe works on Hangul is:
>>>>>
>>>>> 1. decompose Hangul syllables into jamos
>>>>>
>>>>> 2. compose single jamos into composite jamos wherever possible
>>>>>     e.g., U+1100 U+1100 => U+1101
>>>>>     Note:  a mapping table for this composition is available at
>>>>>       ftp://ktug.org/ktug/hcr-lvt/composejamotojamo.map
>>>>>
>>>>
>>>> Well, after a bit more testing, it turned out that this second step is
>>>> not what Uniscribe does.  Sorry for the wrong information.  I had
>>>> guessed this on the basis of the old Unicode standard.  Unicode now
>>>> also recommends against using multiple single jamos to form a
>>>> composite jamo.
>>>>
>>>> Instead, Uniscribe inserts fillers (U+115F U+1160) around a lone
>>>> single jamo that does not make up a syllable block.
>>>
>>> Interesting.  So, for a lone T jamo, both 115F and 1160 are inserted?
>>
>> Yes, when fillers are inserted.  But actually Uniscribe does not seem
>> to insert fillers.  Sorry for my premature conclusion.  Today I
>> downloaded the harfbuzz win32 binary and tested some jamo texts using
>> hb-shape.  This utility gave me more accurate information than I could
>> obtain with the naked eye.  Contrary to my expectation, the output of
>> hb-shape did not have any traces of fillers.  So it seems evident
>> that Uniscribe does not insert fillers.  It also seems evident
>> that Uniscribe sets boundaries between syllable blocks, so that
>> multiple single jamos cannot be concatenated into a composite jamo.
>>
>> Let us suppose an input text <U+1100 U+AC00 U+11F0>.  My guess at what
>> Uniscribe does:
>>
>> 1. decompose syllables into jamos: we get <U+1100 U+1100 U+1161 U+11F0>
>>
>> 2. demarcate each syllable block by setting boundaries in between: we
>> get <U+1100 | U+1100 U+1161 U+11F0> where | means a syllable boundary.
>> Probably this is related to the so-called "cluster."  Yesterday I
>> mistook this boundary (maybe ZWNJ, but I am not sure) for a filler.
>> BTW, according to the old standard, U+1100 U+1100 are concatenated into
>> U+1101, so the result would be a single syllable block <U+1101 U+1161
>> U+11F0>.  Nowadays we do not need this jamo-to-jamo composition,
>> because all the jamos known to date have been registered since
>> Unicode version 5.2.
>>
>> 3. try to recompose jamos into a syllable letter.  But as our sample
>> text matches the case of <L V OT>, nothing is converted.
>>
>> 4. apply font features: we get <U+1100 | U+1100.s U+1161.s U+11F0.s>
>> where ".s" means a substituted glyph.
>>
>> As I said before, we Koreans do not input text like <U+AC00 U+11F0> in
>> practice.  However, there remains some possibility that some
>> applications or libraries pass Unicode-normalized text to harfbuzz,
>> in which case harfbuzz would give us an incorrect result.  So I
>> changed my mind, and now I suggest implementing a Hangul shaper.
>> It is not an urgent task, though;  harfbuzz works quite well already.
>> However, we want harfbuzz to be as perfect as possible.
>>
>> Regards,
>>
>>
>>>>> 3. compose jamos into Hangul syllables wherever possible
>>>>>    Note:  this process complies with KS X 1026-1.  In other words, the
>>>>> jamo sequence <L V> in <L V OT> is *not* converted to LV, where L means
>>>>> leading consonant, V means medial vowel, OT means *old* trailing
>>>>> consonant (U+11C3..U+11FF U+D7CB..U+D7FB), and LV means the Hangul
>>>>> syllable equivalent to L V.
>>>>>
>>>>> 4. apply opentype layout features
>>>>>
>>>>> It is somewhat complicated but gives a perfect result.  It satisfies
>>>>> both the Korean and Unicode standards.  Nevertheless, what current
>>>>> harfbuzz does is quite excellent as well and I am satisfied with it.  I
>>>>> am reporting this just for reference.
>>>>>
>>
>
> --
> behdad
> http://behdad.org/



--
Dohyun Kim
College of Law, Dongguk University
Seoul, Republic of Korea


