[HarfBuzz] Hangul Shaper (was Re: an issue regarding discrepancy between Korean and Unicode standards

Tue Apr 9 23:45:28 PDT 2013

2013/4/10 Dohyun Kim <nomosnomos at gmail.com>:
> 2013/4/10 Behdad Esfahbod <behdad at behdad.org>:
>> Hi,
>>
>> Ok, what you describe sounds very close to the OpenType spec:
>>
>>   http://www.microsoft.com/typography/otfntdev/hangulot/
>>
>> and what the ICU Layout Hangul shaper does.
>>
>> The one part I don't understand is the section "Compose Old Hangul Jamo
>> combinations" under:
>>
>>   http://www.microsoft.com/typography/otfntdev/hangulot/shaping.htm
>>
>> I can't make sense of that part, since Appendix B does not list what the jamos
>> compose to.
>>
>> Please review those documents and share any insights you may have.  I'll go
>> ahead with implementing a shaper then.
>>
>
> This Hangul OpenType spec from Microsoft is quite outdated.  It was
> written in 2003, ten years ago.  In the meantime, KS X 1026-1 and
> Unicode 5.2 were released, in 2007 and 2009 respectively.  Unicode
> 5.2 assigned code points to a number of new jamos, namely
> U+115A..U+115E, U+11A3..U+11A7, U+11FA..U+11FF, U+A960..U+A97C,
> U+D7B0..U+D7C6, and U+D7CB..U+D7FB.  Consequently, the items in
> Appendix B that you pointed out now all have their own Unicode code
> points.  For instance, <U+1102 U+1109> has now become <U+115B>.
> Before Unicode 5.2, Koreans had no choice but to write <U+1102
> U+1109> to represent the composite jamo composed of Choseong Nieun
> and Choseong Sios.  That is now a thing of the past.  Anyway, you can
> find the full list of composite jamos, with the elements composing
> them, at ftp://ktug.org/ktug/hcr-lvt/composejamotojamo.map, which I
> have shown before as a reference.
>
> Moreover, the Microsoft spec has incorrect information on several
> points.  The section "Compose Old Hangul Jamo combinations" is one of
> them.  This kind of jamo composition could not be done at the pre-OTLS
> stage before Unicode 5.2 was introduced, as there were no code points
> for the composed jamos at that time.  Jamo-to-jamo composition could
> be done only at the stage of applying the "ccmp" font feature.  Now
> all composite jamos are registered in Unicode, so a shaping engine can
> do this composition before applying font features.  However, this kind
> of composition is contrary to KS X 1026-1.  Section 5.3 of that spec
> says that "two or more code positions of simple letters cannot be
> concatenated to represent a single complex letter."  Certainly, this
> concatenation is allowed by the Unicode standard, though it has not
> been recommended since the release of version 5.2.  Yes, we have just
> encountered another discrepancy between local and global standards.
> But in practice, Koreans no longer input decomposed jamos to
> represent a single composite jamo.  Above all, my experiment on a
> Windows machine showed that a recent version of Uniscribe does not
> compose jamo elements into a composite jamo, even for those jamos
> which were not available before Unicode 5.2.  So I think it is better
> for us to ignore the section "Compose Old Hangul Jamo combinations"
> and its Appendix B altogether.
>
> Instead, Uniscribe sets boundaries between syllable blocks, as I
> mentioned before.  Since all the single and composite jamos now have
> their own code points, the rule to identify syllable blocks is quite
> simple:
>     L V T? M?

Today I tested Uniscribe again.  It turned out that Uniscribe does
not simply apply this rule to identify syllable blocks.  When a jamo
sequence is a candidate for composition into a composite jamo newly
added in Unicode 5.2, Uniscribe treats it as a single jamo, though it
does *not* actually compose the sequence into the composite jamo.  As
this may be a little confusing, let us take some examples.  For each
input text on the left, Uniscribe sets boundaries as shown on the right:

<U+1100 U+1100 U+1161> => <U+1100 | U+1100 U+1161> => <U+1100 | U+AC00>

<U+1100 U+1100> is a sequence which can be concatenated to <U+1101>.
However, Uniscribe divides it into two syllable blocks, because
U+1101 has been in Unicode since its very early versions.

<U+1103 U+1106 U+1161> => <U+1103 U+1106 U+1161>

<U+1103 U+1106> can be concatenated to <U+A960>, a jamo newly
registered in Unicode version 5.2.  In this case Uniscribe treats them
as a single composite jamo and so does not set a boundary between
U+1103 and U+1106.  Notice that Uniscribe does not actually compose
these element jamos to U+A960; it just lets font features do their job.

<U+1100 U+1161 U+11AB U+11AB> => <U+1100 U+1161 U+11AB U+11AB>

In a similar fashion, as <U+11AB U+11AB> can be concatenated to
<U+11FF>, which is a newly added jamo, Uniscribe does not divide
syllable blocks in between.

This policy of Uniscribe seems a little complicated.  But it is quite
reasonable, as it also supports old documents which were written
before Unicode 5.2 was introduced, ensuring backward compatibility.
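
To make the three examples above concrete, here is a rough Python
sketch of the boundary rule as I understand it from these tests.  It is
only my reading of the observed behaviour, not Uniscribe's actual code.
The function names are made up for illustration, NEW_COMPOSITES holds
just the two pairs from the examples (the full list is in
composejamotojamo.map), only two-element sequences are handled, and a
block carrying a tone mark is simply left uncomposed.

```python
# Two-entry excerpt for illustration only: element pairs whose composite
# jamo was newly added in Unicode 5.2.
NEW_COMPOSITES = {
    (0x1103, 0x1106): 0xA960,  # choseong tikeut + choseong mieum
    (0x11AB, 0x11AB): 0x11FF,  # jongseong nieun + jongseong nieun
}

def jamo_class(cp):
    """Classify a code point as L, V, T, M (tone mark) or None."""
    if 0x1100 <= cp <= 0x115F or 0xA960 <= cp <= 0xA97C:
        return "L"  # includes the Choseong filler U+115F
    if 0x1160 <= cp <= 0x11A7 or 0xD7B0 <= cp <= 0xD7C6:
        return "V"  # includes the Jungseong filler U+1160
    if 0x11A8 <= cp <= 0x11FF or 0xD7CB <= cp <= 0xD7FB:
        return "T"
    if cp in (0x302E, 0x302F):
        return "M"
    return None

def unit_len(cps, i, cls):
    """Length of one jamo unit of class cls at index i (0 if none).

    A pair listed in NEW_COMPOSITES counts as a single unit of length 2,
    but it is not replaced by the composite jamo.
    """
    if i >= len(cps) or jamo_class(cps[i]) != cls:
        return 0
    if tuple(cps[i:i + 2]) in NEW_COMPOSITES:
        return 2
    return 1

def segment(cps):
    """Split code points into blocks matching L V T? M?; anything that
    does not start such a block becomes a block of its own."""
    blocks, i = [], 0
    while i < len(cps):
        l = unit_len(cps, i, "L")
        v = unit_len(cps, i + l, "V") if l else 0
        if l and v:
            end = i + l + v
            end += unit_len(cps, end, "T")
            end += unit_len(cps, end, "M")
        else:
            end = i + 1
        blocks.append(cps[i:end])
        i = end
    return blocks

def compose(block):
    """Compose a block to a precomposed syllable only if the whole block
    is a modern <L V> or <L V T>; otherwise return it unchanged."""
    if len(block) not in (2, 3):
        return block
    l, v = block[0] - 0x1100, block[1] - 0x1161
    t = block[2] - 0x11A7 if len(block) == 3 else 0
    if 0 <= l < 19 and 0 <= v < 21 and (len(block) == 2 or 1 <= t < 28):
        return [0xAC00 + (l * 21 + v) * 28 + t]
    return block

if __name__ == "__main__":
    for text in ([0x1100, 0x1100, 0x1161],
                 [0x1103, 0x1106, 0x1161],
                 [0x1100, 0x1161, 0x11AB, 0x11AB]):
        blocks = [compose(b) for b in segment(text)]
        print(" | ".join(" ".join("U+%04X" % cp for cp in b) for b in blocks))
```

Run on the three inputs above, this splits the first one into two
blocks and composes the second block to U+AC00, and it keeps the other
two inputs as single uncomposed blocks, leaving the rest to the font
features.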

> where L is a leading consonant, including the Choseong filler; V is a
> medial vowel, including the Jungseong filler; T is a trailing
> consonant; M is a Hangul tone mark (U+302E, U+302F); and ? means zero
> or one occurrence of the preceding item.  Uniscribe seems to set
> boundaries before and after each such jamo sequence.  What is
> important is that Uniscribe composes jamos into a syllable only when
> the complete sequence <L V T?> matches a precomposed Hangul syllable.
> In other words, <L V OT> is not composed, and Uniscribe passes the
> sequence intact to the OTLS process.
>
> Thanks a lot for your effort to support Hangul.
> Best,
>
>>
>> On 13-04-06 01:32 PM, Dohyun Kim wrote:
>>> 2013/4/6 Behdad Esfahbod <behdad at behdad.org>:
>>>> On 13-04-05 06:45 AM, Dohyun Kim wrote:
>>>>> 2013/4/5 Dohyun Kim <nomosnomos at gmail.com>:
>>>>>> Sorry for the noise.
>>>>>> I booted a Windows machine and tested Uniscribe a bit.  My guess
>>>>>> on how Uniscribe works on Hangul is:
>>>>>>
>>>>>> 1. decompose hangul syllables to jamos
>>>>>>
>>>>>> 2. compose single jamos into composite jamos wherever possible
>>>>>>     eg., U+1100 U+1100 => U+1101
>>>>>>     Note:  mapping table for this composition is available at
>>>>>>       ftp://ktug.org/ktug/hcr-lvt/composejamotojamo.map
>>>>>>
>>>>>
>>>>> Well, after a bit more testing, it turned out that this second step
>>>>> is not what Uniscribe does.  Sorry for the wrong information.  I
>>>>> guessed it on the basis of the old Unicode standard.  Unicode no
>>>>> longer recommends using multiple single jamos to represent a
>>>>> composite jamo either.
>>>>>
>>>>> Instead, Uniscribe inserts fillers (U+115F U+1160) around a single
>>>>> lone jamo which does not make up a syllable block.
>>>>
>>>> Interesting.  So, for a lone T jamo, both 115F and 1160 are inserted?
>>>
>>> Yes, when fillers are inserted.  But actually Uniscribe does not seem
>>> to insert fillers.  Sorry for my premature conclusion.  Today I
>>> downloaded the HarfBuzz win32 binary and tested some jamo texts using
>>> hb-shape.  This utility gave me more accurate information than I could
>>> obtain with the naked eye.  Contrary to my expectation, the output of
>>> hb-shape did not have any trace of fillers.  So it seems evident
>>> that Uniscribe does not insert fillers.  It also seems evident
>>> that Uniscribe sets boundaries between syllable blocks, so that
>>> multiple single jamos cannot be concatenated into a composite jamo.
>>>
>>> Let us suppose an input text <U+1100 U+AC00 U+11F0>.  I guess what
>>> uniscribe does:
>>>
>>> 1. decompose syllables to jamos: we get <U+1100 U+1100 U+1161 U+11F0>
>>>
>>> 2. demarcate each syllable block by setting boundaries in-between: we
>>> get <U+1100 | U+1100 U+1161 U+11F0> where | means syllable boundary.
>>> Probably this is related to the so-called "cluster."  Yesterday I
>>> mistook this boundary (maybe ZWNJ, but I am not sure) for a filler.
>>> BTW, according to the old standard, U+1100 U+1100 are concatenated to
>>> U+1101, so the result would be a single syllable block <U+1101 U+1161
>>> U+11F0>.  Nowadays we do not need this jamo-to-jamo composition,
>>> because all the jamos known to date have been registered since
>>> Unicode version 5.2.
>>>
>>> 3. try to re-compose jamos into a syllable letter.  But as our sample
>>> text matches the case of <L V OT>, nothing is converted.
>>>
>>> 4. apply font features: we get <U+1100 | U+1100.s U+1161.s U+11F0.s>
>>> where ".s" means substituted glyph.
>>>
>>> As I said before, we Koreans do not input text like <U+AC00 U+11F0> in
>>> practice.  However, there remains some possibility that some
>>> applications or libraries pass Unicode-normalized text to HarfBuzz,
>>> in which case HarfBuzz would give us an incorrect result.  So I
>>> changed my mind, and now I suggest implementing a Hangul shaper.
>>> It is not an urgent task, though; HarfBuzz works quite well already.
>>> However, we want HarfBuzz to be as perfect as possible.
>>>
>>> Regards,
>>>
>>>
>>>>>> 3. compose jamos into Hangul syllables wherever possible
>>>>>>    Note:  this process complies with KS X 1026-1.  In other words,
>>>>>> the jamo sequence <L V> in <L V OT> is *not* converted to LV, where
>>>>>> L means leading consonant, V means medial vowel, OT means *old*
>>>>>> trailing consonant (U+11C3..U+11FF U+D7CB..U+D7FB), and LV means
>>>>>> the Hangul syllable equivalent to L V.
>>>>>>
>>>>>> 4. apply OpenType layout features
>>>>>>
>>>>>> It is somewhat complicated but gives a perfect result.  It satisfies
>>>>>> both the Korean and Unicode standards.  Nevertheless, what current
>>>>>> HarfBuzz does is quite excellent as well and I am satisfied with it.
>>>>>> I am reporting this just for reference.
>>>>>>
>>>
>>
>> --
>> behdad
>> http://behdad.org/
>
>
>
> --
> Dohyun Kim
> College of Law, Dongguk University
> Seoul, Republic of Korea

--
Dohyun Kim
College of Law, Dongguk University
Seoul, Republic of Korea