[HarfBuzz] Hangul Shaper (was Re: an issue regarding discrepancy between Korean and Unicode standards)

Behdad Esfahbod behdad at behdad.org
Thu Apr 11 10:44:52 PDT 2013


Ok, I'm more confused now :).  I'll find some time to put something together
and take it from there.  In the meantime, if you can compile a list of
sequences that would test all the corner cases you can think of, that would
immensely help with the implementation.

Thanks,
b

On 13-04-10 02:45 AM, Dohyun Kim wrote:
> 2013/4/10 Dohyun Kim <nomosnomos at gmail.com>:
>> 2013/4/10 Behdad Esfahbod <behdad at behdad.org>:
>>> Hi,
>>>
>>> Ok, what you describe sounds very close to the OpenType spec:
>>>
>>>   http://www.microsoft.com/typography/otfntdev/hangulot/
>>>
>>> and what the ICU Layout Hangul shaper does.
>>>
>>> The one part I don't understand is the section "Compose Old Hangul Jamo
>>> combinations" under:
>>>
>>>   http://www.microsoft.com/typography/otfntdev/hangulot/shaping.htm
>>>
>>> I can't make sense of that part, since Appendix B does not list what the jamos
>>> compose to.
>>>
>>> Please review those documents and share any insights you may have.  I'll go
>>> ahead with implementing a shaper then.
>>>
>>
>> This Hangul OpenType spec from Microsoft is quite outdated.  It was
>> written in 2003, ten years ago.  In the meantime, KS X 1026-1
>> and Unicode 5.2 were released, in 2007 and 2009 respectively.
>> Unicode 5.2 assigned code points to a number of new jamos, namely
>> U+115A..U+115E, U+11A3..U+11A7, U+11FA..U+11FF, U+A960..U+A97C,
>> U+D7B0..U+D7C6, and U+D7CB..U+D7FB.  Consequently, the items in
>> Appendix B that you pointed out now all have their own Unicode code
>> points.  For instance, <U+1102 U+1109> has now become <U+115B>.
>> Before Unicode 5.2, Koreans had no choice but to write <U+1102
>> U+1109> to represent the composite jamo composed of Choseong
>> Nieun and Choseong Sios.  That is now a thing of the past.  Anyway, you
>> can find the full list of composite jamos, with the elements composing
>> them, at ftp://ktug.org/ktug/hcr-lvt/composejamotojamo.map which I have
>> shown before as a reference.
>>
>> Moreover, the Microsoft spec has incorrect information on several
>> points.  The section "Compose Old Hangul Jamo combinations" is one of
>> them.  This kind of jamo composition could not be done at the pre-OTLS
>> stage before Unicode 5.2 was introduced, as there were no code points
>> for composed jamos at that time.  Jamo-to-jamo composition could be
>> done only at the stage of applying the "ccmp" font feature.  Now we have
>> all composite jamos registered in Unicode, so a shaping engine can do
>> this composition before applying font features.  However, this kind of
>> composition is contrary to the spec of KS X 1026-1.  Section 5.3 of
>> this spec says that "two or more code positions of simple letters
>> cannot be concatenated to represent a single complex letter."
>> Certainly, this concatenation is allowed according to the Unicode
>> standard, though not recommended since the release of version 5.2.
>> Yes, we have just encountered another discrepancy between local and
>> global standards.  But, in practice, Koreans no longer input
>> decomposed jamos to represent a single composite jamo.  Above
>> all, it turned out from my experiment on a Windows machine that a recent
>> version of Uniscribe does not compose jamo elements to a composite
>> jamo, even for those jamos which were not available before Unicode
>> 5.2.  So I think it is better for us to ignore the section "Compose
>> Old Hangul Jamo combinations" and its Appendix B altogether.
>>
>> Instead, Uniscribe sets boundaries between syllable blocks as I
>> mentioned before.  As we know that all the single and composite jamos
>> have their own code points, the rule to identify syllable blocks is
>> quite simple now:
>>     L V T? M?
> 
> Today I have tested Uniscribe again.  It turned out that Uniscribe
> does not simply apply this rule to identify syllable blocks.  When a
> jamo sequence is a candidate for composition into a composite jamo newly
> added in Unicode 5.2, Uniscribe treats it as a single jamo, though
> it does *not* actually compose the sequence into the composite jamo.  As
> this may be a little confusing, let us take some examples.  For each
> input text on the left, Uniscribe sets boundaries as shown on the right:
> 
> <U+1100 U+1100 U+1161> => <U+1100 | U+1100 U+1161> => <U+1100 | U+AC00>
> 
> <U+1100 U+1100> is a sequence which can be concatenated to <U+1101>.
> However, Uniscribe divides them into two syllable blocks, because
> U+1101 has been in Unicode since its very early versions.
> 
> <U+1103 U+1106 U+1161> => <U+1103 U+1106 U+1161>
> 
> <U+1103 U+1106> can be concatenated to <U+A960>, a jamo newly
> registered in Unicode version 5.2.  In this case Uniscribe treats them as
> a single composite jamo and so does not set a boundary between U+1103
> and U+1106.  Notice that Uniscribe does not actually compose these
> element jamos to U+A960; it just lets the font features do their job.
> 
> <U+1100 U+1161 U+11AB U+11AB> => <U+1100 U+1161 U+11AB U+11AB>
> 
> In a similar fashion, as <U+11AB U+11AB> can be concatenated to
> <U+11FF>, which is a newly added jamo, Uniscribe does not divide
> syllable blocks in between.
> 
> This policy of Uniscribe seems to be a little complicated.  But it
> is quite reasonable, as it also supports old documents which were
> written before Unicode 5.2 was introduced, ensuring backward
> compatibility.
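
The boundary behaviour in the three examples above can be sketched in
Python.  This is only a minimal illustration, not Uniscribe's actual
algorithm: the names split_blocks and NEW_PAIRS are hypothetical, the
pair table is limited to the three pairs mentioned in this thread, and
the plain L -> V -> T ordering check is an assumption made to keep the
example short.

```python
# Pairs that map to a jamo newly added in Unicode 5.2 (from this thread only):
# <1103 1106> -> A960, <11AB 11AB> -> 11FF, <1102 1109> -> 115B.
NEW_PAIRS = {(0x1103, 0x1106), (0x11AB, 0x11AB), (0x1102, 0x1109)}

ORDER = {"L": 0, "V": 1, "T": 2}


def jamo_class(cp):
    """Classify a code point as leading (L), vowel (V), trailing (T) or other (X)."""
    if 0x1100 <= cp <= 0x115F or 0xA960 <= cp <= 0xA97C:
        return "L"
    if 0x1160 <= cp <= 0x11A7 or 0xD7B0 <= cp <= 0xD7C6:
        return "V"
    if 0x11A8 <= cp <= 0x11FF or 0xD7CB <= cp <= 0xD7FB:
        return "T"
    return "X"


def split_blocks(cps):
    """Split code points into syllable blocks without composing anything."""
    blocks = []
    for cp in cps:
        if blocks:
            prev = blocks[-1][-1]
            pc, cc = jamo_class(prev), jamo_class(cp)
            # L followed by V, or V followed by T, continues the block.
            advances = pc in ORDER and cc in ORDER and ORDER[cc] == ORDER[pc] + 1
            # A pair that could form a Unicode 5.2 jamo is kept together.
            # (Chains of more than two elements are not guarded against.)
            merges = (prev, cp) in NEW_PAIRS
            if advances or merges:
                blocks[-1].append(cp)
                continue
        blocks.append([cp])
    return blocks
```

With this sketch, <U+1100 U+1100 U+1161> splits into two blocks, while
<U+1103 U+1106 U+1161> and <U+1100 U+1161 U+11AB U+11AB> each stay in
one block, matching the three cases listed above.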
> 
> 
>> where L is a leading consonant, including the Choseong filler; V is a
>> medial vowel, including the Jungseong filler; T is a trailing consonant;
>> M is a Hangul tone mark (U+302E U+302F); and ? means zero or one
>> occurrence of the preceding item.  Uniscribe seems to set boundaries
>> before and after such a jamo sequence.  And what is important is that
>> Uniscribe composes jamos to a syllable only when the complete sequence
>> <L V T?> matches a precomposed Hangul syllable.  In other words, <L V OT>
>> is not composed and Uniscribe passes the sequence intact to the OTLS
>> process.
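
The composition condition described above can be sketched in Python
with the standard Unicode arithmetic for precomposed syllables.  The
function name compose_block is hypothetical, and carrying a trailing
tone mark through unchanged is an assumption; this is a sketch of the
rule, not a description of Uniscribe internals.

```python
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
TONE_MARKS = {0x302E, 0x302F}


def compose_block(block):
    """Compose one <L V T? M?> block only if <L V T?> is a modern syllable."""
    # Assumption: an optional trailing tone mark is simply carried through.
    tone = block[-1:] if block and block[-1] in TONE_MARKS else []
    jamos = block[:len(block) - len(tone)]
    if len(jamos) not in (2, 3):
        return list(block)
    l, v = jamos[0], jamos[1]
    t = jamos[2] if len(jamos) == 3 else None
    # Only modern jamos have a precomposed syllable; an old trailing
    # consonant (OT) makes the whole block ineligible.
    modern = (0x1100 <= l <= 0x1112 and 0x1161 <= v <= 0x1175
              and (t is None or 0x11A8 <= t <= 0x11C2))
    if not modern:
        return list(block)
    t_index = 0 if t is None else t - T_BASE
    syllable = S_BASE + ((l - L_BASE) * V_COUNT + (v - V_BASE)) * T_COUNT + t_index
    return [syllable] + tone
```

For example, <U+1100 U+1161> becomes <U+AC00>, while
<U+1100 U+1161 U+11F0> is returned unchanged because U+11F0 is an old
trailing consonant.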
>>
>> Thanks a lot for your effort to support Hangul.
>> Best,
>>
>>>
>>> On 13-04-06 01:32 PM, Dohyun Kim wrote:
>>>> 2013/4/6 Behdad Esfahbod <behdad at behdad.org>:
>>>>> On 13-04-05 06:45 AM, Dohyun Kim wrote:
>>>>>> 2013/4/5 Dohyun Kim <nomosnomos at gmail.com>:
>>>>>>> Sorry for the noise.
>>>>>>> I have booted on Windows machine and tested uniscribe a bit.  My guess
>>>>>>> on how uniscribe works on Hangul is:
>>>>>>>
>>>>>>> 1. decompose hangul syllables to jamos
>>>>>>>
>>>>>>> 2. compose single jamos to composite jamos as far as possible
>>>>>>>     eg., U+1100 U+1100 => U+1101
>>>>>>>     Note:  mapping table for this composition is available at
>>>>>>>       ftp://ktug.org/ktug/hcr-lvt/composejamotojamo.map
>>>>>>>
>>>>>>
>>>>>> Well, after a bit more testing, it turned out that this second step
>>>>>> is not what Uniscribe does.  Sorry for my wrong information.  I had
>>>>>> guessed this on the basis of the old Unicode standard.  Unicode also
>>>>>> no longer recommends using multiple single jamos to get a composite
>>>>>> jamo.
>>>>>>
>>>>>> Instead, Uniscribe inserts fillers (U+115F U+1160) around a
>>>>>> lone jamo which does not make up a syllable block.
>>>>>
>>>>> Interesting.  So, for a lone T jamo, both 115F and 1160 are inserted?
>>>>
>>>> Yes, when fillers are inserted.  But actually Uniscribe does not seem
>>>> to insert fillers.  Sorry for my premature conclusion.  Today I have
>>>> downloaded the harfbuzz win32 binary and tested some jamo texts using
>>>> hb-shape.  This utility gave me more accurate information than I could
>>>> obtain with the naked eye.  Contrary to my expectation, the output of
>>>> hb-shape did not have any traces of fillers.  So, it seems evident
>>>> that Uniscribe does not insert fillers.  And it also seems evident
>>>> that Uniscribe sets boundaries between syllable blocks, so that
>>>> multiple single jamos cannot be concatenated to a composite jamo.
>>>>
>>>> Let us suppose an input text <U+1100 U+AC00 U+11F0>.  My guess at what
>>>> Uniscribe does:
>>>>
>>>> 1. decompose syllables to jamos: we get <U+1100 U+1100 U+1161 U+11F0>
>>>>
>>>> 2. demarcate each syllable block by setting boundaries in-between: we
>>>> get <U+1100 | U+1100 U+1161 U+11F0> where | means syllable boundary.
>>>> Probably this is related to the so-called "cluster."  Yesterday I
>>>> misconceived this boundary (maybe ZWNJ but I am not sure) as a filler.
>>>> BTW, according to the old standard, U+1100 U+1100 are concatenated to
>>>> U+1101, so the result would be a single syllable block <U+1101 U+1161
>>>> U+11F0>.  Nowadays we do not need this jamo-to-jamo composition,
>>>> because all the jamos known today have been registered since
>>>> Unicode version 5.2.
>>>>
>>>>  3. try to re-compose jamos to a syllable letter.  But as our sample
>>>> text matches the case of <L V OT>, nothing is converted.
>>>>
>>>> 4. apply font features: we get <U+1100 | U+1100.s U+1161.s U+11F0.s>
>>>> where ".s" means substituted glyph.
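
Step 1 of the list above can be sketched in Python using the standard
Unicode arithmetic for precomposed Hangul syllables.  The function name
decompose is hypothetical, and steps 2 to 4 (boundaries, re-composition,
font features) are not modelled here.

```python
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
S_COUNT = 19 * V_COUNT * T_COUNT  # 11172 precomposed syllables


def decompose(cps):
    """Replace each precomposed Hangul syllable by its L V (T) jamos."""
    out = []
    for cp in cps:
        s_index = cp - S_BASE
        if 0 <= s_index < S_COUNT:
            out.append(L_BASE + s_index // (V_COUNT * T_COUNT))
            out.append(V_BASE + (s_index % (V_COUNT * T_COUNT)) // T_COUNT)
            t_index = s_index % T_COUNT
            if t_index:
                # Index 0 means the syllable has no trailing consonant.
                out.append(T_BASE + t_index)
        else:
            # Anything that is not a precomposed syllable passes through.
            out.append(cp)
    return out
```

Applied to the sample text <U+1100 U+AC00 U+11F0>, this gives
<U+1100 U+1100 U+1161 U+11F0>, the sequence shown in step 1.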
>>>>
>>>> As I said before, we Koreans do not input text like <U+AC00 U+11F0> in
>>>> practice.  However, there remains some possibility that some
>>>> applications or libraries do pass to harfbuzz some Unicode-normalized
>>>> text, in which case harfbuzz would give us an incorrect result.  So I
>>>> changed my mind, and now I suggest an implementation of a Hangul shaper.
>>>>  It is not an urgent task, though;  harfbuzz works quite well already.
>>>>  However, we want harfbuzz to be as perfect as possible.
>>>>
>>>> Regards,
>>>>
>>>>
>>>>>>> 3. compose jamos to hangul syllables as far as possible
>>>>>>>    Note:  this process complies with KS X 1026-1.  In other words, jamo
>>>>>>> sequence <L V> in <L V OT> is *not* converted to LV, where L means
>>>>>>> leading consonant, V means medial vowel, OT means *old* trailing
>>>>>>> consonant (U+11C3..U+11FF U+D7CB..U+D7FB), and LV means the Hangul
>>>>>>> syllable equivalent to L V.
>>>>>>>
>>>>>>> 4. apply opentype layout features
>>>>>>>
>>>>>>> It is somewhat complicated but gives a perfect result.  It satisfies
>>>>>>> both the Korean and Unicode standards.  Nevertheless, what current
>>>>>>> harfbuzz does is quite excellent as well and I am satisfied with it.  I
>>>>>>> am reporting just for reference.
>>>>>>>
>>>>
>>>
>>> --
>>> behdad
>>> http://behdad.org/
>>
>>
>>
>> --
>> Dohyun Kim
>> College of Law, Dongguk University
>> Seoul, Republic of Korea
> 
> 
> 
> --
> Dohyun Kim
> College of Law, Dongguk University
> Seoul, Republic of Korea
> 

-- 
behdad
http://behdad.org/


