[HarfBuzz] Hangul Shaper (was Re: an issue regarding discrepancy between Korean and Unicode standards)

Fri Apr 12 10:03:51 PDT 2013

Please ignore my previous mail.  The latest version of Uniscribe does not
work that way.

I was using a rather outdated version of Uniscribe until yesterday.  Today
I finally had a chance to use a Windows 8 machine for a while.  In short,
Uniscribe in Windows 8 follows KS X 1026-1 exclusively; the Unicode spec
has been set aside.  Personally I don't like it.  The reordering of Hangul
tone marks was especially striking.

Attached is a sample Hangul text file.  Some lines are well-formed;
others contain malformed text.

2013/4/12 Behdad Esfahbod <behdad at behdad.org>:
> Ok, I'm more confused now :).  I'll find some time to put something together
> and take it from there.  In the meantime, if you can compile a list of
> sequences that would test all the corner cases you can think of, that would
> immensely help with the implementation.
>
> Thanks,
> b
>
> On 13-04-10 02:45 AM, Dohyun Kim wrote:
>> 2013/4/10 Dohyun Kim <nomosnomos at gmail.com>:
>>> 2013/4/10 Behdad Esfahbod <behdad at behdad.org>:
>>>> Hi,
>>>>
>>>> Ok, what you describe sounds very close to the OpenType spec:
>>>>
>>>>   http://www.microsoft.com/typography/otfntdev/hangulot/
>>>>
>>>> and what the ICU Layout Hangul shaper does.
>>>>
>>>> The one part I don't understand is the section "Compose Old Hangul Jamo
>>>> combinations" under:
>>>>
>>>>   http://www.microsoft.com/typography/otfntdev/hangulot/shaping.htm
>>>>
>>>> I can't make sense of that part, since Appendix B does not list what the jamos
>>>> compose to.
>>>>
>>>> Please review those documents and share any insights you may have.  I'll go
>>>> ahead with implementing a shaper then.
>>>>
>>>
>>> This Hangul OpenType spec from Microsoft is quite outdated.  It was
>>> written in 2003, ten years ago.  In the meantime, KS X 1026-1 and
>>> Unicode 5.2 were released in 2007 and 2009 respectively.  Unicode 5.2
>>> assigned code points to a number of new jamos:
>>> U+115A..U+115E, U+11A3..U+11A7, U+11FA..U+11FF, U+A960..U+A97C,
>>> U+D7B0..U+D7C6, and U+D7CB..U+D7FB.  Consequently, the items in
>>> Appendix B that you pointed out now all have their own Unicode code
>>> points.  For instance, <U+1102 U+1109> has now become <U+115B>.
>>> Before Unicode 5.2, Koreans had no choice but to write <U+1102
>>> U+1109> to represent the composite jamo composed of Choseong Nieun
>>> and Choseong Sios.  That is now a thing of the past.  Anyway, you can
>>> find the full list of composite jamos, with the elements composing
>>> them, at ftp://ktug.org/ktug/hcr-lvt/composejamotojamo.map, which I
>>> cited before as a reference.
>>>
>>> Moreover, the Microsoft spec has incorrect information on several
>>> points.  The section "Compose Old Hangul Jamo combinations" is one of
>>> them.  This kind of jamo composition could not be done at the pre-OTLS
>>> stage before Unicode 5.2 was introduced, as there were no code points
>>> for composed jamos at that time.  Jamo-to-jamo composition could be
>>> done only at the stage of applying the "ccmp" font feature.  Now that
>>> all composite jamos are registered in Unicode, a shaping engine can do
>>> this composition before applying font features.  However, this kind of
>>> composition is contrary to KS X 1026-1.  Section 5.3 of that spec says
>>> that "two or more code positions of simple letters cannot be
>>> concatenated to represent a single complex letter."  Certainly, this
>>> concatenation is allowed by the Unicode standard, though it has not
>>> been recommended since version 5.2.  Yes, we have just encountered
>>> another discrepancy between the local and global standards.  But in
>>> practice, Koreans no longer input decomposed jamos to represent a
>>> single composite jamo.  Above all, my experiment on a Windows machine
>>> showed that a recent version of Uniscribe does not compose jamo
>>> elements into a composite jamo, even for those jamos which were not
>>> available before Unicode 5.2.  So I think it is better for us to
>>> ignore the section "Compose Old Hangul Jamo combinations" and its
>>> Appendix B altogether.
>>>
>>> Instead, Uniscribe sets boundaries between syllable blocks as I
>>> mentioned before.  Since all the single and composite jamos now
>>> have their own code points, the rule to identify syllable blocks is
>>> quite simple:
>>>     L V T? M?
>>
>> Today I tested Uniscribe again.  It turns out that Uniscribe
>> does not simply apply this rule to identify syllable blocks.  When a
>> jamo sequence is a candidate for composition into a composite jamo newly
>> added in Unicode 5.2, Uniscribe treats it as a single jamo, though
>> it does *not* actually compose the sequence into the composite jamo.  As
>> this may be a little confusing, let us take some examples.  For each
>> input text on the left, Uniscribe sets boundaries as shown on the right:
>>
>> <U+1100 U+1100 U+1161> => <U+1100 | U+1100 U+1161> => <U+1100 | U+AC00>
>>
>> <U+1100 U+1100> is a sequence which can be concatenated to <U+1101>.
>> However, Uniscribe divides them into two syllable blocks, because
>> U+1101 has been in Unicode since its earliest versions.
>>
>> <U+1103 U+1106 U+1161> => <U+1103 U+1106 U+1161>
>>
>> <U+1103 U+1106> can be concatenated to <U+A960>, a jamo newly registered
>> in Unicode 5.2.  In this case Uniscribe treats them as
>> a single composite jamo and so does not set a boundary between U+1103
>> and U+1106.  Notice that Uniscribe does not actually compose these
>> element jamos to U+A960; it just lets the font features do their job.
>>
>> <U+1100 U+1161 U+11AB U+11AB> => <U+1100 U+1161 U+11AB U+11AB>
>>
>> In a similar fashion, as <U+11AB U+11AB> can be concatenated to
>> <U+11FF>, which is a newly added jamo, Uniscribe does not divide
>> syllable blocks in between.
>>
>> This policy of Uniscribe seems a little complicated.  But it
>> is quite reasonable, as it also supports old documents
>> written before Unicode 5.2 was introduced, ensuring backward
>> compatibility.
>>
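A rough Python sketch of the boundary behaviour described in the quoted
message just above may help when writing test cases.  It is only an
illustration of that description, not Uniscribe's actual algorithm.  The
function and table names are made up for this sketch, and the pair tables
hold only the sequences mentioned in this thread; the full mapping in
composejamotojamo.map is not reproduced here.  How a candidate pair with no
following vowel is treated is an assumption.

```python
# Pairs that Unicode 5.2 encoded as single composite jamos (thread examples only).
NEW_L_PAIRS = {(0x1102, 0x1109), (0x1103, 0x1106)}   # -> U+115B, U+A960
NEW_T_PAIRS = {(0x11AB, 0x11AB)}                     # -> U+11FF
TONE_MARKS = {0x302E, 0x302F}

def is_l(cp):  # leading consonants, including the Choseong filler U+115F
    return 0x1100 <= cp <= 0x115F or 0xA960 <= cp <= 0xA97C

def is_v(cp):  # medial vowels, including the Jungseong filler U+1160
    return 0x1160 <= cp <= 0x11A7 or 0xD7B0 <= cp <= 0xD7C6

def is_t(cp):  # trailing consonants
    return 0x11A8 <= cp <= 0x11FF or 0xD7CB <= cp <= 0xD7FB

def segment(cps):
    """Split a list of code points into syllable blocks (L V T? M?)."""
    blocks = []
    i, n = 0, len(cps)
    while i < n:
        j = i
        # Leading part: a candidate pair counts as one L if a vowel follows.
        if tuple(cps[j:j + 2]) in NEW_L_PAIRS and j + 2 < n and is_v(cps[j + 2]):
            j += 2
        elif is_l(cps[j]):
            j += 1
        if j > i and j < n and is_v(cps[j]):
            j += 1
            # Trailing part: a candidate pair counts as one T.
            if tuple(cps[j:j + 2]) in NEW_T_PAIRS:
                j += 2
            elif j < n and is_t(cps[j]):
                j += 1
            if j < n and cps[j] in TONE_MARKS:
                j += 1
        else:
            j = i + 1  # no <L V> here: the code point forms its own block
        blocks.append(cps[i:j])
        i = j
    return blocks
```

With these tables, segment([0x1100, 0x1100, 0x1161]) splits after the first
U+1100, while the <U+1103 U+1106 U+1161> and <U+1100 U+1161 U+11AB U+11AB>
examples each stay in one block.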
>>
>>> where L is a leading consonant, including the Choseong filler; V is a
>>> medial vowel, including the Jungseong filler; T is a trailing consonant;
>>> M is a Hangul tone mark (U+302E, U+302F); and ? means zero or one
>>> occurrence of the preceding item.  Uniscribe seems to set boundaries
>>> before and after each such jamo sequence.  What is important is that
>>> Uniscribe composes jamos into a syllable only when the complete sequence
>>> <L V T?> matches a precomposed Hangul syllable.  In other words, <L V OT>
>>> is not composed, and Uniscribe passes the sequence intact to the OTLS
>>> process.
>>>
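The composition rule above (compose only when the complete <L V T?> matches
a precomposed syllable, and leave <L V OT> intact) can be sketched in Python
roughly as follows.  This is a minimal sketch, not a description of
Uniscribe internals.  The constants are the standard Unicode Hangul syllable
arithmetic; the old-trailing-consonant ranges are the ones given elsewhere
in this thread; the names recompose and is_old_t are made up here.  It only
covers the plain L V T? rule and does not handle sequences of single jamos
that stand for a composite jamo.

```python
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28

def is_old_t(cp):
    # "Old" trailing consonants: no precomposed syllable contains them.
    return 0x11C3 <= cp <= 0x11FF or 0xD7CB <= cp <= 0xD7FB

def recompose(cps):
    """Compose <L V T?> into a precomposed syllable only on a complete match."""
    out = []
    i, n = 0, len(cps)
    while i < n:
        cp = cps[i]
        if (L_BASE <= cp < L_BASE + L_COUNT and i + 1 < n
                and V_BASE <= cps[i + 1] < V_BASE + V_COUNT):
            nxt = cps[i + 2] if i + 2 < n else None
            if nxt is not None and is_old_t(nxt):
                # <L V OT>: pass the jamos through untouched.
                out.extend(cps[i:i + 3])
                i += 3
                continue
            t, used = 0, 2
            if nxt is not None and T_BASE < nxt < T_BASE + T_COUNT:
                t, used = nxt - T_BASE, 3   # modern trailing consonant
            l, v = cp - L_BASE, cps[i + 1] - V_BASE
            out.append(S_BASE + (l * V_COUNT + v) * T_COUNT + t)
            i += used
            continue
        out.append(cp)  # anything else (old L, old V, lone jamo) is kept as-is
        i += 1
    return out
```

For example, recompose([0x1100, 0x1161]) gives [0xAC00], while
recompose([0x1100, 0x1161, 0x11F0]) returns its input unchanged.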
>>> Thanks a lot for your effort to support Hangul.
>>> Best,
>>>
>>>>
>>>> On 13-04-06 01:32 PM, Dohyun Kim wrote:
>>>>> 2013/4/6 Behdad Esfahbod <behdad at behdad.org>:
>>>>>> On 13-04-05 06:45 AM, Dohyun Kim wrote:
>>>>>>> 2013/4/5 Dohyun Kim <nomosnomos at gmail.com>:
>>>>>>>> Sorry for the noise.
>>>>>>>> I have booted a Windows machine and tested Uniscribe a bit.  My guess
>>>>>>>> on how Uniscribe works on Hangul is:
>>>>>>>>
>>>>>>>> 1. decompose Hangul syllables to jamos
>>>>>>>>
>>>>>>>> 2. compose single jamos to composite jamos wherever possible
>>>>>>>>     e.g., U+1100 U+1100 => U+1101
>>>>>>>>     Note:  mapping table for this composition is available at
>>>>>>>>       ftp://ktug.org/ktug/hcr-lvt/composejamotojamo.map
>>>>>>>>
>>>>>>>
>>>>>>> Well, after a bit more testing, it turned out that this second step is
>>>>>>> not what Uniscribe does.  Sorry for the wrong information.  I had
>>>>>>> guessed this on the basis of the old Unicode standard.  Unicode itself
>>>>>>> no longer recommends using multiple single jamos to represent a
>>>>>>> composite jamo.
>>>>>>>
>>>>>>> Instead, Uniscribe inserts fillers (U+115F, U+1160) around a single
>>>>>>> lone jamo which does not make up a syllable block.
>>>>>>
>>>>>> Interesting.  So, for a lone T jamo, both 115F and 1160 are inserted?
>>>>>
>>>>> Yes, when fillers are inserted.  But actually Uniscribe does not seem
>>>>> to insert fillers.  Sorry for my premature conclusion.  Today I
>>>>> downloaded the harfbuzz win32 binary and tested some jamo texts using
>>>>> hb-shape.  This utility gave me more accurate information than I could
>>>>> obtain with the naked eye.  Contrary to my expectation, the output of
>>>>> hb-shape did not have any traces of fillers.  So it seems evident
>>>>> that Uniscribe does not insert fillers.  It also seems evident
>>>>> that Uniscribe sets boundaries between syllable blocks, so that
>>>>> multiple single jamos cannot be concatenated into a composite jamo.
>>>>>
>>>>> Let us suppose an input text <U+1100 U+AC00 U+11F0>.  My guess at what
>>>>> Uniscribe does:
>>>>>
>>>>> 1. decompose syllables to jamos: we get <U+1100 U+1100 U+1161 U+11F0>
>>>>>
>>>>> 2. demarcate each syllable block by setting boundaries in between: we
>>>>> get <U+1100 | U+1100 U+1161 U+11F0> where | means a syllable boundary.
>>>>> Probably this is related to the so-called "cluster."  Yesterday I
>>>>> mistook this boundary (maybe ZWNJ, but I am not sure) for a filler.
>>>>> BTW, according to the old standard, U+1100 U+1100 are concatenated to
>>>>> U+1101, so the result would be a single syllable block <U+1101 U+1161
>>>>> U+11F0>.  Nowadays we do not need this jamo-to-jamo composition,
>>>>> because all the jamos known today have been registered since
>>>>> Unicode version 5.2.
>>>>>
>>>>> 3. try to re-compose jamos into a syllable letter.  But as our sample
>>>>> text matches the case of <L V OT>, nothing is converted.
>>>>>
>>>>> 4. apply font features: we get <U+1100 | U+1100.s U+1161.s U+11F0.s>
>>>>> where ".s" means a substituted glyph.
>>>>>
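Step 1 of the list above, the decomposition of precomposed syllables into
jamos, follows the standard Unicode Hangul syllable arithmetic.  A small
Python sketch, assuming the input is a list of code points (the function
name decompose is just for illustration):

```python
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
S_COUNT = L_COUNT * V_COUNT * T_COUNT  # 11172 precomposed syllables

def decompose(cps):
    """Replace each precomposed Hangul syllable with its L, V and optional T jamos."""
    out = []
    for cp in cps:
        s = cp - S_BASE
        if 0 <= s < S_COUNT:
            out.append(L_BASE + s // (V_COUNT * T_COUNT))
            out.append(V_BASE + (s % (V_COUNT * T_COUNT)) // T_COUNT)
            t = s % T_COUNT
            if t:  # index 0 means "no trailing consonant"
                out.append(T_BASE + t)
        else:
            out.append(cp)  # not a precomposed syllable: keep unchanged
    return out

# The sample text from the list above.
jamos = decompose([0x1100, 0xAC00, 0x11F0])
print(" ".join(f"U+{cp:04X}" for cp in jamos))
```

For the sample text this yields the four code points given in step 1; the
boundary setting and feature application of steps 2 and 4 are not modelled
here.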
>>>>> As I said before, Koreans do not input text like <U+AC00 U+11F0> in
>>>>> practice.  However, there remains some possibility that some
>>>>> applications or libraries pass Unicode-normalized text to harfbuzz,
>>>>> in which case harfbuzz would give an incorrect result.  So I
>>>>> changed my mind, and now I suggest implementing a Hangul shaper.
>>>>> It is not an urgent task, though;  harfbuzz works quite well already.
>>>>> However, we want harfbuzz to be as perfect as possible.
>>>>>
>>>>> Regards,
>>>>>
>>>>>
>>>>>>>> 3. compose jamos to Hangul syllables wherever possible
>>>>>>>>    Note:  this process complies with KS X 1026-1.  In other words, the
>>>>>>>> jamo sequence <L V> in <L V OT> is *not* converted to LV, where L means
>>>>>>>> leading consonant, V means medial vowel, OT means *old* trailing
>>>>>>>> consonant (U+11C3..U+11FF, U+D7CB..U+D7FB), and LV means the Hangul
>>>>>>>> syllable equivalent to L V.
>>>>>>>>
>>>>>>>> 4. apply opentype layout features
>>>>>>>>
>>>>>>>> It is somewhat complicated but gives a perfect result.  It satisfies
>>>>>>>> both the Korean and Unicode standards.  Nevertheless, what current
>>>>>>>> harfbuzz does is quite excellent as well, and I am satisfied with it.
>>>>>>>> I am reporting this just for reference.
>>>>>>>>
>>>>>
>>>>
>>>> --
>>>> behdad
>>>> http://behdad.org/
>>>
>>>
>>>
>>> --
>>> Dohyun Kim
>>> College of Law, Dongguk University
>>> Seoul, Republic of Korea
>>
>>
>>
>> --
>> Dohyun Kim
>> College of Law, Dongguk University
>> Seoul, Republic of Korea
>>
>
> --
> behdad
> http://behdad.org/

-- 
Dohyun Kim
College of Law, Dongguk University
Seoul, Republic of Korea
-------------- next part --------------
[Attachment: sample Hangul text file; its contents were not preserved in the archive]