[HarfBuzz] an issue regarding discrepancy between Korean and Unicode standards

Dohyun Kim nomosnomos at gmail.com
Sat Apr 6 10:32:16 PDT 2013


2013/4/6 Behdad Esfahbod <behdad at behdad.org>:
> On 13-04-05 06:45 AM, Dohyun Kim wrote:
>> 2013/4/5 Dohyun Kim <nomosnomos at gmail.com>:
>>> Sorry for the noise.
>>> I have booted on Windows machine and tested uniscribe a bit.  My guess
>>> on how uniscribe works on Hangul is:
>>>
>>> 1. decompose hangul syllables to jamos
>>>
>>> 2. compose single jamos into composite jamos wherever possible
>>>     eg., U+1100 U+1100 => U+1101
>>>     Note:  mapping table for this composition is available at
>>>       ftp://ktug.org/ktug/hcr-lvt/composejamotojamo.map
>>>
>>
>> Well, after a bit more testing, it turned out that this second step
>> is not what uniscribe does.  Sorry for my wrong information.  I had
>> guessed this on the basis of the old unicode standard.  The current
>> unicode standard no longer recommends using multiple single jamos to
>> get a composite jamo.
>>
>> Instead, uniscribe inserts fillers (U+115F U+1160) around a lone
>> single jamo which does not make up a syllable block.
>
> Interesting.  So, for a lone T jamo, both 115F and 1160 are inserted?

Yes, when fillers are inserted.  But actually uniscribe does not seem
to insert fillers.  Sorry for my premature conclusion.  Today I
downloaded the harfbuzz win32 binary and tested some jamo texts using
hb-shape.  This utility gave me more accurate information than I could
obtain with the naked eye.  Contrary to my expectation, the output of
hb-shape showed no trace of fillers.  So it seems evident that
uniscribe does not insert fillers.  It also seems evident that
uniscribe sets boundaries between syllable blocks, so that multiple
single jamos cannot be concatenated into a composite jamo.

Let us suppose an input text <U+1100 U+AC00 U+11F0>.  My guess at what
uniscribe does:

1. decompose syllables to jamos: we get <U+1100 U+1100 U+1161 U+11F0>

2. demarcate each syllable block by setting boundaries in-between: we
get <U+1100 | U+1100 U+1161 U+11F0> where | means syllable boundary.
Probably this is related to the so-called "cluster."  Yesterday I
mistook this boundary (maybe ZWNJ but I am not sure) for a filler.
BTW, according to the old standard, U+1100 U+1100 are concatenated to
U+1101, so the result would be a single syllable block <U+1101 U+1161
U+11F0>.  Nowadays we do not need this jamo-to-jamo composition,
because all the jamos known to date have been registered since
unicode version 5.2.

3. try to re-compose jamos into syllable letters.  But as our sample
text matches the case of <L V OT>, nothing is converted.

4. apply font features: we get <U+1100 | U+1100.s U+1161.s U+11F0.s>
where ".s" means substituted glyph.
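
To make steps 1 to 3 concrete, here is a minimal Python sketch of how I
imagine them.  It is only my guess put into code, not what uniscribe or
harfbuzz actually implements.  The function names (decompose, segment,
recompose, prepare) are made up for illustration, and the boundary rule
(a block is at most one L, then one V, then one T) is an assumption
taken from the example above.  Step 4 (font features) is left out.

```python
# Constants from the Unicode Hangul syllable composition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT
S_COUNT = L_COUNT * N_COUNT


def jamo_rank(cp):
    """Return 0 for leading, 1 for vowel, 2 for trailing jamo, else None."""
    if 0x1100 <= cp <= 0x115F or 0xA960 <= cp <= 0xA97C:
        return 0
    if 0x1160 <= cp <= 0x11A7 or 0xD7B0 <= cp <= 0xD7C6:
        return 1
    if 0x11A8 <= cp <= 0x11FF or 0xD7CB <= cp <= 0xD7FB:
        return 2
    return None


def decompose(cps):
    """Step 1: split precomposed syllables into L V (T) jamos."""
    out = []
    for cp in cps:
        s_index = cp - S_BASE
        if 0 <= s_index < S_COUNT:
            out.append(L_BASE + s_index // N_COUNT)
            out.append(V_BASE + (s_index % N_COUNT) // T_COUNT)
            if s_index % T_COUNT:
                out.append(T_BASE + s_index % T_COUNT)
        else:
            out.append(cp)
    return out


def segment(cps):
    """Step 2 (assumed rule): a block is at most one L, one V, one T."""
    blocks, prev = [], None
    for cp in cps:
        rank = jamo_rank(cp)
        if blocks and rank is not None and prev is not None and rank == prev + 1:
            blocks[-1].append(cp)
        else:
            blocks.append([cp])
        prev = rank
    return blocks


def recompose(block):
    """Step 3: compose only modern <L V> or <L V T>; leave <L V OT> alone."""
    if len(block) not in (2, 3):
        return block
    l_index = block[0] - L_BASE
    v_index = block[1] - V_BASE
    t_index = block[2] - T_BASE if len(block) == 3 else 0
    if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT):
        return block
    if len(block) == 3 and not (0 < t_index < T_COUNT):
        return block  # old trailing consonant: nothing is converted
    return [S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index]


def prepare(cps):
    """Run steps 1 to 3 and return the list of syllable blocks."""
    return [recompose(b) for b in segment(decompose(cps))]


blocks = prepare([0x1100, 0xAC00, 0x11F0])
print([[f"U+{cp:04X}" for cp in b] for b in blocks])
# [['U+1100'], ['U+1100', 'U+1161', 'U+11F0']]
```

With the sample input the two U+1100 stay in separate blocks and the
<L V OT> block is not recomposed, which matches what I described.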

As I said before, we Koreans do not input text like <U+AC00 U+11F0> in
practice.  However, there remains some possibility that some
applications or libraries pass unicode-normalized text to harfbuzz, in
which case harfbuzz would give an incorrect result.  So I have
changed my mind, and I now suggest implementing a hangul shaper.  It
is not an urgent task, though;  harfbuzz works quite well already.
However, we want harfbuzz to be as perfect as possible.
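
For reference, a small Python sketch of how such normalized text can
arise.  It assumes the application applies NFC through Python's
standard unicodedata module;  the helper name code_points is made up
for illustration.  A user types the old-hangul syllable as three jamos,
and NFC composes only the <L V> part because U+11F0 is outside the
modern trailing consonant range.

```python
import unicodedata


def code_points(text):
    """Format a string as a list of U+XXXX labels (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Jamo sequence <L V OT> as a Korean user would normally enter it.
typed = "\u1100\u1161\u11f0"

# NFC composes L + V into U+AC00 but cannot absorb the old trailing jamo.
normalized = unicodedata.normalize("NFC", typed)

print(code_points(typed))       # ['U+1100', 'U+1161', 'U+11F0']
print(code_points(normalized))  # ['U+AC00', 'U+11F0']
```

So <U+AC00 U+11F0> can reach the shaper even though nobody typed it.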

Regards,


>>> 3. compose jamos into hangul syllables wherever possible
>>>    Note:  this process complies with KS X 1026-1.  In other words, jamo
>>> sequence <L V> in <L V OT> is *not* converted to LV, where L means
>>> leading consonant, V means medial vowel, OT means *old* trailing
>>> consonant (U+11C3..U+11FF U+D7CB..U+D7FB), and LV means Hangul
>>> syllable equivalent to L V.
>>>
>>> 4. apply opentype layout features
>>>
>>> It is somewhat complicated but gives a perfect result.  It satisfies
>>> both the Korean and Unicode standards.  Nevertheless, what current
>>> harfbuzz does is quite excellent as well and I am satisfied with it.  I
>>> am reporting just for reference.
>>>


