[HarfBuzz] hangul shaper patches

Mon Jan 20 08:02:33 PST 2014

And I found a minor bug:

--- hb-ot-shape-complex-hangul.cc.orig    2014-01-21 01:00:25.000000000 +0900
+++ hb-ot-shape-complex-hangul.cc    2014-01-21 00:57:44.000000000 +0900
@@ -214,7 +214,7 @@
     if (font->has_glyph (0x25cc))
     {
       hb_codepoint_t chars[2];
-      if (is_zero_width_char (font, u)) {
+      if (!is_zero_width_char (font, u)) {
         chars[0] = u;
         chars[1] = 0x25cc;
       } else {



2014/1/21 Dohyun Kim <nomosnomos at gmail.com>:
> Thank you so much, Jonathan.
> Your patches to the hangul shaper work really well.
>
> 2014/1/20 Jonathan Kew <jfkthame at googlemail.com>:
>> On 20/1/14 02:21, Roozbeh Pournader wrote:
>>>
>>> Jonathan,
>>>
>>> I was wondering if the new patches would have all the canonically
>>> equivalent character sequences rendered the same way. Microsoft people
>>> have said publicly that their Hangul shaper intentionally doesn't do that.
>>>
>>
>> The intention is that canonically equivalent sequences should render the
>> same. I'm aware that MS doesn't do this in certain cases, as mentioned:
>>
>>
>>>     (b) a handful of words where there's an <LV, T> sequence that
>>>     uniscribe doesn't support (it has no corresponding LVT syllable),
>>>     but we handle by decomposing to <L, V, T> and applying jamo features.
>>
>>
>> An example of this is <U+B4C0,U+11F0>, where uniscribe (using Malgun Gothic)
>> renders the two default, unshaped glyphs for U+B4C0 (an LV syllable) and
>> U+11F0 (a trailing jamo) separately, while harfbuzz decomposes U+B4C0 into
>> separate leading- and vowel-jamo glyphs and then applies ljmo/vjmo/tjmo
>> features so that the three jamos are properly composed into a single
>> syllable block.
>>
>> Thus, with harfbuzz the two sequences
>>   <U+B4C0,U+11F0>
>>   <U+1103,U+1172,U+11F0>
>> render the same. As I understand things, the Korean standard says the former
>> spelling should not be used, but IMO that cannot override the fact that the
>> Unicode standard defines them as canonically equivalent, so rendering them
>> identically is correct.
>>
>> What the patched harfbuzz still -doesn't- implement is shaping "spelled out"
>> versions of Old Hangul sequences with multiple L, V and/or T jamos. The old
>> MS Hangul spec gave an example where the leading jamo now encoded at U+A972
>> (CHOSEONG PIEUP-SIOS-THIEUTH) was encoded as the sequence
>> <U+1107,U+1109,U+1110> and then composed (and similarly for the V and T
>> jamos), so that a complete syllable was composed from a sequence of the form
>> <L, L, L, V, V, V, T, T, T>.
>>
>> I experimented with a patch that would support this, and the result looked
>> OK (to my un-Korean eyes) when using the UnBatang font (not so good with
>> Malgun Gothic). However, this is not canonically equivalent, and my
>> understanding is that with Unicode having added all the complex jamos, there
>> is no longer any real requirement or desire to support such sequences. So I
>> haven't included this.
>>
>
> I have just tested this kind of input string, and the result is a
> little disappointing:
> The input string <U+1107,U+1109,U+1110,U+1161> is not rendered well. The
> output of the current (patched) harfbuzz with the UnBatang font is
> [uni1121=0+1024|uniD0C0=2+1024], whereas the expected output is
> [uniA972.xxxx|uni1161.xxxx]
>
> The reason seems to be that we are currently applying the "ccmp"
> OpenType feature too late. If "ccmp" could be applied before the
> hangul shaper runs, the issue would disappear.
>
> Best Regards,
> --
> Dohyun Kim
> Seoul, Republic of Korea
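
For anyone following the <U+B4C0,U+11F0> example in the quoted message, here
is a minimal sketch of the Hangul syllable arithmetic defined in the Unicode
Standard. It is not the code in hb-ot-shape-complex-hangul.cc; the names
Jamos, decompose_syllable and has_precomposed_lvt are made up for
illustration. It only shows why U+B4C0 is canonically equivalent to
<U+1103,U+1172>, and why U+11F0 has no precomposed LVT form to fall back on.

```cpp
#include <cstdint>

// Constants from the Unicode Standard's Hangul syllable arithmetic.
const std::uint32_t SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7;
const std::uint32_t LCount = 19, VCount = 21, TCount = 28;
const std::uint32_t NCount = VCount * TCount;  // 588 syllables per leading jamo
const std::uint32_t SCount = LCount * NCount;  // 11172 precomposed syllables

// Hypothetical helper type: t == 0 means "no trailing jamo" (an LV syllable);
// l == 0 means the input was not a precomposed Hangul syllable.
struct Jamos { std::uint32_t l, v, t; };

// Canonical decomposition of a precomposed syllable into its jamos.
Jamos decompose_syllable(std::uint32_t s)
{
  Jamos j = {0, 0, 0};
  if (s < SBase || s >= SBase + SCount)
    return j;
  std::uint32_t index = s - SBase;
  std::uint32_t tindex = index % TCount;
  j.l = LBase + index / NCount;
  j.v = VBase + (index % NCount) / TCount;
  j.t = tindex ? TBase + tindex : 0;
  return j;
}

// True only for the 27 trailing jamos (U+11A8..U+11C2) that have
// precomposed LVT syllables; Old Hangul trailing jamos such as U+11F0 do not.
bool has_precomposed_lvt(std::uint32_t t)
{
  return t > TBase && t < TBase + TCount;
}
```

With these definitions, decompose_syllable(0xB4C0) yields l = U+1103 and
v = U+1172 with no trailing jamo, and has_precomposed_lvt(0x11F0) is false,
so the only way to form a single syllable block is to shape the three jamos
<U+1103,U+1172,U+11F0> with the ljmo/vjmo/tjmo features.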



-- 
Dohyun Kim
Seoul, Republic of Korea