[HarfBuzz] Hangul Shaper (was Re: an issue regarding discrepancy between Korean and Unicode standards)

Behdad Esfahbod behdad at behdad.org
Tue Dec 31 01:53:51 PST 2013


I've now pushed a Hangul shaper out to HarfBuzz master.  Here are the comments
explaining what it tries to do:

  /* Hangul syllables come in two shapes: LV, and LVT.  Of those:
   *
   *   - LV can be precomposed, or decomposed.  Let's call those
   *     <LV> and <L,V>,
   *   - LVT can be fully precomposed, partially precomposed, or
   *     fully decomposed.  I.e. <LVT>, <LV,T>, or <L,V,T>.
   *
   * The composition / decomposition is mechanical.  However, not
   * all <L,V> sequences compose, and not all <LV,T> sequences
   * compose.
   *
   * Here are the specifics:
   *
   *   - <L>: U+1100..115F, U+A960..A97F
   *   - <V>: U+1160..11A7, U+D7B0..D7C7
   *   - <T>: U+11A8..11FF, U+D7C8..D7FF
   *
   *   - Only the <L,V> sequences for the 11xx ranges combine.
   *   - Only <LV,T> sequences for T in U+11A8..11C2 combine.
   *
   * Here is what we want to accomplish in this shaper:
   *
   *   - If the whole syllable can be precomposed, do that,
   *   - Otherwise, fully decompose.
   *
   * That is, of the different possible syllables:
   *
   *   <L>
   *   <L,V>
   *   <L,V,T>
   *   <LV>
   *   <LVT>
   *   <LV, T>
   *
   * - <L> needs no work.
   *
   * - <LV> and <LVT> can stay the way they are if the font supports them,
   *   otherwise we should fully decompose them if font supports.
   *
   * - <L,V> and <L,V,T> we should compose if the whole thing can be composed.
   *
   * - <LV,T> we should compose if the whole thing can be composed, otherwise
   *   we should decompose.
   */
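
The "mechanical" composition and decomposition mentioned in the comment can
be sketched as follows.  This is only an illustration in Python, not the
HarfBuzz code: the constants are the ones from the Unicode Standard's Hangul
syllable algorithm, and the function names compose() and decompose() are made
up for this example.

```python
# Constants from the Unicode Standard's Hangul syllable algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28   # T_COUNT includes the "no T" slot
N_COUNT = V_COUNT * T_COUNT              # syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT              # total precomposed syllables


def compose(l, v, t=None):
    """Return the code point of <LV> or <LVT>, or None if it does not compose."""
    li, vi = l - L_BASE, v - V_BASE
    if not (0 <= li < L_COUNT and 0 <= vi < V_COUNT):
        return None                      # L or V outside the composable range
    ti = 0
    if t is not None:
        ti = t - T_BASE
        if not (1 <= ti < T_COUNT):
            return None                  # T outside the composable range
    return S_BASE + (li * V_COUNT + vi) * T_COUNT + ti


def decompose(s):
    """Return (L, V) or (L, V, T) for a precomposed syllable, else None."""
    si = s - S_BASE
    if not (0 <= si < S_COUNT):
        return None
    l = L_BASE + si // N_COUNT
    v = V_BASE + (si % N_COUNT) // T_COUNT
    ti = si % T_COUNT
    return (l, v) if ti == 0 else (l, v, T_BASE + ti)
```

For example, compose(0x1100, 0x1161, 0x11A8) gives 0xAC01, while an L taken
from the U+A960..A97F range makes compose() return None, which is the "not
all sequences compose" case.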

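The "compose if the whole syllable can be precomposed, otherwise fully
decompose" policy could be modelled roughly like this.  It is a sketch under
assumptions, not what the shaper actually does internally: it leans on
Python's unicodedata normalization for the Hangul arithmetic, it expects the
input to be a single syllable, and font_has is a hypothetical callback that
says whether the font has a glyph for a given character.

```python
import unicodedata


def shape_syllable(text, font_has):
    """Pick the precomposed form if possible, else the fully decomposed one.

    text     -- one Hangul syllable in any of the forms <L>, <L,V>, <L,V,T>,
                <LV>, <LVT>, <LV,T>
    font_has -- hypothetical predicate: font_has(ch) -> bool
    """
    # NFC composes <L,V>, <L,V,T> and <LV,T> whenever Unicode allows it.
    composed = unicodedata.normalize("NFC", text)
    if len(composed) == 1 and font_has(composed):
        return composed

    # Otherwise fall back to the fully decomposed form, if the font has
    # every jamo in it.
    decomposed = unicodedata.normalize("NFD", text)
    if all(font_has(ch) for ch in decomposed):
        return decomposed

    # Nothing better is available; leave the input untouched.
    return text
```

With a font that has everything, both "\u1100\u1161\u11a8" and "\uac00\u11a8"
come out as "\uac01".  With a font that lacks the precomposed syllables,
"\uac01" comes out as "\u1100\u1161\u11a8".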

Please test.

behdad

On 13-04-18 09:44 AM, Dohyun Kim wrote:
> 2013/4/18 Dohyun Kim <nomosnomos at gmail.com>:
>> 2013/4/18 Behdad Esfahbod <behdad at behdad.org>:
>>> When are the OpenType features applied, after all those processes are done?
>>
>> If possible, please apply "ccmp" feature before all those processes.
> 
> On second thought, now I think it is more efficient and compliant with
> the Unicode standard to apply "ccmp" feature after decomposition of
> hangul syllables and before setting syllable boundaries.
> 
>> And "*jmo" features after all those processes.
>>
>>> Are the '*jmo' features applied to all glyphs?
>>
>> No. Only to those well-formed syllable blocks <M? L V T?>.
>>
>>>
>>> On 13-04-16 11:29 PM, Dohyun Kim wrote:
>>>> http://ktug.org/~nomos/harfbuzz-hangul/hangulshaper.pdf
>>>>
>>>> Regards,
>>>>
>>>> 2013/4/17 Behdad Esfahbod <behdad at behdad.org>:
>>>>> Ok, given how confusing this thread has become, please create a Google Doc,
>>>>> and write down what you think the HarfBuzz Hangul shaper should do.  Modify it
>>>>> as much as you want, but keep it as short as possible.  Please make the doc
>>>>> commentable by the public, and send the link here.
>>>>>
>>>>> Thanks,
>>>>> behdad
>>>>>
>>>>> On 13-04-16 10:10 AM, Dohyun Kim wrote:
>>>>>> 2013/4/16 Dohyun Kim <nomosnomos at gmail.com>:
>>>>>>> 2013/4/15 Dohyun Kim <nomosnomos at gmail.com>:
>>>>>>>>
>>>>>>>> The behavior of new Uniscribe is quite confusing and seems to be
>>>>>>>> inconsistent on some points.  I cannot describe concisely what it
>>>>>>>> does.  But it is evident that it renders correctly only those input
>>>>>>>> sequences which are compliant with KS X 1026-1.
>>>>>>>>
>>>>>>>
>>>>>>> OK.  My guess about the behavior of new Uniscribe:
>>>>>>>
>>>>>>> 1.  demarcate syllable blocks according to KS X 1026-1
>>>>>>>
>>>>>>>    - between L and L, V and V, T and T, or L and T (these are illegal strings)
>>>>>>>    - between V and L, T and L, or M and L (these are legal break points)
>>>>>>>    - between Jamo and non-Jamo character including Hangul syllables
>>>>>>>    - but not between L and V, V and T, T and M, V and M, LVT and M, LV and M.
>>>>>>
>>>>>> Oh, I have left out one stunning thing.  I really dislike this sort of behavior:
>>>>>>
>>>>>>    - The Jamo sequence of <L V T> is divided into <L | V | T>, if
>>>>>> equivalent <LVT> syllable exists.
>>>>>>    - Likewise, <L V> sequence is divided into <L | V>, if it is not
>>>>>> followed by T and equivalent <LV> syllable exists.
>>>>>>
>>>>>>>
>>>>>>> where LVT and LV are Hangul syllables; L, V, and T are Jamos; M means
>>>>>>> Hangul tone marks (U+302E or U+302F)
>>>>>>>
>>>>>>> 2.  reorder Hangul tone marks
>>>>>>>
>>>>>>>     - if syllable block is well-formed, move M from the last to the
>>>>>>> first of the cluster.
>>>>>>>     - if syllable is not well-formed, Uniscribe does not move M.
>>>>>>> Instead, U+25CC is inserted after M.
>>>>>>>
>>>>>>> where "well-formed" means <LVT>, <LV>, <L V T>, or <L V>.
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Dohyun Kim
>>>>>> College of Law, Dongguk University
>>>>>> Seoul, Republic of Korea
>>>>>>
>>>>>
>>>>> --
>>>>> behdad
>>>>> http://behdad.org/
>>>>
>>>>
>>>>
>>>
>>> --
>>> behdad
>>> http://behdad.org/
>>
>>
>>
>> --
>> Dohyun Kim
>> College of Law, Dongguk University
>> Seoul, Republic of Korea
> 
> 
> 
> --
> Dohyun Kim
> College of Law, Dongguk University
> Seoul, Republic of Korea
> 
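
The two steps guessed at in the quoted thread (syllable-block demarcation and
tone-mark reordering) could be modelled roughly as below.  This is only a
sketch of the quoted description, not Uniscribe's or HarfBuzz's code.  The
names jamo_class, split_blocks and reorder_tone_mark are made up, the jamo
ranges are the ones from the shaper comment, any pair the description does
not list is assumed to be a break, and the extra <L | V | T> splitting quirk
is not modelled.

```python
def jamo_class(ch):
    """Classify a character as L, V, T, M, LV, LVT, or X (anything else)."""
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F or 0xA960 <= cp <= 0xA97F:
        return "L"
    if 0x1160 <= cp <= 0x11A7 or 0xD7B0 <= cp <= 0xD7C7:
        return "V"
    if 0x11A8 <= cp <= 0x11FF or 0xD7C8 <= cp <= 0xD7FF:
        return "T"
    if cp in (0x302E, 0x302F):
        return "M"                      # Hangul tone marks
    if 0xAC00 <= cp <= 0xD7A3:          # precomposed syllables
        return "LV" if (cp - 0xAC00) % 28 == 0 else "LVT"
    return "X"


# Pairs that must stay together; every other pair is treated as a break
# (an assumption for the pairs the description leaves unspecified).
NO_BREAK = {("L", "V"), ("V", "T"), ("T", "M"),
            ("V", "M"), ("LVT", "M"), ("LV", "M")}

WELL_FORMED = {("LVT",), ("LV",), ("L", "V", "T"), ("L", "V")}


def split_blocks(text):
    """Split text into syllable blocks using the NO_BREAK pairs."""
    blocks = []
    for ch in text:
        if blocks and (jamo_class(blocks[-1][-1]), jamo_class(ch)) in NO_BREAK:
            blocks[-1] += ch
        else:
            blocks.append(ch)
    return blocks


def reorder_tone_mark(block):
    """Move a trailing tone mark to the front of a well-formed block."""
    if not block or jamo_class(block[-1]) != "M":
        return block
    body, mark = block[:-1], block[-1]
    if tuple(jamo_class(c) for c in body) in WELL_FORMED:
        return mark + body
    # Not well-formed: keep M in place and add a dotted circle after it.
    return body + mark + "\u25cc"
```

For instance, split_blocks("\u1100\u1161\u11a8\u302e\u1100\u1161") yields two
blocks, and reorder_tone_mark() on the first one moves U+302E to the front.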

-- 
behdad
http://behdad.org/

