[HarfBuzz] Hangul GSUB features

Sat Jan 25 07:34:57 PST 2014

On 24/1/14 19:26, mskala at ansuz.sooke.bc.ca wrote:
> Hi, I'm the maintainer of the Jieubsida fonts.  Dohyun Kim kindly drew my
> attention to the recent discussion on this list of changes to HarfBuzz's
> hangul support and how it relates to these fonts, and I wanted to make some
> comments and ask some questions.  This is a lengthy message, but I'm trying
> to be very specific about the details, because those are important.

Hi Matthew - Thanks for your message, and for working through the 
details so carefully. I'll try to respond and clarify where I can...

> These fonts are intended to be able to typeset the full range of hangul
> defined in Unicode - including both the precomposed syllable code points and
> the (basic and extended) individual jamo.  So I want to be able to
> typeset all these code point sequences, and typeset them identically, using
> a single glyph that is a precomposed syllable:
>
>     1. U+1100 U+1161 U+11B7 (choseong-kiyeok jungseong-a jongseong-mieum)
>     2. U+AC00 U+11B7        (syllable-ga jongseong-mieum)
>     3. U+AC10               (syllable-gam)
>
> I'm not an expert on Unicode canonical equivalence, but I believe these
> three sequences are canonically equivalent to each other under the rules
> in sections 3.7 and 3.12 of the current Unicode standard
> (http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf).  Sequence 1 is the
> canonical decomposition of all three.  If I'm reading the discussion of the
> last few days correctly, it sounds like we're all more or less in agreement
> on that.

Yes. These are defined to be canonically equivalent (now and forever, as 
Unicode stability policies prohibit any change), and therefore I believe 
it is appropriate for all three to be rendered identically.
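
As a quick check of that equivalence, here is a minimal Python sketch using the standard unicodedata module. The variable names are only illustrative; nothing here comes from HarfBuzz or the fonts.

```python
import unicodedata

# The three spellings from the quoted message.
seq1 = "\u1100\u1161\u11B7"   # (1) L + V + T jamo
seq2 = "\uAC00\u11B7"         # (2) LV syllable + T jamo
seq3 = "\uAC10"               # (3) precomposed LVT syllable

# NFD is the full canonical decomposition; NFC is the canonical composition.
nfd_forms = {unicodedata.normalize("NFD", s) for s in (seq1, seq2, seq3)}
nfc_forms = {unicodedata.normalize("NFC", s) for s in (seq1, seq2, seq3)}

# Canonically equivalent strings collapse to a single form under each.
print(nfd_forms == {seq1})
print(nfc_forms == {seq3})
```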

(Incidentally, IIRC the "semi-composed" version (2) above does not 
currently work in Windows/Uniscribe. I consider that a defect, and am 
glad to note that harfbuzz does handle it correctly.)

>
> I would also like to be able to typeset the extended compound jamo as nicely
> as possible.  For instance, I would like these two sequences to both be
> typeset with a single glyph that is a precomposed lead jamo cluster, to be
> overlaid with additional glyphs for subsequent code points that would
> describe the vowel and tail of the syllable:
>
>     4. U+1107 U+1109 U+1110 (choseong-pieup choseong-sios choseong-thieuth)
>     5. U+A972               (choseong-pieup-sios-thieuth)
>
> Exactly which glyph is used for these two sequences should be
> context-sensitive, determined by the following vowel and presence or absence
> of a tail.  It looks to me like these may not be canonically equivalent
> under Unicode; U+A972 does not canonically decompose, and I don't think
> there is such a thing as canonical composition of jamo.  Nonetheless it
> certainly appears that they should be understood as the same text,
> describing the same fragment of a syllable.

This is a trickier area. As you note, these two sequences are *not* 
equivalent from a Unicode point of view, even though they "obviously" 
(to a human) describe the same fragment of text.

>
> On Mon Jan 20, Jonathan Kew writes:
>> Is this actually important? Note that Windows behaves similarly, and so
>> data that has "spelled-out" representations of complex jamos won't work
>> there either. AIUI, the recommended practice is to use the precomposed
>> Unicode characters such as U+A972 directly - and because these do *not*
>> have decompositions, mixing the two forms will lead to confusion and
>> problems for users. Perhaps it's better that the non-preferred spelling
>> does not render "correctly".
>
> Even if it's rare or discouraged for anyone to attempt to typeset sequences
> like number 4 above, and even if Windows is broken, I would prefer that such
> sequences should render correctly with my fonts and HarfBuzz.

If the sequences (4) and (5) were canonically equivalent, I would of 
course agree wholeheartedly with this (it would be in the same category 
as (1)-(3) above). However, in Unicode terms they are not equivalent; 
moreover, the relevant Korean standard, at least, makes it clear that 
(5) is to be regarded as correct, and (4) should not be used.

Because (4) and (5) are not canonically equivalent, they will not 
*function* as equivalents in general-purpose Unicode-based software, 
even when that software is careful to respect Unicode rules for 
equivalence (e.g. by normalizing text prior to operations such as 
search, indexing, etc.). Searching a document for a syllable that 
contains U+A972 will fail to find that "same" syllable if it was spelled 
using U+1107 U+1109 U+1110.

And because these sequences are not equivalent, and will not be folded 
together during normalization or other Unicode-aware operations, I think 
we're actually doing users a *disservice* and hurting the reusability of 
data if we force them to display the same. This will mislead users into 
expecting interoperable behavior that will not actually work.
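
The search failure described above can be reproduced with a short Python sketch. The helper normalized_find is a hypothetical stand-in for a search that normalizes both sides first; it is not taken from any particular application.

```python
import unicodedata

spelled_out = "\u1107\u1109\u1110"  # sequence (4): three simple lead jamo
precomposed = "\uA972"              # sequence (5): one complex lead jamo

# No normalization form maps one spelling onto the other.
forms = ("NFC", "NFD", "NFKC", "NFKD")
equal_in_some_form = any(
    unicodedata.normalize(f, spelled_out) == unicodedata.normalize(f, precomposed)
    for f in forms
)

def normalized_find(haystack, needle, form="NFD"):
    """Hypothetical search helper: normalize both sides, then do a plain find."""
    return unicodedata.normalize(form, haystack).find(
        unicodedata.normalize(form, needle)
    )

# Add the vowel U+1161 so that both spellings look like the "same" syllable.
document = spelled_out + "\u1161"
query = precomposed + "\u1161"

print(equal_in_some_form)                 # the spellings never normalize together
print(normalized_find(document, query))   # -1 means "not found"
```

By contrast, the same helper does find sequence (1) inside text spelled as sequence (3), because those two are canonically equivalent.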

>
> The way the Jieubsida fonts are currently intended to work is that after the
> cmap table translates code points into a stream of glyphs, the glyph
> stream goes through the ccmp, ljmo, vjmo, and liga tables in that order.
>
> In ccmp, the glyphs representing precomposed syllables like U+AC00 and
> U+AC10 are split into their component jamo, and the glyphs representing
> individual jamo are joined into glyphs representing clusters, where
> possible.  Note that these tables of course operate on glyphs, not code
> points - which becomes important later, when there are multiple glyphs for
> the same nominal code point.  Although this isn't a deliberate design
> feature, I think this table's effect is very similar to Unicode
> canonicalization.  After this table, my code point sequences 1, 2, and 3
> should all translate to glyph sequences "uni1100 uni1161 uni11B7" and 4 and
> 5 to the single glyph "uniA972".
>
> In the ljmo table, glyphs for lead (choseong) jamo are substituted depending
> on the shape of the vowel (jungseong) and whether there is a tail
> (jongseong) jamo.  In the case of "uni1100 uni1161 uni11B7", the vowel is in
> the "vertical" class and there is a tail, so the table selects the "layout
> 1" variant and the glyph sequence becomes "uni1100.l1 uni1161 uni11B7".
>
> In the vjmo table, glyphs for the vowel may be substituted similarly.  In
> the particular case of "uni1100.l1 uni1161 uni11B7", the default glyph for
> U+1161 is correct for layout 1 and so there's no change.  If there were no
> tail, it would choose a different layout including a substitution for
> uni1161.
>
> Finally, in the liga table, any sequences for which precomposed glyphs exist
> are replaced by the precomposed glyphs.  Since there is a "uniAC10" glyph
> corresponding to the sequence "uni1100.l1 uni1161 uni11B7", it will be used.
> At this point all three of my sequences 1, 2, and 3 are typeset the way I
> want them and that's great.
>
> But some things to note:  if ljmo is not applied, then "uni1100" will not
> change to "uni1100.l1" and then liga will not substitute uniAC10, so all
> three sequences break.  If ccmp is not applied at all, then "uniAC00" will
> not change to "uni1100 uni1161", none of the subsequent tables will match,
> and sequence 2 breaks.  If ccmp is applied, but is not applied FIRST, then
> there again the other tables will not see the glyph sequences they're
> expecting, and again sequence 2 breaks.  If liga is not applied, then
> (assuming everything else happens as expected) we end up with "uni1100.l1
> uni1161 uni11B7" - typesetting the syllable in "layout 1" as if there were
> no precomposed glyph, which will look okay but not as good as the
> precomposed glyph should (because the precomposed glyph has a more
> finely-adjusted layout).
>
> My code point sequences 4 and 5 don't describe a full syllable, but if one
> constructs a full syllable by adding one or more vowel and possibly tail
> jamo, it will go through a similar process minus the precomposed-syllable
> substitution at the end, because I have no precomposed syllables starting
> with "pieup-sios-thieuth".  If ccmp runs and runs first, the result of the
> whole process should look okay.  If ccmp does not run, then sequence 5
> will result in good typesetting and sequence 4 won't; if ccmp runs but
> does not run first, then sequence 5 may also end up incorrect depending on
> the other jamo in the syllable.
>
> The scheme above does everything I want it to do, with the versions of the
> software I'm currently using.  With all due respect, it looks like you're
> about to change HarfBuzz so that my fonts will no longer work, to tell me
> that it's my own fault because I was doing it wrong all along, and to
> suggest a way for me to redesign my fonts at considerable effort that will,
> by design, not correctly handle all the cases the old one could correctly
> handle.  This doesn't sound good to me, and I hope a better resolution is
> possible.
>
> On Thu Jan 23, Jonathan Kew writes:
>> So I think this is a font error. The font is using ccmp to decompose the
>> syllable AC00 into L and V jamos, but then expecting the shaper to apply
>> *jmo features to the resulting glyphs. That doesn't work, because
>
> That is (as far as it goes) a correct description of what I expected the
> shaper to do.  It's also what current XeTeX [using an older HarfBuzz], older
> XeTeX [using ICU], and FontForge [using its own code] all seem to do if the
> appropriate features are turned on.  It's not clear whether the need to turn
> the appropriate features on is because those pieces of software don't
> support Korean at all, or because they do support Korean and are correctly
> not invoking the features under some rule I've been unaware of.  Until now I
> always thought it was because of a complete absence of support.
>
> Microsoft's documentation on ccmp at
>     https://www.microsoft.com/typography/otspec/features_ae.htm#ccmp
> says:
>
> # Tag: “ccmp”
> #
> # Friendly name: Glyph Composition/Decomposition
> #
> # Registered by: Microsoft
> #
> # Function: To minimize the number of glyph alternates, it is sometimes
> # desired to decompose a character into two glyphs. Additionally, it may be
> # preferable to compose two characters into a single glyph for better glyph
> # processing. This feature permits such composition/decomposition. The feature
> # should be processed as the first feature processed, and should be processed
> # only when it is called.
> #
> # Example: In Syriac, the character 0x0732 is a combining mark that has a dot
> # above AND a dot below the base character. To avoid multiple glyph variants
> # to fit all base glyphs, the character is decomposed into two glyphs...a dot
> # above and a dot below. These two glyphs can then be correctly placed using
> # GPOS. In Arabic it might be preferred to combine the shadda with fatha
> # (0x0651, 0x064E) into a ligature before processing shapes. This allows the
> # font vendor to do special handling of the mark combination when doing
> # further processing without requiring larger contextual rules.
> #
> # Recommended implementation: The ccmp table maps the character sequence to
> # its corresponding ligature (GSUB lookup type 4) or string of glyphs (GSUB
> # lookup type 2). When using GSUB lookup type 4, sequences that are made up of
> # larger number of glyphs must be placed before those that require fewer
> # glyphs.
> #
> # Application interface: For GIDs found in the ccmp coverage table, the
> # application passes the sequence of GIDs to the table, and gets back the GID
> # for the ligature, or GIDs for the multiple substitution.
> #
> # UI suggestion: This feature should be on by default.
> #
> # Script/language sensitivity: None.
> #
> # Feature interaction: This feature needs to be implemented prior to any other
> # feature.
>
> Note that it's not specific to any particular language, it's described as
> something that should always run, and it's described as running before any
> other feature.  Adobe's version of the specification says pretty much the
> same thing.  Microsoft's language-specific documentation for Korean at
>    https://www.microsoft.com/typography/OpenTypeDev/hangul/intro.htm
> also repeatedly describes ccmp as running before *jmo features, although it
> also uses language like "Apply feature 'ccmp' to preprocess any glyphs that
> require composition" which seems to imply that ccmp might not always run.
> It does not mention any possibility of the *jmo features not running.
>
> It's because of these documents, with checking against XeTeX and FontForge,
> that I've written the Jieubsida substitution features the way I have.  It
> sounds like HarfBuzz's intended architecture works something like
> this, which is significantly different from the "always run ccmp, ljmo,
> vjmo, and liga, in that order" my code currently expects:
>
>     * Some sort of composition or decomposition is applied at the level of
>       code points (not glyphs) to find syllable boundaries.  This operation
>       is not intended to handle sequences of single jamo joining to form
>       compound jamo such as my sequence 4 above.  The mapping at this stage
>       is part of the "shaper" and not specified by the font.
>     * The code points, and recognized syllables, are translated to glyphs by
>       cmap.  If precomposed glyphs exist, they are used directly; otherwise
>       the glyph stream consists of L, V, T triples (T allowed to be null),
>       with the expectation that clusters (more than one jamo in a single
>       L/V/T slot) were already combined in the input.

Yes (or a precomposed LV glyph may be used, if there was no following T 
with which the L and V may need to interact).

At this stage, individual L, V and T glyphs are tagged with the 
appropriate *jmo feature that is to be applied. Precomposed (LV, LVT) 
glyphs do not get any of the *jmo features.

>     * It is not clear to me whether the ccmp table is applied unconditionally
>       at this point, nor what the conditions for it are if it's conditional.

ccmp is applied unconditionally to all the glyphs (but remember that 
canonical composition or decomposition may have occurred already at the 
character level).

Note that if ccmp composes or decomposes glyphs, this will *not* affect 
which *jmo features are going to be applied; that was already decided by 
the shaper based on its analysis above. (The normal expectation is that 
a Hangul font should not actually have any need for ccmp.)

>     * Conditional on some assessment of the structure of the syllable
>       (perhaps the existence of a precomposed glyph?) the *jmo features may
>       be applied - presumably to the output of ccmp, if it was applied.

Yes - remembering that the decision as to which *jmo feature, if any, 
applies to a given glyph was made *before* ccmp, and knows nothing about 
any changes that happened there.

>     * It is not clear to me under what circumstances liga may be applied.

liga is always applied (although a Hangul font wouldn't usually be 
expected to need it). Also, note that liga is intended to be under user 
control; although it's enabled by default, authors may turn it off 
(directly, or as a side-effect of other styling). You probably don't 
want your basic Hangul support to break when ligatures are disabled.
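
To illustrate why a ccmp decomposition does not lead to the *jmo features firing, here is a toy Python model of the ordering described in the answers above. It is not HarfBuzz code: the glyph names, the ccmp rule, and the assumption that substituted glyphs inherit the tag of the glyph they replace are all simplifications for illustration, and the tagging here ignores syllable context.

```python
def jamo_feature(cp):
    """Return the *jmo tag for an individual jamo code point, else None."""
    if 0x1100 <= cp <= 0x115F or 0xA960 <= cp <= 0xA97C:
        return "ljmo"
    if 0x1160 <= cp <= 0x11A7 or 0xD7B0 <= cp <= 0xD7C6:
        return "vjmo"
    if 0x11A8 <= cp <= 0x11FF or 0xD7CB <= cp <= 0xD7FB:
        return "tjmo"
    return None  # precomposed syllables get no *jmo feature

def shape(text, ccmp_rules):
    # Step 1: cmap lookup plus the *jmo decision, made before any GSUB lookup.
    glyphs = [("uni%04X" % ord(ch), jamo_feature(ord(ch))) for ch in text]
    # Step 2: ccmp applies to every glyph, but the tags are not recomputed
    # (assumption: outputs inherit the tag of the glyph they replace).
    out = []
    for name, tag in glyphs:
        for new_name in ccmp_rules.get(name, [name]):
            out.append((new_name, tag))
    return out

# Hypothetical font rule: ccmp splits the syllable glyph into L and V glyphs.
rules = {"uniAC00": ["uni1100", "uni1161"]}

from_syllable = shape("\uAC00", rules)
from_jamo = shape("\u1100\u1161", rules)
print(from_syllable)  # same glyph names as below, but with no *jmo tags
print(from_jamo)
```

Both inputs end up as the same glyph names, but only the second carries the ljmo and vjmo tags, so lookups under those features would only see the second.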

>
> So my first real questions are:  what exactly does HarfBuzz intend to do?
> Is the above description correct as far as it goes, and if not, what would
> be a correct description?  What are the answers to the unknown points?
>
> What processing happens before code points change into glyphs?  Under what
> circumstances will ccmp be applied to the glyph stream?  Under what
> circumstances will *jmo be applied, and will the input to *jmo be the output
> of ccmp (should it be applied) or something else?  Under what circumstances
> will liga be applied?
>
> On a meta-level:  where (or if) HarfBuzz's intended design differs from what
> I think the standards require (such points as "ccmp always runs, and is
> always first"), am I reading the wrong standards?  Is HarfBuzz's behaviour
> based on an authority like a standard, stronger than the observed behaviour
> of other software such as Uniscribe?  Or if it's based on the observed
> behaviour of other software, which other software and why?  Are these points
> documented anywhere?

There's the Hangul shaping document at 
http://www.microsoft.com/typography/OpenTypeDev/hangul/intro.htm#features, 
but it's unclear and outdated in various respects.

In particular, it does not explicitly state whether the *jmo features 
are applied globally, or only to glyphs that the shaper identified as 
being in the correct place within a valid syllable. I believe the 
intended meaning (and observed Uniscribe behavior) is that these 
features are *selectively* applied to the individual glyphs only when 
they are found in an <L, V [, T]> sequence.

The ICU implementation, at least (and perhaps old HarfBuzz?), applied 
the *jmo features to L, V and T glyphs in a more general sequence of the 
form <L+, V+, T*>. This is why a "spelled-out" form such as your (4) 
above would have worked there; the ljmo feature was applied to all three 
L characters. However, it also means (AIUI) that given a sequence of 4, 
5, or even more Ls in succession, the ljmo feature will be applied to 
all of them, even to those that cannot be part of a valid syllable and 
would be better left in their original form.
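
The difference between the two matching rules can be sketched with regular expressions over jamo classes. This is a simplified Python illustration, not the logic of Uniscribe, ICU, or HarfBuzz; the function names are made up, and the class ranges are the Hangul Jamo, Extended-A, and Extended-B blocks.

```python
import re

def jamo_class(ch):
    """Classify a character as L, V, T, or x (anything else)."""
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F or 0xA960 <= cp <= 0xA97C:
        return "L"
    if 0x1160 <= cp <= 0x11A7 or 0xD7B0 <= cp <= 0xD7C6:
        return "V"
    if 0x11A8 <= cp <= 0x11FF or 0xD7CB <= cp <= 0xD7FB:
        return "T"
    return "x"

STRICT = re.compile(r"LVT?")    # <L, V [, T]>
LOOSE = re.compile(r"L+V+T*")   # <L+, V+, T*>

def tagged_positions(text, pattern):
    """Indices of characters that would receive a *jmo feature under pattern."""
    classes = "".join(jamo_class(ch) for ch in text)
    hit = set()
    for m in pattern.finditer(classes):
        hit.update(range(m.start(), m.end()))
    return sorted(hit)

seq4_plus_vowel = "\u1107\u1109\u1110\u1161"  # spelled-out lead cluster + V
seq5_plus_vowel = "\uA972\u1161"              # precomposed lead cluster + V

print(tagged_positions(seq4_plus_vowel, STRICT))
print(tagged_positions(seq4_plus_vowel, LOOSE))
print(tagged_positions(seq5_plus_vowel, STRICT))
```

Under the strict rule only the last L and the V of the spelled-out form are tagged, while the loose rule tags every L in the run, however long it is.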

>
> I would much prefer to have a clear description of what HarfBuzz is trying
> to do and why, over advice on what Mandeubsida should do.  I don't expect
> HarfBuzz's developers to alter their design to match what I think it should
> be, not even if I think the standards may mandate such an alteration, and
> I'm wary of altering my own design to suit a third-party package in
> preference to my own reading of the standards.  Nonetheless, it sounds like
> HarfBuzz developers do have some ideas regarding what I ought to do, and
> since I want my fonts to work with HarfBuzz, those ideas are worth
> thinking about.
>
> On Thu Jan 23, Jonathan Kew writes:
>> So the font is using the wrong strategy. It should be simplified to
>> remove the syllable decompositions from ccmp; that's handled by the
>> shaper itself. (And it doesn't need the liga feature to reassemble the
>> original syllables, either, as the shaper won't decompose them unless
>> actually necessary, e.g. to support an <LV, T> sequence.)
>
> If I'm understanding HarfBuzz's intended operation and this description
> correctly, my sequence 3 (a single precomposed syllable) will be recognized
> as a precomposed syllable, NOT decomposed, and will go directly through to
> the precomposed glyph; that's fine.

Yes.

> Sequences 2 (precomposed syllable plus
> a tail) and 1 (separate lead, vowel, and tail, one of each) will be
> recognized by the shaper (not by ccmp or liga) as adding up to a precomposed
> syllable.  It's not clear to me whether then HarfBuzz will attempt to run
> them through the *jmo features, but my guess is not - instead it will go
> directly to the uniAC10 precomposed glyph.  That's good too.

Right. Provided the precomposed character is supported by the font, it 
will be used (and no *jmo features applied).

> So far it
> sounds like I can get the desired behaviour just by removing the ccmp table,
> and the recombination mappings from the liga table.  Less code needed from
> me, still correct results, that's great.

I believe so, yes.

>
> With sequence 5 (a cluster of lead jamo expressed as a single code point),
> the desired behaviour is one glyph each for the cluster lead, the vowel, and
> the tail if any, with the lead and vowel substituted in a context-sensitive
> way depending on the shape of the vowel and presence or absence of a tail.
> That appears to be the case in which HarfBuzz will invoke *jmo features to
> choose the right context-sensitive glyphs; but it's not clear to me exactly
> what the input to these features will look like.  Presumably with
> documentation or experiments, I can figure that out.  I may be lucky enough
> to find that the current substitution tables will work unmodified.

The L, V and T jamos will each be mapped to its default glyph via the 
cmap, and the respective ljmo, vjmo and tjmo features will be applied to 
those.

Except that if you had a ccmp that broke the complex lead jamo into 
three separate glyphs, that will presumably have been applied already. I 
think the ljmo feature would then get applied to all three of the simple 
L glyphs, though I haven't double-checked this.

>
> With sequence 4 (multiple lead jamo expressed as single jamo code points,
> resulting in a single glyph for the cluster, chosen context-sensitively) it
> appears that HarfBuzz is not intended to support that case, and the strategy
> described above should not be expected to produce correct results with this
> code point sequence.

Right; this sequence is not currently intended to be supported. As 
discussed above, I am not convinced supporting this is a good thing 
overall, because of its non-equivalence to sequence (5), its 
incompatibility with Windows behavior, and its invalidity according to 
the relevant Korean standard.

> Note, also, that making the changes necessary to get
> correct behaviour from the new HarfBuzz in the more common cases, will
> apparently result in fonts that do not work on software (including earlier
> versions of HarfBuzz) where the current Jieubsida fonts do work, even in the
> more common cases.  These points are issues for me.

I believe you could make the fonts continue to work (in both old and new 
HarfBuzz, ICU, etc) by simply moving *all* your lookups into the ccmp 
feature, and ignoring the Hangul-specific *jmo features altogether. Then 
(AIUI) they'd be applied to all the text, just as you expected, and it 
would be entirely up to your (context-sensitive) lookups to decompose, 
choose forms, recompose, etc., as desired.

However, I don't actually recommend doing this; I think it's better for 
the long-term interests of the Korean user community, Korean data on the 
Web and elsewhere, etc., for everyone to conform to the current 
recommendation - as enshrined in Korean standards and implemented in 
Windows - that sequence (5) should be used, and not (4).

>
> On Thu Jan 23, Jonathan Kew also writes:
>> The font should *not* use the generic ccmp feature to
>> decompose it, unless it intends to do *everything* using generic global
>> features, not the hangul-specific features.
>
> Doing everything using generic global features may in fact be the best
> solution for me.  Inasmuch as an OpenType contextual substitution table is a
> finite-state transducer and such things are closed under composition, I can
> reduce the current sequence of four tables which I want to all be applied
> every time, to a sequence of fewer than four, maybe even just one table -
> the size of that table may explode, but I can generate it algorithmically.
> If I go this route, defining no *jmo tables, can I depend on ccmp and liga
> always being applied and always in that order?

Currently, at least in harfbuzz, ccmp and liga (and the *jmo features, 
when used) are all applied "together", with the order of lookups being 
their order in the font. This is the generic standard OpenType behavior 
(see "Features and Lookups", in 
http://www.microsoft.com/typography/otspec/chapter2.htm), and gives you 
as font designer control over how the lookups interact. Some shapers 
override this, and apply features individually (or in smaller groups), 
but we try to avoid doing so unless required for compatibility with 
Uniscribe behavior.
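
As a rough illustration of the difference, here is a Python sketch comparing the two ways of ordering lookups. The feature-to-lookup mapping is a hypothetical font, and the functions are only a model of the ordering, not of any shaper's actual code.

```python
# Hypothetical font: feature tag -> indices into the GSUB lookup list.
features = {
    "ccmp": [0, 5],
    "ljmo": [1],
    "vjmo": [2],
    "liga": [3, 4],
}

def together(enabled):
    """All enabled features in one stage: lookups run in lookup-list order."""
    return sorted({i for tag in enabled for i in features.get(tag, [])})

def staged(stages):
    """Features applied stage by stage; lookup order is sorted within a stage."""
    order = []
    for stage in stages:
        order.extend(together(stage))
    return order

all_tags = ["ccmp", "ljmo", "vjmo", "liga"]
print(together(all_tags))                   # font order decides
print(staged([[tag] for tag in all_tags]))  # feature order decides
print(together(["ccmp", "ljmo", "vjmo"]))   # liga switched off by the user
```

In the first model a ccmp lookup placed late in the font (index 5 here) runs after the liga lookups; in the staged model it would run before them.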

So yes, you can depend on ccmp being applied. You shouldn't actually be 
depending on liga for any of this, because it may be disabled due to 
user styling (e.g. in Firefox, at least, liga is disabled when 
letter-spacing is used), and that would not normally be expected to 
break basic script rendering.

> Is there some longer
> sequence of global tables I can depend on always being applied and always in
> a specific order?

Remember that you can have a whole sequence of lookups within a single 
feature; you don't need multiple features to achieve this.

> Will the "shaper", even in the absence of *jmo tables,
> perform some translations on the sequence of code points that I need to know
> about in building my substitution table(s)?

Yes; as described earlier, it will replace <L, V [, T]> and <LV, T> 
sequences with precomposed syllables where possible; and it will also 
decompose <LV, T> to <L, V, T> if a suitable <LVT> does not exist. 
However, I don't think this should matter to you, as your tables are 
presumably designed to support these equivalents anyway.
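
The character-level step can be sketched in Python using the standard Hangul syllable arithmetic from the Unicode standard. This is a simplified model, not the HarfBuzz implementation: preprocess and font_has are hypothetical names, and only the modern jamo that have precomposed syllables are handled.

```python
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28

def is_l(cp): return L_BASE <= cp < L_BASE + L_COUNT
def is_v(cp): return V_BASE <= cp < V_BASE + V_COUNT
def is_t(cp): return T_BASE < cp < T_BASE + T_COUNT

def is_lv(cp):
    in_block = S_BASE <= cp < S_BASE + L_COUNT * V_COUNT * T_COUNT
    return in_block and (cp - S_BASE) % T_COUNT == 0

def compose(l, v, t=None):
    s = S_BASE + ((l - L_BASE) * V_COUNT + (v - V_BASE)) * T_COUNT
    return s + (t - T_BASE if t is not None else 0)

def decompose_lv(s):
    index = (s - S_BASE) // T_COUNT
    return L_BASE + index // V_COUNT, V_BASE + index % V_COUNT

def preprocess(cps, font_has):
    """Compose where the font has the syllable; split <LV, T> where it lacks LVT."""
    out, i = [], 0
    while i < len(cps):
        cp = cps[i]
        nxt = cps[i + 1] if i + 1 < len(cps) else None
        nxt2 = cps[i + 2] if i + 2 < len(cps) else None
        if is_lv(cp) and nxt is not None and is_t(nxt):
            lvt = cp + (nxt - T_BASE)
            out.extend([lvt] if font_has(lvt) else [*decompose_lv(cp), nxt])
            i += 2
        elif (is_l(cp) and nxt is not None and is_v(nxt)
              and nxt2 is not None and is_t(nxt2)
              and font_has(compose(cp, nxt, nxt2))):
            out.append(compose(cp, nxt, nxt2))
            i += 3
        elif (is_l(cp) and nxt is not None and is_v(nxt)
              and (nxt2 is None or not is_t(nxt2))
              and font_has(compose(cp, nxt))):
            out.append(compose(cp, nxt))   # LV only when no T follows
            i += 2
        else:
            out.append(cp)
            i += 1
    return out

has_all = lambda s: True
lacks_ac10 = lambda s: s != 0xAC10
print([hex(c) for c in preprocess([0xAC00, 0x11B7], has_all)])
print([hex(c) for c in preprocess([0xAC00, 0x11B7], lacks_ac10)])
```

With a font that has every syllable, sequences (1), (2) and (3) all reach the cmap as U+AC10; with a font lacking that glyph, they all reach it as the three separate jamo.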

>
> Ever since attending Jin-Hwan Cho's talk at TUG 2013, it's been on my to-do
> list to take a close look at Dohyun Kim's work in the HCR fonts.  Maybe now
> is a good time for me to do that.  I think the HCR fonts have a much
> different architecture from mine because of using no precomposed syllables,
> and many more on-the-fly layouts and jamo variants.  (I don't know if I
> clearly addressed a question from Jin-Hwan Cho in our discussions at the
> conference:  my fonts have at most five variants of each jamo, far fewer than
> HCR, *but* I only use those variants at all when there's no precomposed
> syllable.  The number of variants built into the precomposed syllables is
> far greater.)  Presumably the HCR fonts have to solve similar problems to
> mine of interacting predictably with the "shaper" and working well on a wide
> range of software, so their solutions may be useful.