[HarfBuzz] Hangul GSUB features

Sat Jan 25 11:17:47 PST 2014

On 25/1/14 17:36, mskala at ansuz.sooke.bc.ca wrote:

>>>      * Conditional on some assessment of the structure of the syllable
>>>        (perhaps the existence of a precomposed glyph?) the *jmo features may
>>>        be applied - presumably to the output of ccmp, if it was applied.
>>
>> Yes - remembering that the decision as to which *jmo feature, if any, applies
>> to a given glyph was made *before* ccmp, and knows nothing about any changes
>> that happened there.
>
> What happens to these decisions when ccmp makes substitutions?  If we have
> a single glyph L tagged for ljmo and ccmp replaces it with a single glyph,
> is the new glyph also tagged for ljmo?

Yes.

> If we have something like L tagged
> for ljmo followed by LV not tagged, and ccmp replaces the pair of them
> with a single LLV glyph, will the LLV glyph be tagged?

Yes (at least, I think that's right - it'd be worth double-checking). 
However, note that if you have, say, LV (not tagged for any *jmo 
feature) followed by T (tagged tjmo) and replace the pair with LVT, I 
don't think the resulting LVT will inherit the tjmo. When GSUB does a 
many-to-one substitution, the result inherits the feature flags of the 
first glyph in the input sequence, and the feature flags of the 
subsequent glyph(s) are lost.

> If we have
> something like a single LLL glyph tagged for ljmo (the shaper would do
> that, right?) and ccmp splits it into three glyphs L L L, which if any of
> the new glyphs will inherit the tagging status of the original?

All of them. A one-to-many substitution copies the feature flags of the 
original glyph to each of its replacements.
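
As a rough illustration of the inheritance rules described above (first 
glyph wins for many-to-one, flags copied for one-to-many), here is a 
small Python model. The Glyph class and the ligate/decompose helpers are 
hypothetical names invented for this sketch; they are not HarfBuzz's 
actual data structures or API, and the many-to-one behaviour is the one 
that is flagged above as worth double-checking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    """Hypothetical glyph record: a name plus the feature tags set on it."""
    name: str
    tags: frozenset = frozenset()


def ligate(inputs, out_name):
    """Many-to-one: the result keeps the tags of the FIRST input glyph only."""
    return Glyph(out_name, inputs[0].tags)


def decompose(glyph, out_names):
    """One-to-many: every output glyph gets a copy of the input's tags."""
    return [Glyph(name, glyph.tags) for name in out_names]


l = Glyph("L", frozenset({"ljmo"}))
lv = Glyph("LV")                       # precomposed, not tagged
t = Glyph("T", frozenset({"tjmo"}))

llv = ligate([l, lv], "LLV")           # inherits ljmo from the leading L
lvt = ligate([lv, t], "LVT")           # tjmo on the trailing T is lost
parts = decompose(Glyph("LLL", frozenset({"ljmo"})), ["L", "L", "L"])
```

In this model llv ends up tagged ljmo, lvt ends up with no tags at all, 
and all three glyphs in parts carry ljmo.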

The two problems you're facing, I think, with the current harfbuzz code 
in relation to the use of *jmo in your font are that:

(a) precomposed characters (LV, LVT) do not get tagged for any *jmo 
features, and if you decompose them with ccmp, the resulting glyphs 
still aren't tagged for *jmo (unlike the case where the shaper 
decomposes them); and

(b) sequences with multiple L, V and/or T jamos are not recognized as 
matching the <L, V [,T]?> pattern, and so do not get tagged for *jmo. In 
something like <L, L, L, V, V, V, T, T, T>, the only two glyphs that 
would be tagged for *jmo features would be the adjacent <L, V> pair; all 
the rest would be considered "not part of a valid syllable" and left 
untagged.
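
To make point (b) concrete, here is a minimal Python sketch of a tagger 
that only recognizes the <L, V [,T]?> pattern. It is an assumption about 
the general shape of the logic, written for illustration only; the 
function name tag_jamos and the use of plain "L"/"V"/"T" class strings 
are made up here and do not mirror the real harfbuzz code.

```python
def tag_jamos(classes):
    """Tag glyphs that form <L, V [,T]?>; everything else stays untagged.

    `classes` is a list of jamo classes ("L", "V", "T") and the result is
    a parallel list of feature tags, with None meaning "no *jmo feature".
    """
    tags = [None] * len(classes)
    i = 0
    while i < len(classes):
        if classes[i] == "L" and i + 1 < len(classes) and classes[i + 1] == "V":
            tags[i], tags[i + 1] = "ljmo", "vjmo"
            i += 2
            # An optional T directly after the V completes the syllable.
            if i < len(classes) and classes[i] == "T":
                tags[i] = "tjmo"
                i += 1
        else:
            # Not the start of a valid syllable: leave untagged, move on.
            i += 1
    return tags


simple = tag_jamos(["L", "V", "T"])
stacked = tag_jamos(list("LLLVVVTTT"))
```

For the simple syllable all three glyphs are tagged. For the stacked 
sequence only the third L and the first V are tagged, which matches the 
behaviour described in (b).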

But if you ignore the *jmo features altogether, and do everything in a 
series of ccmp lookups, I don't see why it shouldn't work as you intend.

>>> If I go this route, defining no *jmo tables, can I depend on ccmp and liga
>>> always being applied and always in that order?
>>
>> Currently, at least in harfbuzz, ccmp and liga (and the *jmo features, when
>> used) are all applied "together", with the order of lookups being their order
>
> What does applying them "together" mean?  Is it just that nothing other
> than feature application is done in between applying features, or are
> they somehow simultaneous?  In other words, does the output of each one
> become the input of the next, or are they all looking at the same input
> with the output somehow recombined?

What actually happens is more like the description in 
http://www.microsoft.com/typography/otspec/chapter2.htm:

"After choosing which features to use, the client assembles all lookups 
from the selected features. Multiple lookups may be needed to define the 
data required for different substitution and positioning actions, as 
well as to control the sequencing and effects of those actions.
To implement features, a client applies the lookups in the order the 
lookup definitions occur in the LookupList. As a result, within the GSUB 
or GPOS table, lookups from several different features may be 
interleaved during text processing."

So for the L glyph in an <L, V, T> sequence, for example, the selected 
features will include ljmo, as well as the "global" features ccmp and 
liga (and others such as rlig, locl, etc.). We collect a list of all the 
lookups from all these features, and apply those lookups in the order 
they're defined in the font's LookupList, *not* in any predetermined 
feature order.
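
The collection step can be sketched in a few lines of Python. The 
feature-to-lookup mapping below is invented for the example (a real font 
defines its own indices), and collect_lookups is a hypothetical helper, 
not a harfbuzz function.

```python
def collect_lookups(feature_lookups, enabled):
    """Gather lookup indices from all enabled features, in LookupList order.

    The order in which the features are listed is irrelevant; only the
    lookup indices decide the sequence of application.
    """
    return sorted({i for f in enabled for i in feature_lookups.get(f, [])})


# Made-up font: ccmp owns lookups 0 and 3, ljmo owns 1, liga owns 2.
feature_lookups = {"ccmp": [0, 3], "ljmo": [1], "liga": [2]}

order = collect_lookups(feature_lookups, ["liga", "ljmo", "ccmp"])

# Which feature each lookup in the sequence came from.
owner = {i: f for f, idxs in feature_lookups.items() for i in idxs}
sequence = [owner[i] for i in order]
```

Here sequence comes out as ccmp, ljmo, liga, ccmp: the second ccmp lookup 
runs after liga simply because it sits later in the LookupList, which is 
the interleaving the spec text describes.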

Some shapers - particularly the Indic one - do apply features in 
separate passes, because (unfortunately) that's how Microsoft chose to 
implement their Indic fonts and shaper, but we have not found this to be 
necessary for Hangul, and would prefer to avoid it.

>
> If I have glyphs L V T, with features ljmo and vjmo run in that order
> (glyph L tagged for ljmo and glyph V tagged for vjmo), and I want ljmo to
> change L into L.alt and vjmo to change V into V.alt, should vjmo contain a
> rule like "sub L.alt V' T" or like "sub L V' T"?

As you'll see from the above, this depends on how you order the lookups 
(rather than on a fixed feature order imposed by the shaper).
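
A toy Python model may help to show the dependency. It simplifies the 
rule in the question to a single preceding-glyph context (no lookahead 
for T), and apply_single is a hypothetical helper written for this 
sketch, not part of any shaper.

```python
def apply_single(glyphs, tags, feature, target, replacement, before=None):
    """Replace `target` with `replacement` where the glyph is tagged for
    `feature` and, if `before` is given, the preceding glyph is `before`."""
    out = list(glyphs)
    for i, g in enumerate(out):
        if g == target and tags[i] == feature:
            if before is None or (i > 0 and out[i - 1] == before):
                out[i] = replacement
    return out


glyphs = ["L", "V", "T"]
tags = ["ljmo", "vjmo", "tjmo"]

# ljmo lookup earlier in the LookupList: the vjmo context must say L.alt.
step = apply_single(glyphs, tags, "ljmo", "L", "L.alt")
ljmo_first = apply_single(step, tags, "vjmo", "V", "V.alt", before="L.alt")

# vjmo lookup earlier: its context still sees the original L.
step = apply_single(glyphs, tags, "vjmo", "V", "V.alt", before="L")
vjmo_first = apply_single(step, tags, "ljmo", "L", "L.alt")

# Mismatch: ljmo runs first, but the vjmo rule was written against plain L.
step = apply_single(glyphs, tags, "ljmo", "L", "L.alt")
mismatch = apply_single(step, tags, "vjmo", "V", "V.alt", before="L")
```

Both consistent orderings give L.alt V.alt T, while the mismatched case 
leaves V unchanged because its context never matches.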

>
> I thought that with multiple lookups in a single feature, substitution
> would still stop as soon as it found a match - so that the multiple
> lookups have the same effect as a single long lookup, with the advantages
> over really using a single long lookup being that using more than one
> allows sharing parts of tables among separate features, and splitting into
> more than one table allows representing runs of simpler rules in more
> concise table formats.
>
> But some quick experiments with FontForge suggest that in fact (at least
> in FontForge) it's as you imply:  with multiple lookups in a feature, each
> one is applied to the output of the previous one.  Thanks for bringing
> that to my attention!  It will make things a lot easier for me.

Perhaps you were confusing this with the case of multiple *subtables* 
within a single *lookup*. In this case, once a match occurs in one of 
the subtables, the lookup is considered to have finished, and the 
following subtables are not applied.

But multiple *lookups* within a single *feature* are definitely 
supported and used.
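
The difference between the two cases can be shown with a deliberately 
tiny Python model that works on one glyph at a time. Subtables are 
represented as plain dicts, and apply_lookup/apply_feature are 
hypothetical names for this sketch; a real lookup walks the whole glyph 
run and supports many more substitution types.

```python
def apply_lookup(glyph, subtables):
    """Within one lookup, the first subtable that matches wins and the
    remaining subtables are skipped."""
    for table in subtables:
        if glyph in table:
            return table[glyph]
    return glyph


def apply_feature(glyph, lookups):
    """Lookups run in sequence; each one sees the previous one's output."""
    for subtables in lookups:
        glyph = apply_lookup(glyph, subtables)
    return glyph


a_to_b = {"a": "b"}
b_to_c = {"b": "c"}

# One lookup with two subtables: stops after the first match.
one_lookup = apply_lookup("a", [a_to_b, b_to_c])

# Two lookups with one subtable each: the second acts on the first's output.
two_lookups = apply_feature("a", [[a_to_b], [b_to_c]])
```

With the same two rules, the single lookup turns a into b and stops, 
whereas the two separate lookups chain and turn a into c.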

>
> Something else I hadn't realized, but have just now verified at least in
> the case of FontForge, was that the order of tables in the font can
> override the "ccmp must be applied first" rule.  I thought that was
> advice for renderers, but apparently it's the font's responsibility to
> implement it by putting ccmp first in the file.

Yes - again, see above.

I have not tested whether Uniscribe behaves this way for Hangul, or 
whether it runs the features separately (as seems to be implied by the 
old documentation). Provided you design your lookups to be applied in 
the documented ccmp/ljmo/vjmo/tjmo/liga order *and* arrange the lookups 
this way in the font, it shouldn't matter whether the shapers run them 
"all at once" according to the generic OpenType spec or in separate passes.

JK