[HarfBuzz] Hangul GSUB features

Fri Jan 24 11:26:14 PST 2014

Hi, I'm the maintainer of the Jieubsida fonts.  Dohyun Kim kindly drew my
attention to the recent discussion on this list of changes to HarfBuzz's
hangul support and how it relates to these fonts, and I wanted to make some
comments and ask some questions.  This is a lengthy message, but I'm trying
to be very specific about the details, because those are important.

First of all, note that the name of the Korean fonts in the Tsukurimashou
project will probably be changing from "Jieubsida" to "Mandeubsida" in the
next version, in an effort to make it a better translation.

These fonts are hoped to be useful for typesetting Korean, but the
Tsukurimashou project is primarily focused on Japanese; a big part of
the Korean extension's purpose is to serve as a testbed for scaling the
associated software tools.  My own knowledge of the Korean language is
very limited.  As such, although I want it to be correct, I'm not eager to
sink huge amounts of time into maintenance.

These fonts are intended to be able to typeset the full range of hangul
defined in Unicode - including both the precomposed syllable code points and
the (basic and extended) individual jamo.  So I want to be able to
typeset all these code point sequences, and typeset them identically, using
a single glyph that is a precomposed syllable:

   1. U+1100 U+1161 U+11B7 (choseong-kiyeok jungseong-a jongseong-mieum)
   2. U+AC00 U+11B7        (syllable-ga jongseong-mieum)
   3. U+AC10               (syllable-gam)

I'm not an expert on Unicode canonical equivalence, but I believe these
three sequences are canonically equivalent to each other under the rules
in sections 3.7 and 3.12 of the current Unicode standard
(http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf).  Sequence 1 is the
canonical decomposition of all three.  If I'm reading the discussion of the
last few days correctly, it sounds like we're all more or less in agreement
on that.
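
As a quick check of that claim, here is a minimal sketch using the
unicodedata module from Python 3's standard library (my choice of tool for
illustration, not something from the earlier discussion).  It normalizes
the three sequences and compares the results:

```python
import unicodedata

# The three code point sequences numbered 1 to 3 above.
seq1 = "\u1100\u1161\u11B7"   # separate lead, vowel, and tail jamo
seq2 = "\uAC00\u11B7"         # precomposed LV syllable plus tail jamo
seq3 = "\uAC10"               # precomposed LVT syllable

nfd = [unicodedata.normalize("NFD", s) for s in (seq1, seq2, seq3)]
nfc = [unicodedata.normalize("NFC", s) for s in (seq1, seq2, seq3)]

# All three should decompose to sequence 1 and compose to sequence 3.
print(all(s == seq1 for s in nfd))
print(all(s == seq3 for s in nfc))
```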

I would also like to be able to typeset the extended compound jamo as nicely
as possible.  For instance, I would like these two sequences to both be
typeset with a single glyph that is a precomposed lead jamo cluster, to be
overlaid with additional glyphs for subsequent code points that would
describe the vowel and tail of the syllable:

   4. U+1107 U+1109 U+1110 (choseong-pieup choseong-sios choseong-thieuth)
   5. U+A972               (choseong-pieup-sios-thieuth)

Exactly which glyph is used for these two sequences should be
context-sensitive, determined by the following vowel and presence or absence
of a tail.  It looks to me like these may not be canonically equivalent
under Unicode; U+A972 does not canonically decompose, and I don't think
there is such a thing as canonical composition of jamo.  Nonetheless it
certainly appears that they should be understood as the same text,
describing the same fragment of a syllable.
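
The same kind of check (again only a sketch, assuming Python 3's standard
unicodedata module) suggests that normalization does not relate sequences
4 and 5 to each other:

```python
import unicodedata

seq4 = "\u1107\u1109\u1110"   # three single lead jamo
seq5 = "\uA972"               # one precomposed cluster code point

# An empty string here means the character database lists no
# decomposition mapping for U+A972.
print(repr(unicodedata.decomposition(seq5)))

# Compare the two sequences under each normalization form.
same = {form: unicodedata.normalize(form, seq4) ==
              unicodedata.normalize(form, seq5)
        for form in ("NFD", "NFC")}
print(same)
```

If neither form maps one sequence onto the other, any equivalence between
them has to come from the shaper or the font, not from normalization.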

On Mon Jan 20, Jonathan Kew writes:
> Is this actually important? Note that Windows behaves similarly, and so
> data that has "spelled-out" representations of complex jamos won't work
> there either. AIUI, the recommended practice is to use the precomposed
> Unicode characters such as U+A972 directly - and because these do *not*
> have decompositions, mixing the two forms will lead to confusion and
> problems for users. Perhaps it's better that the non-preferred spelling
> does not render "correctly".

Even if it's rare or discouraged for anyone to attempt to typeset sequences
like number 4 above, and even if Windows is broken, I would prefer that such
sequences should render correctly with my fonts and HarfBuzz.

The way the Jieubsida fonts are currently intended to work is that after the
cmap table translates code points into a stream of glyphs, the glyph
stream goes through the ccmp, ljmo, vjmo, and liga tables in that order.

In ccmp, the glyphs representing precomposed syllables like U+AC00 and
U+AC10 are split into their component jamo, and the glyphs representing
individual jamo are joined into glyphs representing clusters, where
possible.  Note that these tables of course operate on glyphs, not code
points - which becomes important later, when there are multiple glyphs for
the same nominal code point.  Although this isn't a deliberate design
feature, I think this table's effect is very similar to Unicode
canonicalization.  After this table, my code point sequences 1, 2, and 3
should all translate to the glyph sequence "uni1100 uni1161 uni11B7", and 4
and 5 to the single glyph "uniA972".

In the ljmo table, glyphs for lead (choseong) jamo are substituted depending
on the shape of the vowel (jungseong) and whether there is a tail
(jongseong) jamo.  In the case of "uni1100 uni1161 uni11B7", the vowel is in
the "vertical" class and there is a tail, so the table selects the "layout
1" variant and the glyph sequence becomes "uni1100.l1 uni1161 uni11B7".

In the vjmo table, glyphs for the vowel may be substituted similarly.  In
the particular case of "uni1100.l1 uni1161 uni11B7", the default glyph for
U+1161 is correct for layout 1 and so there's no change.  If there were no
tail, it would choose a different layout including a substitution for
uni1161.

Finally, in the liga table, any sequences for which precomposed glyphs exist
are replaced by the precomposed glyphs.  Since there is a "uniAC10" glyph
corresponding to the sequence "uni1100.l1 uni1161 uni11B7", it will be used.
At this point all three of my sequences 1, 2, and 3 are typeset the way I
want them and that's great.

But some things to note:  if ljmo is not applied, then "uni1100" will not
change to "uni1100.l1" and then liga will not substitute uniAC10, so all
three sequences break.  If ccmp is not applied at all, then "uniAC00" will
not change to "uni1100 uni1161", none of the subsequent tables will match,
and sequence 2 breaks.  If ccmp is applied, but is not applied FIRST, then
there again the other tables will not see the glyph sequences they're
expecting, and again sequence 2 breaks.  If liga is not applied, then
(assuming everything else happens as expected) we end up with "uni1100.l1
uni1161 uni11B7" - typesetting the syllable in "layout 1" as if there were
no precomposed glyph, which will look okay but not as good as the
precomposed glyph should (because the precomposed glyph has a more
finely-adjusted layout).
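
To make the dependencies concrete, here is a toy model of the pipeline in
Python.  It is only a sketch: the rule lists are hand-written stand-ins for
the handful of substitutions mentioned above, the names RULES,
apply_feature, and shape are made up for this example, and real GSUB
lookups are more involved than whole-sequence matching.

```python
# Toy model only: each feature is an ordered list of (match, replacement)
# rules over glyph names, with context written out as whole sequences.
# The rule lists are illustrative and are not the fonts' real tables.
RULES = {
    "ccmp": [(("uniAC10",), ("uni1100", "uni1161", "uni11B7")),
             (("uniAC00",), ("uni1100", "uni1161"))],
    "ljmo": [(("uni1100", "uni1161", "uni11B7"),
              ("uni1100.l1", "uni1161", "uni11B7"))],
    "vjmo": [],  # the default uni1161 is already right for layout 1
    "liga": [(("uni1100.l1", "uni1161", "uni11B7"), ("uniAC10",))],
}
FULL = ("ccmp", "ljmo", "vjmo", "liga")

def apply_feature(glyphs, rules):
    """Scan left to right, applying the first rule matching at each position."""
    out, i = [], 0
    while i < len(glyphs):
        for match, repl in rules:
            if tuple(glyphs[i:i + len(match)]) == match:
                out.extend(repl)
                i += len(match)
                break
        else:
            out.append(glyphs[i])  # no rule matched: copy the glyph through
            i += 1
    return out

def shape(glyphs, order=FULL):
    """Run the named features over the glyph stream in the given order."""
    for tag in order:
        glyphs = apply_feature(glyphs, RULES[tag])
    return glyphs

# Glyph streams that cmap would produce for sequences 1, 2, and 3.
CMAP_OUTPUT = {1: ["uni1100", "uni1161", "uni11B7"],
               2: ["uniAC00", "uni11B7"],
               3: ["uniAC10"]}

VARIANTS = {"intended order": FULL,
            "ljmo skipped": ("ccmp", "vjmo", "liga"),
            "ccmp skipped": ("ljmo", "vjmo", "liga"),
            "ccmp after ljmo": ("ljmo", "ccmp", "vjmo", "liga"),
            "liga skipped": ("ccmp", "ljmo", "vjmo")}

for label, order in VARIANTS.items():
    for n, glyphs in CMAP_OUTPUT.items():
        print(label, n, " ".join(shape(glyphs, order)))
```

In this toy model the intended order gives "uniAC10" for all three
sequences, and skipping ccmp leaves sequence 2 as "uniAC00 uni11B7".  It
also shows that running ccmp after ljmo leaves sequence 3 decomposed,
because uniAC10 is split too late for ljmo and liga to put it back
together.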

My code point sequences 4 and 5 don't describe a full syllable, but if one
constructs a full syllable by adding one or more vowel jamo and possibly tail
jamo, it will go through a similar process minus the precomposed-syllable
substitution at the end, because I have no precomposed syllables starting
with "pieup-sios-thieuth".  If ccmp runs and runs first, the result of the
whole process should look okay.  If ccmp does not run, then sequence 5
will result in good typesetting and sequence 4 won't; if ccmp runs but
does not run first, then sequence 5 may also end up incorrect depending on
the other jamo in the syllable.

The scheme above does everything I want it to do, with the versions of the
software I'm currently using.  With all due respect, it looks like you're
about to change HarfBuzz so that my fonts will no longer work, to tell me
that it's my own fault because I was doing it wrong all along, and to
suggest a way for me to redesign my fonts at considerable effort that will,
by design, not correctly handle all the cases the old one could correctly
handle.  This doesn't sound good to me, and I hope a better resolution is
possible.

On Thu Jan 23, Jonathan Kew writes:
> So I think this is a font error. The font is using ccmp to decompose the
> syllable AC00 into L and V jamos, but then expecting the shaper to apply
> *jmo features to the resulting glyphs. That doesn't work, because

That is (as far as it goes) a correct description of what I expected the
shaper to do.  It's also what current XeTeX [using an older HarfBuzz], older
XeTeX [using ICU], and FontForge [using its own code] all seem to do if the
appropriate features are turned on.  It's not clear whether the need to turn
the appropriate features on is because those pieces of software don't
support Korean at all, or because they do support Korean and are correctly
not invoking the features under some rule I've been unaware of.  Until now I
always thought it was because of a complete absence of support.

Microsoft's documentation on ccmp at
   https://www.microsoft.com/typography/otspec/features_ae.htm#ccmp
says:

# Tag: “ccmp”
#
# Friendly name: Glyph Composition/Decomposition
#
# Registered by: Microsoft
#
# Function: To minimize the number of glyph alternates, it is sometimes
# desired to decompose a character into two glyphs. Additionally, it may be
# preferable to compose two characters into a single glyph for better glyph
# processing. This feature permits such composition/decompostion. The feature
# should be processed as the first feature processed, and should be processed
# only when it is called.
#
# Example: In Syriac, the character 0x0732 is a combining mark that has a dot
# above AND a dot below the base character. To avoid multiple glyph variants
# to fit all base glyphs, the character is decomposed into two glyphs...a dot
# above and a dot below. These two glyphs can then be correctly placed using
# GPOS. In Arabic it might be preferred to combine the shadda with fatha
# (0x0651, 0x064E) into a ligature before processing shapes. This allows the
# font vendor to do special handling of the mark combination when doing
# further processing without requiring larger contextual rules.
#
# Recommended implementation: The ccmp table maps the character sequence to
# its corresponding ligature (GSUB lookup type 4) or string of glyphs (GSUB
# lookup type 2). When using GSUB lookup type 4, sequences that are made up of
# larger number of glyphs must be placed before those that require fewer
# glyphs.
#
# Application interface: For GIDs found in the ccmp coverage table, the
# application passes the sequence of GIDs to the table, and gets back the GID
# for the ligature, or GIDs for the multiple substitution.
#
# UI suggestion: This feature should be on by default.
#
# Script/language sensitivity: None.
#
# Feature interaction: This feature needs to be implemented prior to any other
# feature.

Note that it's not specific to any particular language, it's described as
something that should always run, and it's described as running before any
other feature.  Adobe's version of the specification says pretty much the
same thing.  Microsoft's language-specific documentation for Korean at
  https://www.microsoft.com/typography/OpenTypeDev/hangul/intro.htm
also repeatedly describes ccmp as running before *jmo features, although it
also uses language like "Apply feature 'ccmp' to preprocess any glyphs that
require composition" which seems to imply that ccmp might not always run.
It does not mention any possibility of the *jmo features not running.

It's because of these documents, with checking against XeTeX and FontForge,
that I've written the Jieubsida substitution features the way I have.  It
sounds like HarfBuzz's intended architecture works something like
this, which is significantly different from the "always run ccmp, ljmo,
vjmo, and liga, in that order" my code currently expects:

   * Some sort of composition or decomposition is applied at the level of
     code points (not glyphs) to find syllable boundaries.  This operation
     is not intended to handle sequences of single jamo joining to form
     compound jamo such as my sequence 4 above.  The mapping at this stage
     is part of the "shaper" and not specified by the font.
   * The code points, and recognized syllables, are translated to glyphs by
     cmap.  If precomposed glyphs exist, they are used directly; otherwise
     the glyph stream consists of L, V, T triples (T allowed to be null),
     with the expectation that clusters (more than one jamo in a single
     L/V/T slot) were already combined in the input.
   * It is not clear to me whether the ccmp table is applied unconditionally
     at this point, nor what the conditions for it are if it's conditional.
   * Conditional on some assessment of the structure of the syllable
     (perhaps the existence of a precomposed glyph?) the *jmo features may
     be applied - presumably to the output of ccmp, if it was applied.
   * It is not clear to me under what circumstances liga may be applied.
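
For reference, the code-point-level composition I have in mind in the first
point is the arithmetic from section 3.12 of the Unicode standard.  The
sketch below implements only that arithmetic; whether HarfBuzz does it this
way is my assumption, and the function name compose is made up for the
example.

```python
# Constants from the Hangul syllable composition algorithm in the
# Unicode standard, section 3.12.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28

def compose(lead, vowel, tail=None):
    """Return the precomposed syllable for basic jamo, or None if none exists."""
    l = lead - L_BASE
    v = vowel - V_BASE
    if not (0 <= l < L_COUNT and 0 <= v < V_COUNT):
        return None  # extended or cluster jamo: no precomposed syllable
    t = 0
    if tail is not None:
        t = tail - T_BASE
        if not (1 <= t < T_COUNT):
            return None  # not one of the 27 basic tail jamo
    return S_BASE + (l * V_COUNT + v) * T_COUNT + t

print(hex(compose(0x1100, 0x1161, 0x11B7)))  # sequence 1
print(compose(0xA972, 0x1161))               # cluster lead from sequence 5
```

Sequence 1 composes to U+AC10, while a syllable led by U+A972 falls outside
the arithmetic and has to be handled some other way.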

So my first real questions are:  what exactly does HarfBuzz intend to do?
Is the above description correct as far as it goes, and if not, what would
be a correct description?  What are the answers to the unknown points?

What processing happens before code points change into glyphs?  Under what
circumstances will ccmp be applied to the glyph stream?  Under what
circumstances will *jmo be applied, and will the input to *jmo be the output
of ccmp (should it be applied) or something else?  Under what circumstances
will liga be applied?

On a meta-level:  where (or if) HarfBuzz's intended design differs from what
I think the standards require (such points as "ccmp always runs, and is
always first"), am I reading the wrong standards?  Is HarfBuzz's behaviour
based on an authority like a standard, stronger than the observed behaviour
of other software such as Uniscribe?  Or if it's based on the observed
behaviour of other software, which other software and why?  Are these points
documented anywhere?

I would much prefer to have a clear description of what HarfBuzz is trying
to do and why, over advice on what Mandeubsida should do.  I don't expect
HarfBuzz's developers to alter their design to match what I think it should
be, not even if I think the standards may mandate such an alteration, and
I'm wary of altering my own design to suit a third-party package in
preference to my own reading of the standards.  Nonetheless, it sounds like
HarfBuzz developers do have some ideas regarding what I ought to do, and
since I want my fonts to work with HarfBuzz, those ideas are worth
thinking about.

On Thu Jan 23, Jonathan Kew writes:
> So the font is using the wrong strategy. It should be simplified to
> remove the syllable decompositions from ccmp; that's handled by the
> shaper itself. (And it doesn't need the liga feature to reassemble the
> original syllables, either, as the shaper won't decompose them unless
> actually necessary, e.g. to support an <LV, T> sequence.)

If I'm understanding HarfBuzz's intended operation and this description
correctly, my sequence 3 (a single precomposed syllable) will be recognized
as a precomposed syllable, NOT decomposed, and will go directly through to
the precomposed glyph; that's fine.  Sequences 2 (precomposed syllable plus
a tail) and 1 (separate lead, vowel, and tail, one of each) will be
recognized by the shaper (not by ccmp or liga) as adding up to a precomposed
syllable.  It's not clear to me whether HarfBuzz will then attempt to run
them through the *jmo features, but my guess is not - instead it will go
directly to the uniAC10 precomposed glyph.  That's good too.  So far it
sounds like I can get the desired behaviour just by removing the ccmp table,
and the recombination mappings from the liga table.  Less code needed from
me, still correct results, that's great.

With sequence 5 (a cluster of lead jamo expressed as a single code point),
the desired behaviour is one glyph each for the cluster lead, the vowel, and
the tail if any, with the lead and vowel substituted in a context-sensitive
way depending on the shape of the vowel and presence or absence of a tail.
That appears to be the case in which HarfBuzz will invoke *jmo features to
choose the right context-sensitive glyphs; but it's not clear to me exactly
what the input to these features will look like.  Presumably with
documentation or experiments, I can figure that out.  I may be lucky enough
to find that the current substitution tables will work unmodified.

With sequence 4 (multiple lead jamo expressed as single jamo code points,
resulting in a single glyph for the cluster, chosen context-sensitively) it
appears that HarfBuzz is not intended to support that case, and the strategy
described above should not be expected to produce correct results with this
code point sequence.  Note, also, that making the changes necessary to get
correct behaviour from the new HarfBuzz in the more common cases will
apparently result in fonts that do not work on software (including earlier
versions of HarfBuzz) where the current Jieubsida fonts do work, even in the
more common cases.  These points are issues for me.

On Thu Jan 23, Jonathan Kew also writes:
> The font should *not* use the generic ccmp feature to
> decompose it, unless it intends to do *everything* using generic global
> features, not the hangul-specific features.

Doing everything using generic global features may in fact be the best
solution for me.  Inasmuch as an OpenType contextual substitution table is a
finite-state transducer and such things are closed under composition, I can
reduce the current sequence of four tables which I want to all be applied
every time, to a sequence of fewer than four, maybe even just one table -
the size of that table may explode, but I can generate it algorithmically.
If I go this route, defining no *jmo tables, can I depend on ccmp and liga
always being applied and always in that order?  Is there some longer
sequence of global tables I can depend on always being applied and always in
a specific order?  Will the "shaper", even in the absence of *jmo tables,
perform some translations on the sequence of code points that I need to know
about in building my substitution table(s)?
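
As a rough illustration of what I mean by generating a combined table
algorithmically, here is a deliberately simplified Python sketch.  It
assumes each stage maps whole glyph sequences to whole glyph sequences,
which is much cruder than a real contextual lookup, and the stage contents
and the names run and flatten are invented for the example.

```python
# Simplifying assumption: each stage maps whole glyph sequences (tuples) to
# whole glyph sequences, and anything a stage does not list passes through.
CCMP = {("uniAC00", "uni11B7"): ("uni1100", "uni1161", "uni11B7"),
        ("uniAC10",): ("uni1100", "uni1161", "uni11B7")}
LJMO = {("uni1100", "uni1161", "uni11B7"):
        ("uni1100.l1", "uni1161", "uni11B7")}
VJMO = {}
LIGA = {("uni1100.l1", "uni1161", "uni11B7"): ("uniAC10",)}
STAGES = [CCMP, LJMO, VJMO, LIGA]

def run(stages, seq):
    """Apply each stage in order to one glyph sequence."""
    for stage in stages:
        seq = stage.get(seq, seq)
    return seq

def flatten(stages, inputs):
    """Precompute one table with the same net effect on the given inputs."""
    table = {}
    for seq in inputs:
        out = run(stages, seq)
        if out != seq:  # identity mappings need no entry
            table[seq] = out
    return table

inputs = [("uni1100", "uni1161", "uni11B7"),
          ("uniAC00", "uni11B7"),
          ("uniAC10",)]
single = flatten(STAGES, inputs)
for seq, out in single.items():
    print(" ".join(seq), "->", " ".join(out))
```

The real work would be enumerating every lead, vowel, and tail combination
as input, which is where the table size explodes.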

Ever since attending Jin-Hwan Cho's talk at TUG 2013, it's been on my to-do
list to take a close look at Dohyun Kim's work in the HCR fonts.  Maybe now
is a good time for me to do that.  I think the HCR fonts have a much
different architecture from mine because of using no precomposed syllables,
and many more on-the-fly layouts and jamo variants.  (I don't know if I
clearly addressed a question from Jin-Hwan Cho in our discussions at the
conference:  my fonts have at most five variants of each jamo, far fewer than
HCR, *but* I only use those variants at all when there's no precomposed
syllable.  The number of variants built into the precomposed syllables is
far greater.)  Presumably the HCR fonts have to solve similar problems to
mine of interacting predictably with the "shaper" and working well on a wide
range of software, so their solutions may be useful.

-- 
Matthew Skala
mskala at ansuz.sooke.bc.ca                 People before principles.
http://ansuz.sooke.bc.ca/