[HarfBuzz] Ligatures

Sat May 23 13:51:53 UTC 2020

On Sat, 23 May 2020 09:09:48 +0300
Eli Zaretskii <eliz at gnu.org> wrote:

> > Date: Fri, 22 May 2020 22:22:49 +0100
> > From: Richard Wordingham <richard.wordingham at ntlworld.com>
> >   
> > > The current support for producing ligatures works in the same way
> > > as complex text shaping for scripts that require that, like
> > > Arabic and Khmer: the sequences of characters that can be
> > > displayed as ligatures are identified in advance with suitable
> > > regular expressions, and the display engine then passes these
> > > sequences to hb_shape to produce the ligatures.
> > > 
> > > This works well for scripts that require complex shaping, because
> > > such scripts generally have well-defined rules for the sequences
> > > of codepoints that need shaping.  
> > 
> > They may of course have more than one set of such rules, with the
> > rule sets defining different sets of sequences.  
> 
> Who are "they" in this context?

Devanagari and Tai Tham are two examples I am aware of.

Devanagari has different rules for the positioning of Vedic marks: fonts
using the script tags dev and dev2 follow one set, while fonts using the
unofficial script tag dev3 follow the USE rules for character
ordering.  For tag dev, Microsoft says that <consonant, virama,
candrabindu, consonant> is one cluster; others, including Unicode, say
it's two.  Candrabindu in the middle and candrabindu at the end mean
different things; the former nasalises a consonant, while the latter
nasalises a vowel.  The visual distinction exists, at least when
half-forms are used.

Tai Tham has an issue with the mark U+1A58 TAI THAM SIGN MAI KANG LAI.
It is, at least formally, a non-spacing mark.  It occurs at the
juncture of two syllables in the same word.  Modern, printed Tai Khuen
happily treats it as syllable-final.  In more traditional styles, it
starts syllables, going above the first consonant, and so to the right
of a vowel mark reordered to the left hand side of the syllable.  Some
fonts seem to just let it hang over the start of the next syllable,
taking pot luck with what's there.  That gives two different syllable
structures.

As I supported the style found in a certain dictionary, it sometimes
belongs with the syllable before, and sometimes with the syllable after
it.  I therefore ended up defining the sequences to be shaped as a
sequence of one or more syllables joined together by U+1A58.
Fortunately, normal cursor motion is controlled by a different
definition.  (I'm still using Emacs 24.4 with the restoration of
interactive commands forward-char-intrusive and backward-char-intrusive
and their interface within the C code.)
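
A rough sketch of that definition as a regular expression, written in
Python rather than Emacs Lisp.  The syllable pattern is a deliberately
crude stand-in (one base letter followed by any run of Tai Tham marks
other than U+1A58), and the names SYLLABLE, SHAPING_RUN and
shaping_runs are hypothetical; a real composition rule would need a
much more careful syllable grammar.

```python
import re

MAI_KANG_LAI = "\u1A58"  # TAI THAM SIGN MAI KANG LAI

# Crude assumption: a syllable is one base letter (U+1A20..U+1A54)
# followed by any number of marks (U+1A55..U+1A7F) other than U+1A58.
SYLLABLE = "[\u1A20-\u1A54][\u1A55-\u1A57\u1A59-\u1A7F]*"

# A shaping run: one or more syllables joined together by U+1A58.
SHAPING_RUN = re.compile(f"{SYLLABLE}(?:{MAI_KANG_LAI}{SYLLABLE})*")

def shaping_runs(text):
    """Return the substrings that would be handed to the shaper as units."""
    return [m.group(0) for m in SHAPING_RUN.finditer(text)]
```

With this pattern, two syllables separated by U+1A58 come back as a
single run, whichever of the two the mark visually belongs to, while a
following syllable with no U+1A58 before it starts a new run.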

> I understand that the number of combinations is theoretically
> unbounded.  I'm asking if it is also unbounded in practice.  That is,
> do font designers add ligatures for arbitrary combinations of
> characters, regardless of some reasonable set of requirements?  For
> example, is the set of ligatures of Latin characters shown here:
> 
>   https://en.wikipedia.org/wiki/Orthographic_ligature#Latin_alphabet
> 
> reasonably complete, or should I expect any number of other arbitrary
> combinations of Latin characters popping up in fonts?  And if the
> latter, then what is the purpose of providing such arbitrary
> ligatures?

Doesn't the existence of ligatures for 'Eisenhower' and 'Chamberlain'
provide enough of an answer?

If you claim to support handwriting fonts, then you can expect others -
'sh', 'tt' and 'ing' are fairly obvious ones.  You may also find
ligatures being used to sort out kerning issues.

One problem I've observed with computer fonts is that the spacing of
glyphs in a string is not consistent.  This appears to be due to the
way the positioning of the glyphs is rounded.  The problem can be bad
enough that the designer ends up fixing the problem by combining them
into a single glyph, which formally is a ligature.  I've not noticed
this in ASCII fonts, but then I haven't looked hard at them.
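
A small arithmetic sketch of how rounding alone could produce such
uneven spacing.  The advance width of 7.4 pixels and the function name
rounded_gaps are made up for illustration; real rasterisers and
shapers may round differently.

```python
def rounded_gaps(advance, count):
    """Distances between glyph origins after rounding each origin to a pixel."""
    # Ideal origins are 0, advance, 2*advance, ...; each is snapped to
    # the nearest whole pixel independently.
    origins = [round(i * advance) for i in range(count)]
    return [b - a for a, b in zip(origins, origins[1:])]

# Five identical glyphs with a 7.4-pixel advance land at 0, 7, 15, 22, 30,
# so the gaps alternate between 7 and 8 pixels.
print(rounded_gaps(7.4, 5))
```

A single combined glyph sidesteps this, because the spacing inside it
is fixed by its outline rather than by per-glyph rounding.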

The 'tt' ligature can arise because the two t's are crossed by a
single stroke.  Crossing the 't' in 'lt' might be handled by a special
't' glyph, or one might just form an 'lt' ligature.  The ending 'ing'
is common enough that I unconsciously developed an abbreviated way of
writing it.

> I'm not talking about Arabic.  Emacs has a set of regular expressions
> for sequences of Arabic characters that need shaping, misc-lang.el in
> Emacs.  If the set is incomplete, we can augment it.

That regular expression treats every Arabic word as in need of shaping. 

> If a font requires special shaping for any sequence of any number of
> 26 (or maybe 52) ASCII letters, then the Emacs display engine will
> need to be redesigned.  So this extreme possibility doesn't bother me.

In general, they do require it.  But how is this worse than handling
Arabic?  Is the problem that you want to keep the option of splitting
words when wrapping lines of ASCII, but are not bothered about it for
Arabic or other human languages?  ASCII does not satisfyingly suffice
for English.

> > How would you handle the possibility that all three of <æ>, <a, e>
> > and <a, ZWJ, e> might be rendered by the same glyph, although they
> > are comprised of 1, 2 and 3 characters respectively?  
> 
> By using a composition rule that matches both <a, e> and <a, ZWJ, e>.
> The rules are regexp-based, and expressing the above as a regexp is
> simple.  Once a sequence of characters matches the regexp, Emacs calls
> the shaper (hb_shape etc.) to produce the font glyphs for the
> sequence, and displays the glyphs that the shaper returns.

I think you mean that Emacs would store the position of components by
an index that was the sequence of characters, not the glyph ID.  That
would also deal with precomposed characters - it would be the character
sequence that mattered, and for cursor movement and rendering,
the canonically equivalent sequence(s) and the precomposed character
would remain distinct.
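
A minimal sketch of such a rule, in Python rather than Emacs Lisp, on
the assumption that matches are recorded as character positions and
not as glyph IDs.  The names AE_RULE and composition_spans are
hypothetical and are not part of Emacs or HarfBuzz.

```python
import re

ZWJ = "\u200D"  # ZERO WIDTH JOINER

# One rule covering both <a, e> and <a, ZWJ, e>.
AE_RULE = re.compile(f"a{ZWJ}?e")

def composition_spans(text):
    """Return (start, end) character indices of each matched sequence."""
    return [m.span() for m in AE_RULE.finditer(text)]

# "ae", then "a ZWJ e", then the single character U+00E6.
print(composition_spans("ae a\u200De \u00E6"))
```

The two-character and three-character spellings are matched as spans
of different lengths, and the one-character <æ> is not matched at all,
so the three stay distinct in the text even if a font draws all of
them with the same glyph.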

Richard.