[HarfBuzz] Ligatures

Sat May 23 06:09:48 UTC 2020

> Date: Fri, 22 May 2020 22:22:49 +0100
> From: Richard Wordingham <richard.wordingham at ntlworld.com>
> 
> > The current support for producing ligatures works in the same way as
> > complex text shaping for scripts that require that, like Arabic and
> > Khmer: the sequences of characters that can be displayed as ligatures
> > are identified in advance with suitable regular expressions, and the
> > display engine then passes these sequences to hb_shape to produce the
> > ligatures.
> > 
> > This works well for scripts that require complex shaping, because such
> > scripts generally have well-defined rules for the sequences of
> > codepoints that need shaping.
> 
> They may of course have more than one set of such rules, with the rule
> sets defining different sets of sequences.

Who are "they" in this context?

> > However, I'm being told that this assumption is false, and that each
> > font defines ligatures from any number of arbitrary combinations of
> > characters, and therefore the exhaustive list of the ligatures is in
> > practice infinite and cannot be provided in advance.
> 
> This arbitrariness is true.  Over the set of all credible fonts for a
> given character repertoire, the number of ligating combinations is
> unbounded.

I understand that the number of combinations is theoretically
unbounded.  I'm asking if it is also unbounded in practice.  That is,
do font designers add ligatures for arbitrary combinations of
characters, regardless of some reasonable set of requirements?  For
example, is the set of ligatures of Latin characters shown here:

  https://en.wikipedia.org/wiki/Orthographic_ligature#Latin_alphabet

reasonably complete, or should I expect any number of other arbitrary
combinations of Latin characters popping up in fonts?  And if the
latter, then what is the purpose of providing such arbitrary
ligatures?

> > To be specific, I'm talking about 2 kinds of ligatures:
> > 
> >   . ligatures made of Latin characters, like "ffi" and "Th"
> >   . ligatures produced from symbols, like "==>" that is
> >     converted into ⟹

Yes, these are the only cases I'm asking about here.  I'm not asking
about shaping complex scripts such as Arabic, where this problem
doesn't exist AFAIK.
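
To make "identified in advance" concrete for these two kinds, here is
a minimal sketch in Python (not Emacs code).  The list of sequences is
a hypothetical sample, not taken from any font, and
ligature_candidates is a made-up helper name; the sketch only shows
how a finite list turns into a single regular expression that finds
the runs to hand to the shaper.

```python
import re

# Hypothetical sample of ligature sequences; a real font may define others.
LIGATURES = ["ffi", "ffl", "ff", "fi", "fl", "Th", "==>", "=>", "->"]

# Try longer alternatives first, so that "ffi" wins over "ff" and "fi".
PATTERN = re.compile(
    "|".join(re.escape(s) for s in sorted(LIGATURES, key=len, reverse=True))
)

def ligature_candidates(text):
    """Return (start, end, sequence) for each run that would go to the shaper."""
    return [(m.start(), m.end(), m.group()) for m in PATTERN.finditer(text)]
```

If fonts really do define arbitrary combinations, no such list can be
complete, which is exactly the question.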

> Have you addressed the cursive scripts yet, such as Arabic?  At its
> simplest, most consonants have four shapes, initial, medial, final and
> isolated, and roughly speaking the shape used depends on the adjacent
> spacing characters.  For the most part, Emacs would have to pass whole
> words into HarfBuzz for shaping.  In some of the more advanced fonts,
> the vowel marks in a word may also affect the shape of the consonant
> skeleton.  And of course, sometimes the Arabic script prefers to join
> letters vertically, as well as having a few straightforward ligatures.

I'm not talking about Arabic.  Emacs already has a set of regular
expressions for the sequences of Arabic characters that need shaping;
see misc-lang.el in the Emacs sources.  If that set is incomplete, we
can augment it.

> A cursive Latin script font may behave in the same way, with the shape
> of letters depending on what precedes and follows them.  With a small
> enough character repertoire, there might be no ligatures, but your
> rendering logic would fail miserably.

If a font requires special shaping for any sequence of any number of
26 (or maybe 52) ASCII letters, then the Emacs display engine will
need to be redesigned.  So this extreme possibility doesn't bother me.

> How would you handle the possibility that all three of <æ>, <a, e> and
> <a, ZWJ, e> might be rendered by the same glyph, although they are
> comprised of 1, 2 and 3 characters respectively?

By using a composition rule that matches both <a, e> and <a, ZWJ, e>.
The rules are regexp-based, and expressing the above as a regexp is
simple.  Once a sequence of characters matches the regexp, Emacs calls
the shaper (hb_shape etc.) to produce the font glyphs for the
sequence, and displays the glyphs that the shaper returns.
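
As an illustration only, a single rule covering both sequences could
look like the following Python sketch.  Emacs regexps use a different
syntax, and the `shape` argument is a hypothetical stand-in for the
call into the shaper, not a real HarfBuzz or Emacs API.

```python
import re

ZWJ = "\u200d"  # ZERO WIDTH JOINER

# One rule that matches both <a, e> and <a, ZWJ, e>.
AE_RULE = re.compile("a" + ZWJ + "?e")

def compose(text, shape):
    """Call `shape` (hypothetical stand-in for the shaper) on each matched run."""
    return [shape(m.group()) for m in AE_RULE.finditer(text)]
```

The precomposed <æ> is a single character and needs no rule; the font
maps it to a glyph directly.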

> And if Emacs is not imposing a normalisation, then all the
> precomposed characters in Unicode might have been entered as one or
> as more than one character?

If you are talking about composition with combining characters, Emacs
already has the rules to compose them as described above.  You can try
this in your Emacs: insert a, then U+0301 COMBINING ACUTE ACCENT, and
you should see them composed into a single glyph (provided that you
use a suitable font).
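
For reference, the "one or more than one character" situation can be
seen outside Emacs with Python's standard unicodedata module.  This is
only a sketch of the two encodings of the same text; it says nothing
about how Emacs itself stores or composes them, and same_text is a
made-up helper name.

```python
import unicodedata

decomposed = "a\u0301"   # a + U+0301 COMBINING ACUTE ACCENT (2 characters)
precomposed = "\u00e1"   # LATIN SMALL LETTER A WITH ACUTE (1 character)

def same_text(s, t):
    """Compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", s) == unicodedata.normalize("NFC", t)
```

Without normalization the two strings compare unequal, although they
are expected to display identically.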

But I'm not asking about character composition in general; I'm asking
specifically about ligatures of ASCII characters, without any
non-ASCII codepoints or combining accents.