[HarfBuzz] Ligatures

Richard Wordingham richard.wordingham at ntlworld.com
Fri May 22 21:22:49 UTC 2020


On Fri, 22 May 2020 22:32:04 +0300
Eli Zaretskii <eliz at gnu.org> wrote:

> Hi,
> 
> This is a bit off-topic, but I thought it could be appropriate to ask
> here, since we have here some of the best experts on this subject.
> 
> We are discussing support for ligatures in Emacs, specifically when
> using HarfBuzz as the shaping engine.  See the discussion from
> 
>   https://lists.gnu.org/archive/html/emacs-devel/2020-05/msg02493.html
> 
> The current support for producing ligatures works in the same way as
> complex text shaping for scripts that require that, like Arabic and
> Khmer: the sequences of characters that can be displayed as ligatures
> are identified in advance with suitable regular expressions, and the
> display engine then passes these sequences to hb_shape to produce the
> ligatures.
> 
> This works well for scripts that require complex shaping, because such
> scripts generally have well-defined rules for the sequences of
> codepoints that need shaping.

They may of course have more than one set of such rules, with the rule
sets defining different sets of sequences.

> My original thoughts were that
> ligatures could be supported in the same way, based on the assumption
> that the list of possible ligatures is finite and can be stored in a
> suitable data structure in advance.

At one level, this is true for any individual font, for it cannot have
more than 65,536 glyphs.

> However, I'm being told that this assumption is false, and that each
> font defines ligatures from any number of arbitrary combinations of
> characters, and therefore the exhaustive list of the ligatures is in
> practice infinite and cannot be provided in advance.

This arbitrariness is true.  Over the set of all credible fonts for a
given character repertoire, the number of ligating combinations is
unbounded.

> The only way of
> doing this right, I'm told, is to either (a) query the font to get the
> list of all the ligatures it supports, or (b) assume any combination
> of characters can produce a ligature, and therefore we need to pass
> all the characters intended for display through hb_shape.  The latter
> in particular is in stark contrast to how the current Emacs display
> code is designed and implemented.

> To be specific, I'm talking about 2 kinds of ligatures:
> 
>   . ligatures made of Latin characters, like "ffi" and "Th"
>   . ligatures produced from symbols, like "==>" that is
>     converted into ⟹
> 
> Can someone please tell what are the recommended practices regarding
> these ligatures?  Is the set of possible ligatures indeed infinite and
> impossible to know in advance?  And does HarfBuzz have APIs to query a
> font about the ligatures it supports?

Have you addressed the cursive scripts yet, such as Arabic?  At its
simplest, most consonants have four shapes, initial, medial, final and
isolated, and roughly speaking the shape used depends on the adjacent
spacing characters.  For the most part, Emacs would have to pass whole
words into HarfBuzz for shaping.  In some of the more advanced fonts,
the vowel marks in a word may also affect the shape of the consonant
skeleton.  And of course, sometimes the Arabic script prefers to join
letters vertically, as well as having a few straightforward ligatures.
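
As a rough illustration of the four shapes, Unicode's legacy Arabic
Presentation Forms-B block encodes the isolated, final, initial and
medial forms of BEH as separate compatibility characters.  The sketch
below only reads their decomposition tags from Python's standard
unicodedata module; the helper name positional_form is made up for the
example.  A real font selects these shapes by glyph substitution during
shaping, not through these code points.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, the character actually stored in text

# Legacy compatibility characters for the four positional shapes of BEH.
PRESENTATION_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]


def positional_form(ch):
    """Split a decomposition such as '<initial> 0628' into (tag, base char)."""
    tag, base = unicodedata.decomposition(ch).split()
    return tag.strip("<>"), chr(int(base, 16))


# Map each positional tag to its presentation-form character.
forms = {positional_form(ch)[0]: ch for ch in PRESENTATION_FORMS}

for tag, ch in forms.items():
    print(f"{tag:9} U+{ord(ch):04X} {unicodedata.name(ch)}")
```

All four entries decompose to the same base letter, so the shape
carries no information of its own: it is determined by the neighbours,
which is why the whole word has to reach the shaper.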

A cursive Latin script font may behave in the same way, with the shape
of letters depending on what precedes and follows them.  With a small
enough character repertoire, there might be no ligatures, but your
rendering logic would fail miserably.

How would you handle the possibility that all three of <æ>, <a, e> and
<a, ZWJ, e> might be rendered by the same glyph, although they are
comprised of 1, 2 and 3 characters respectively?  And if Emacs is not
imposing a normalisation, then all the precomposed characters in
Unicode might have been entered as one or as more than one character? 
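
A small sketch of the point, using only Python's standard unicodedata
module (the variable names are just for the example).  It shows that
the three spellings differ in length, that NFC does not merge them, and
that a precomposed letter such as é can equally arrive as two
characters:

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER

# Three spellings that a font might render with one and the same glyph.
candidates = {
    "ae_letter": "\u00e6",       # <æ>
    "a_e": "ae",                 # <a, e>
    "a_zwj_e": "a" + ZWJ + "e",  # <a, ZWJ, e>
}
lengths = {name: len(text) for name, text in candidates.items()}

# NFC leaves all three distinct: æ has no canonical decomposition.
nfc_forms = {unicodedata.normalize("NFC", text) for text in candidates.values()}

# A precomposed character and its decomposed spelling are canonically equal.
precomposed = "\u00e9"   # é as one character
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

print(lengths, len(nfc_forms), same_after_nfc)
```

So a table keyed on character sequences would need entries for every
spelling that can reach the display code, unless the text is normalised
first, and even then the æ case is not covered.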

Richard.

