[HarfBuzz] Ligatures

Sat May 23 14:22:58 UTC 2020

> Date: Sat, 23 May 2020 14:51:53 +0100
> From: Richard Wordingham <richard.wordingham at ntlworld.com>
> 
> > > They may of course have more than one set of such rules, with the
> > > rule sets defining different sets of sequences.  
> > 
> > Who are "they" in this context?
> 
> Devanagari and Tai Tham are two examples I am aware of.

Emacs supports more than one rule for each composable sequence of
characters.

> Devanagari has different rules for positioning of Vedic marks between
> fonts using the script tags dev and dev2 for it on one hand and the
> unofficial script tag dev3, which follows the USE rules for character
> ordering.  For tag dev, Microsoft says that <consonant, virama,
> candrabindu, consonant> is one cluster; others, including Unicode, say
> it's two.  Candrabindu in the middle and candrabindu at the end mean
> different things; the former nasalises a consonant, while the latter
> nasalises a vowel.  The visual distinction exists, at least when
> half-forms are used.

See the rules set up near the end of indian.el in Emacs.  If they
don't cover what you describe, we can add more.

> > I'm not talking about Arabic.  Emacs has a set of regular expressions
> > for sequences of Arabic characters that need shaping; see
> > misc-lang.el in Emacs.  If the set is incomplete, we can augment it.
> 
> That regular expression treats every Arabic word as in need of shaping. 
> 
> > If a font requires special shaping for any sequence of any number of
> > 26 (or maybe 52) ASCII letters, then the Emacs display engine will
> > need to be redesigned.  So this extreme possibility doesn't bother me.
> 
> In general, they do require it.  But how is this worse than handling
> Arabic?

I don't know.  Maybe it isn't.  Or maybe the slowdown while displaying
ASCII and moving the cursor through it will be unbearable.

> Is the problem that you want to keep the option of line
> wrapping splitting words for ASCII, but are not bothered for Arabic or
> other human languages?

Does Emacs indeed fail to wrap Arabic text?  Can you show an example?

> > > How would you handle the possibility that all three of <æ>, <a, e>
> > > and <a, ZWJ, e> might be rendered by the same glyph, although they
> > > are comprised of 1, 2 and 3 characters respectively?  
> > 
> > By using a composition rule that matches both <a, e> and <a, ZWJ, e>.
> > The rules are regexp-based, and expressing the above as a regexp is
> > simple.  Once a sequence of characters matches the regexp, Emacs calls
> > the shaper (hb_shape etc.) to produce the font glyphs for the
> > sequence, and displays the glyphs that the shaper returns.
> 
> I think you mean that Emacs would store the position of components by
> an index that was the sequence of characters, not the glyph ID.  That
> would also deal with precomposed characters - it would be the character
> sequence that mattered, and for cursor movement and rendering,
> the canonically equivalent sequence(s) and the precomposed character
> would remain distinct.

Sorry, I don't follow: what do you mean by "store"?  Emacs stores the
rules used to compose characters, and it stores the results of the
compositions already done by applying those rules, as part of
displaying some chunk of text.  Which one of these did you have in
mind?
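
To make the distinction concrete, here is a rough sketch in Python (not
Emacs Lisp, and not the actual Emacs implementation) of the two kinds
of stored data: the regexp-based rules, and the results of compositions
already performed.  All names below (COMPOSITION_RULES,
composition_cache, fake_shape, compose_at) are hypothetical, and
fake_shape merely stands in for a call to a real shaper such as
hb_shape.

```python
import re

ZWJ = "\u200d"  # ZERO WIDTH JOINER

# (1) The stored rules: regexps describing composable character sequences.
# This single rule matches both <a, e> and <a, ZWJ, e>.
COMPOSITION_RULES = [re.compile("a" + ZWJ + "?e")]

# (2) The stored results: glyphs already produced for a matched sequence,
# keyed by the character sequence itself, not by glyph ID.
composition_cache = {}


def fake_shape(text):
    """Hypothetical stand-in for the shaper; a real one would consult the font."""
    return ["ae.liga"]


def compose_at(text, pos):
    """Return (matched sequence, glyphs) if a rule matches at pos, else None."""
    for rule in COMPOSITION_RULES:
        m = rule.match(text, pos)
        if m:
            seq = m.group(0)
            if seq not in composition_cache:
                # Shape only once per distinct character sequence.
                composition_cache[seq] = fake_shape(seq)
            return seq, composition_cache[seq]
    return None
```

In this sketch <a, e> and <a, ZWJ, e> may end up with the same glyph,
yet they remain separate cache entries because the key is the character
sequence; the precomposed <æ> matches no rule and is left to ordinary
display.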