[HarfBuzz] Ligatures

Richard Wordingham richard.wordingham at ntlworld.com
Sun May 24 19:27:26 UTC 2020


On Sun, 24 May 2020 17:18:27 +0300
Eli Zaretskii <eliz at gnu.org> wrote:

> > Date: Sat, 23 May 2020 21:42:24 +0100
> > From: Richard Wordingham <richard.wordingham at ntlworld.com>

> > > As for different scripts: if the character codepoints are the
> > > same, Emacs currently assigns each character to a single script.  

> > I'll need to dig deeper.  Composition of both 'a' and Greek alpha
> > with an acute accent works, which suggests that the problem isn't
> > there for characters with a script property of 'inherited'.  

> Emacs currently leaves it up to HarfBuzz to guess the script, as it
> doesn't yet have the necessary smarts.

I thought the issue lay within Emacs.  HarfBuzz has been fairly
civilised about combining marks in the 'wrong' script run.  If I put
Thai marks in what is basically a Tai Tham script run, it seems to
treat them properly.  I do such a strange thing because the marks have
been borrowed into Tai Tham, but not yet encoded.  I was told I
couldn't do this in Emacs 24.

It seems to me that Emacs knows what script a cluster is in; perhaps
it just hasn't united the concepts.  Users may have written some weird
clustering combinations, and I can imagine some weird combinations in
the Private Use Areas.  I should investigate.

> > The behaviour in 27.05 is almost the same as for 24.4, but the
> > breaking in item (1) is automatically repaired.

> > Pressing the 'delete' key still deletes a single character, but that
> > may be because it's mapped to tpu-delete-current-char.  

It's OK, it's still working with emacs -q.  That means one can easily
replace the initial character of a cluster.

> If you press DEL (or Backspace), it will delete a single codepoint.

That only deletes the final cluster.

> > So, what's not working in Arabic is that one can't move the cursor
> > through ligatures.  
> 
> That's a feature (you can disable it with disable-point-adjustment).

Is this documented in info, or does one have to trawl the code to find
out what it does?  It seems that Emacs needs several levels of movement
- by codepoints, by grapheme cluster, by akshara (will be the same as
grapheme cluster in many cases) and by HarfBuzz cluster, or whatever
is used to make access into lam-alif impossible. Visible motion by
akshara is the minimum requirement for English, so that stepping
through 'ffi' will visibly advance the cursor.  LibreOffice Writer aims
to provide visible cursor motion at the grapheme cluster level, so one
can use the cursor to step through the consonants in an akshara.

By codepoint is useful for editing complex aksharas; it is even more
useful if the cursor acts like a cluster terminator, but that is
probably a matter of personal taste.  It will also be useful for
editing narrow phonetic transcriptions, which can be quite heavy on
diacritics.

By grapheme cluster (at least, by default grapheme cluster) is the level
encouraged by Unicode, and will give you letter-by-letter control even
if you're editing Sanskrit in an Indian script.  For Arabic, European
and Hebrew scripts, this is the same as akshara level.
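
To make the difference between the codepoint level and the grapheme
cluster level concrete, here is a rough Python sketch.  It is only an
approximation of the default grapheme cluster rules in UAX #29 (a base
character plus any following combining marks), and it is not how Emacs
or HarfBuzz segments text.  The function names are made up for
illustration.

```python
import unicodedata


def is_mark(ch):
    """True for combining marks (general categories Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def approx_grapheme_clusters(text):
    """Split text into a base character plus any following marks.

    Simplified stand-in for UAX #29: it ignores Hangul syllables,
    ZWJ sequences, regional indicators, CR LF and Prepend characters.
    """
    clusters = []
    for ch in text:
        if clusters and is_mark(ch):
            clusters[-1] += ch      # mark attaches to the current cluster
        else:
            clusters.append(ch)     # any other character starts a new one
    return clusters


latin = "a\u0301"              # 'a' + COMBINING ACUTE ACCENT
greek = "\u03b1\u0301"         # GREEK SMALL LETTER ALPHA + COMBINING ACUTE ACCENT
ksha = "\u0915\u094d\u0937"    # DEVANAGARI KA + VIRAMA + SSA

for sample in (latin, greek, ksha, "ffi"):
    # Codepoint stops versus (approximate) grapheme cluster stops.
    print(len(sample), len(approx_grapheme_clusters(sample)))
```

Under this simplified rule the accented 'a' and alpha are each one
stop, while KA + VIRAMA + SSA is two stops (KA with its virama, then
SSA), which is the letter-by-letter behaviour described above.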

By akshara is the current default movement level for most Indian scripts
in Emacs.  It is also the level at which most Hindi speakers
claim to operate.  (I get the impression, however, that a lot of
Indians do their fine-level editing of complicated text in
transliteration!)
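
As a rough illustration of the akshara level, the Python sketch below
extends the 'base plus marks' rule so that a virama also pulls in the
following letter.  This is my own simplification, not the rule Emacs
uses: it assumes that a canonical combining class of 9 identifies a
conjoining virama, which is not true of every script, and it ignores
ZWJ and ZWNJ.  The names are hypothetical.

```python
import unicodedata

VIRAMA_CCC = 9  # canonical combining class used for viramas


def approx_aksharas(text):
    """Group consonant + virama + consonant sequences with their marks."""
    units = []
    join_next = False  # True when the previous character was a virama
    for ch in text:
        cat = unicodedata.category(ch)
        # A mark always extends the unit; a letter extends it only
        # when it directly follows a virama.
        extends = cat.startswith("M") or (join_next and cat.startswith("L"))
        if units and extends:
            units[-1] += ch
        else:
            units.append(ch)
        join_next = unicodedata.combining(ch) == VIRAMA_CCC
    return units


# KA VIRAMA SSA, TA VIRAMA RA VOWEL SIGN I, YA
word = "\u0915\u094d\u0937\u0924\u094d\u0930\u093f\u092f"
print(len(word), len(approx_aksharas(word)))
```

For text without viramas, such as 'ffi', this gives the same stops as
the grapheme cluster level.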

By HarfBuzz cluster takes you to the level where HarfBuzz will easily
give you cursor positions.  Now occasionally HarfBuzz's actual clusters
won't combine whole grapheme clusters or aksharas.  For example, Thai
vowels could be roughly placed for Thai without taking account of
the previous letters, just as on typewriters, and one can even handle
Thai tone marks like that.  It's possible that in these cases, HarfBuzz
will not form clusters.  How you handle these cases is up to you.  I
would make 'by HarfBuzz cluster' the coarsest.
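
To show what 'cursor positions from HarfBuzz clusters' means, here is
a small Python sketch.  It does not call HarfBuzz; it takes the
per-glyph cluster values that a shaper reports (the cluster field of
each glyph info) and derives the cursor stops they allow.  The cluster
lists are hypothetical shaping results written by hand, and the sketch
assumes the cluster values are codepoint indices into the text.

```python
def cursor_stops(cluster_values, text_length):
    """Return the cursor positions implied by per-glyph cluster values.

    Each distinct cluster value is the start of a cluster; the end of
    the text is always a stop.  Sorting also copes with right-to-left
    runs, where glyphs arrive in visual order with descending values.
    """
    return sorted(set(cluster_values) | {text_length})


# Hypothetical: 'ffi' shaped as a single ligature glyph.
print(cursor_stops([0], 3))

# Hypothetical: 'ffi' shaped as three separate glyphs.
print(cursor_stops([0, 1, 2], 3))

# Hypothetical: seen, lam, alif, meem with a lam-alif ligature,
# glyphs listed in visual (right-to-left) order.
print(cursor_stops([3, 1, 0], 4))
```

In the ligated cases there is no stop inside the ligature: position 1
or 2 inside 'ffi', or position 2 between lam and alif, cannot be
reached from the cluster values alone.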

I don't think motion by HarfBuzz cluster is useful - perhaps you know
of a use.

Richard.

