[HarfBuzz] Clarifying graphemes, HB-cluster, carets and string manipulation

Fri Jan 23 16:49:37 PST 2015

On Fri, 23 Jan 2015 22:45:19 +0100
Diederick Huijbers <diederickh at gmail.com> wrote:

> This seems to be a 1:1 match, but my biggest question is how I can
> map the ICU boundaries
> to the correct HB-buffer/clusters?

HarfBuzz cluster n starts at position n, assuming you're loading UTF-16
strings into HarfBuzz.  Some ICU boundaries will not have a
corresponding HarfBuzz cluster boundary.  For example, when rendering
English, the three character string "fit" will have ICU boundaries at
positions 0, 1, 2 and 3, but, for a good font, there will be only two
HarfBuzz clusters, those starting at positions 0 and 2.  The reason is
that for English, "fi" is best rendered by a ligature, so the boundary
at position 1 falls inside a cluster.
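
As a rough illustration of that mapping, here is a minimal sketch in
Python.  It does not call ICU or HarfBuzz: caret_stops is a
hypothetical helper, and the boundaries and cluster values for "fit"
are written out by hand rather than obtained from a shaper.

```python
def caret_stops(grapheme_boundaries, cluster_values, text_length):
    """Split grapheme boundaries into those that coincide with a cluster
    boundary and those that fall inside a cluster."""
    # A cluster boundary exists at every distinct cluster value,
    # plus one at the end of the text.
    cluster_starts = set(cluster_values) | {text_length}
    direct = [b for b in grapheme_boundaries if b in cluster_starts]
    inside = [b for b in grapheme_boundaries if b not in cluster_starts]
    return direct, inside


# "fit" shaped with an fi ligature: two glyphs, with cluster values 0 and 2.
direct, inside = caret_stops([0, 1, 2, 3], [0, 2], 3)
```

Boundaries in the second list need separate handling, such as
interpolation or snapping to the enclosing cluster.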

While ligatures can mostly be handled by evenly splitting the glyph
between the components, sometimes this is spectacularly wrong.  For
example, Khmer <U+1780 KHMER LETTER KA, U+17D2 KHMER SIGN COENG, U+179A
KHMER LETTER RO> formally splits into grapheme clusters <U+1780, U+17D2> and
<U+179A>, but although it appears as two glyphs, the left-hand one
derives from <U+17D2, U+179A> and the right-hand one derives from
<U+1780>.  HarfBuzz reports the string as a single cluster.

> *String manipulation:*
> When I want the user to manipulate the text inside the input field,
> with e.g. delete
> and backspace keys, should I manipulate the graphemes? or the UTF-8
> codepoints?
> or maybe something else?

Standard practice is to kick the users of complex scripts in the teeth
and deny them access to characters inside a 'grapheme cluster'.  (In
one script I work with, having 3 or 4 marks within a grapheme cluster is
not unusual.  Correcting the base character is impossible - I have to
retype the entire cluster.) Deleting backwards just deletes one
character, while deleting forwards deletes a whole grapheme cluster.
The left and right arrows move one grapheme cluster at a time.
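
The editing behaviour just described could be sketched like this.  It
is only an illustration under stated assumptions: the function names
are hypothetical, the text is a Python str indexed by code point
rather than by UTF-16 code unit, and the sorted list of grapheme
boundaries is assumed to come from elsewhere (ICU, for example).

```python
import bisect


def backspace(text, caret):
    """Delete the single character before the caret."""
    if caret == 0:
        return text, caret
    return text[:caret - 1] + text[caret:], caret - 1


def delete_forward(text, caret, boundaries):
    """Delete from the caret up to the next grapheme boundary."""
    i = bisect.bisect_right(boundaries, caret)
    if i == len(boundaries):
        return text, caret
    return text[:caret] + text[boundaries[i]:], caret


def move_right(caret, boundaries):
    """Step to the next grapheme boundary, if there is one."""
    i = bisect.bisect_right(boundaries, caret)
    return boundaries[i] if i < len(boundaries) else caret


def move_left(caret, boundaries):
    """Step to the previous grapheme boundary, if there is one."""
    i = bisect.bisect_left(boundaries, caret)
    return boundaries[i - 1] if i > 0 else caret


# 'e' with two combining marks, followed by 'x'.
sample = "e\u0301\u0323x"
sample_boundaries = [0, 3, 4]
```

The boundaries have to be recomputed after every edit, since deleting
a character can merge or split clusters.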

I haven't worked out how cursor positioning is done for grapheme
clusters merged by ligatures.  Perhaps it is done by interpolation for
European scripts and simply given up on for Indic scripts, snapping the
cursor to the boundaries of the HarfBuzz cluster.
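
The two strategies guessed at above might look like the following.
This is a sketch, not how any particular toolkit does it: caret_x and
its parameters are hypothetical, the run is assumed to be
left-to-right, and the cluster is assumed to be drawn as a single
span of known advance width.

```python
def caret_x(offset, cluster_start, cluster_end, x_start, advance,
            interpolate=True):
    """Return the x position of a caret at character `offset`, for a
    cluster covering [cluster_start, cluster_end) drawn at x_start."""
    if offset <= cluster_start:
        return x_start
    if offset >= cluster_end:
        return x_start + advance
    if interpolate:
        # Divide the advance evenly between the characters of the cluster.
        length = cluster_end - cluster_start
        return x_start + advance * (offset - cluster_start) / length
    # Otherwise snap to the nearer cluster edge; ties go to the start.
    if offset - cluster_start <= cluster_end - offset:
        return x_start
    return x_start + advance
```

For the "fi" ligature, caret_x(1, 0, 2, 0.0, 10.0) puts the caret
halfway across the glyph.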

LibreOffice 4.3.3.2 currently gets very confused by the sequence
<U+1A2F TAI THAM LETTER DA, U+1A60 TAI THAM SIGN SAKOT, U+1A45 TAI THAM
LETTER WA, U+1A60, U+1A75 TAI THAM SIGN TONE-1, U+1A3F TAI THAM LETTER
LOW YA, U+1A20 TAI THAM LETTER HIGH KA>.

<U+1A60, U+1A45> and <U+1A75> are rendered as non-spacing glyphs.
<U+1A60, U+1A3F> is rendered as a spacing combining mark which starts
below the base character.

The grapheme clusters are <U+1A2F, U+1A60>, <U+1A45, U+1A60, U+1A75>,
<U+1A3F> and <U+1A20>.
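
That segmentation can be reproduced with a deliberately crude rule:
attach every nonspacing or enclosing mark to the preceding character.
This is an approximation of the Unicode grapheme cluster rules, not
an implementation of them (it ignores Hangul, CR LF, emoji sequences
and so on), and simple_clusters is a hypothetical name.

```python
import unicodedata


def simple_clusters(text):
    """Approximate grapheme clusters by attaching each nonspacing (Mn)
    or enclosing (Me) mark to the character before it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Me"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


tai_tham = "\u1A2F\u1A60\u1A45\u1A60\u1A75\u1A3F\u1A20"
for cluster in simple_clusters(tai_tham):
    print(" ".join("U+%04X" % ord(ch) for ch in cluster))
```

A real application should use ICU's break iterator instead.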

The successive cursor positions are: 

Before U+1A2F (correct)
After U+1A2F (defensible)
3/4 of the way through U+1A20 (wildly wrong!)
Before U+1A20 (correct)
After U+1A20 (correct)

A civilised method of cursor positioning for knowledgeable users is to
disable shaping of a cluster when the cursor is within the cluster -
the user can then see what he is doing. This is particularly useful if
transposing characters results in a visually identical but canonically
inequivalent string. The disadvantage is that there may be significant
reflow issues when working with paragraphs.  There doesn't seem to be a
convention for switching between stepping by grapheme and stepping by
character.

Richard.