[HarfBuzz] Clarifying graphemes, HB-cluster, carets and string manipulation

Sat Jan 24 04:11:27 PST 2015

On Sat, 24 Jan 2015 09:52:32 +0100
Diederick Huijbers <diederickh at gmail.com> wrote:

(I'm assuming the post was meant to be directed to the list - there are
others there with more experience than me.)

> Thanks for your explanation. You're describing a situation where you
> load UTF-16 strings into HarfBuzz, though I'm using UTF-8 strings. I
> guess it's the same for UTF-8?

I didn't look carefully enough at your post.  I somehow thought you got
cluster numbers 0 to 10.  You didn't; you got cluster numbers 0, 3, 6,
9, 10, 13, 16, 19, 20, 23 and 26.  For UTF-8 input, each cluster number
is the byte offset of the first character in that HarfBuzz cluster.
The reported cluster numbers are, by design, weakly monotonic as one
progresses through the list of glyphs - increasing for LTR writing and
decreasing for RTL writing.

The ICU positions translate to byte offsets as:

Position 0 = Byte offset 0 
Position 1 = Byte offset 3 
Position 2 = Byte offset 6 
Position 3 = Byte offset 9 
Position 4 = Byte offset 10 (previous character was ASCII space) 
Position 5 = Byte offset 13 
Position 6 = Byte offset 16
Position 7 = Byte offset 19
Position 8 = Byte offset 20
Position 9 = Byte offset 23
Position 10 = Byte offset 26 
Position 11 = Byte offset 29 (end of string, so no cluster, no glyphs)

The ICU positions are 16-bit word offsets in UTF-16.  I don't know if
there is a UTF-8 interface; I believe ICU word segmentation that needs
dictionary lookup is broken for UTF-8.
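
As a rough illustration of the table above, the conversion from a UTF-16
position to a UTF-8 byte offset might be sketched in Python as below.
The function name and the sample string are hypothetical, chosen only
for illustration: the sample is three 3-byte characters, a space, three
more, a space, and three more, which is the shape the offsets above
imply.  Nothing here is ICU or HarfBuzz API.

```python
def utf16_pos_to_utf8_offset(text: str, pos: int) -> int:
    """Convert a UTF-16 code-unit position into a UTF-8 byte offset."""
    units = 0   # UTF-16 code units consumed so far
    offset = 0  # UTF-8 bytes consumed so far
    for ch in text:
        if units >= pos:
            break
        # Characters outside the BMP take two UTF-16 code units.
        units += 2 if ord(ch) > 0xFFFF else 1
        offset += len(ch.encode("utf-8"))
    if units != pos:
        raise ValueError("position is past the end or splits a surrogate pair")
    return offset


# Stand-in text (an assumption, not the string from the original post).
sample = "\u1A2F\u1A2F\u1A2F \u1A2F\u1A2F\u1A2F \u1A2F\u1A2F\u1A2F"
print([utf16_pos_to_utf8_offset(sample, p) for p in range(12)])
# [0, 3, 6, 9, 10, 13, 16, 19, 20, 23, 26, 29]
```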

> I'm still trying to find a solution to map ICU graphemes to HarfBuzz
> glyphs so I can calculate the X-offset of the caret I'm drawing. Can
> someone maybe describe how to use the HarfBuzz API and/or ICU library
> to do that?

So, in a simple case, to locate the boundary at position 1, one
proceeds as follows:

Position 1 = byte offset 3
There is a cluster at '3', so add up the advance widths of the glyphs
in the preceding clusters.

That is the basic algorithm, which will work for straightforward writing
systems like Vietnamese or Chinese so long as ligatures are avoided.
The rule is slightly different for RTL scripts. Complications arise
with ligatures and with Indic rearrangement.  Unfortunately, Unicode
took Devanagari as the prototypical Indic script, but half-forms are
not an early Indic feature.
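
A minimal Python sketch of that basic LTR algorithm follows, assuming
the shaper output has already been reduced to (cluster, advance) pairs.
The function name, the tuple layout and the advance widths are
hypothetical; only the cluster values 0, 3, 6, 9 are taken from the
thread.  It does not attempt the RTL rule or any ligature handling, and
simply reports None when no cluster starts at the requested offset.

```python
from typing import List, Optional, Tuple

# (cluster value, advance width) per glyph, in visual order for LTR text.
Glyph = Tuple[int, int]


def caret_x_ltr(glyphs: List[Glyph], byte_offset: int, text_len: int) -> Optional[int]:
    """Return the x position of a boundary, or None if no cluster starts there."""
    if byte_offset == text_len:
        # End of text: the caret sits after every glyph.
        return sum(advance for _, advance in glyphs)
    if all(cluster != byte_offset for cluster, _ in glyphs):
        # The boundary falls inside a cluster, so there is no data for it.
        return None
    # Sum the advances of all glyphs belonging to earlier clusters.
    return sum(advance for cluster, advance in glyphs if cluster < byte_offset)


# Invented advances for a 10-byte text; one glyph per cluster.
glyphs = [(0, 600), (3, 600), (6, 600), (9, 250)]
print(caret_x_ltr(glyphs, 3, 10))  # 600
print(caret_x_ltr(glyphs, 9, 10))  # 1800
print(caret_x_ltr(glyphs, 4, 10))  # None
```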

Let us return to my Tai Tham example:

<U+1A2F TAI THAM LETTER DA, U+1A60 TAI THAM SIGN SAKOT, U+1A45 TAI THAM
LETTER WA, U+1A60, U+1A75 TAI THAM SIGN TONE-1, U+1A3F TAI THAM LETTER
LOW YA, U+1A20 TAI THAM LETTER HIGH KA>

The grapheme cluster starts and contents are:

pos=0 byte offset=0 cpts: 1A2F, 1A60
pos=2 byte offset=6 cpts: 1A45, 1A60, 1A75
pos=5 byte offset=15 cpts: 1A3F
pos=6 byte offset=18 cpts: 1A20
pos=7 byte offset=21

HarfBuzz reports two clusters, at offsets 0 and 18.

The glyphs, with advance widths in brackets, are:

Cluster 0: uni1A2F(1212), uni1A601A45(0), uni1A75(0), uni1A601A3F(464)
Cluster 18: uni1A20(1910)

The boundary at pos=0 is at x = 0.
The boundary at pos=6 is at x = 1212 + 464 = 1676.
The boundary at pos=7 is at x = 1676 + 1910 = 3586.
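
The arithmetic above can be checked with a short Python sketch that
builds a table from cluster start to x position, using the glyph data
just listed.  The function and variable names are hypothetical, and the
sketch assumes LTR text only.  It also shows which grapheme boundaries
the full rendering says nothing about.

```python
from typing import Dict, List, Tuple


def cluster_start_x(glyphs: List[Tuple[int, int]], text_len: int) -> Dict[int, int]:
    """Map each cluster start (and the end of text) to its x position."""
    xs: Dict[int, int] = {}
    x = 0
    for cluster, advance in glyphs:
        # Record x only the first time a cluster value is seen.
        xs.setdefault(cluster, x)
        x += advance
    xs[text_len] = x
    return xs


# (cluster, advance) pairs for the Tai Tham example; 7 characters of
# 3 bytes each give a text length of 21 bytes.
glyphs = [(0, 1212), (0, 0), (0, 0), (0, 464), (18, 1910)]
xs = cluster_start_x(glyphs, 21)
print(xs)  # {0: 0, 18: 1676, 21: 3586}

# Grapheme cluster boundaries as byte offsets (pos 0, 2, 5, 6, 7).
boundaries = [0, 6, 15, 18, 21]
print([b for b in boundaries if b not in xs])  # [6, 15]
```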

For pos=2, the full rendering gives us no data, because byte offset 6
lies inside HarfBuzz cluster 0.  The simple trick is to render the
string only up to pos=2 and measure that.  I have to admit I do not
know the ins and outs of justification.

When we do this, we get:

Cluster 0: uni1A2F(1212), uni1A60(0)

From this, we may decide that the boundary at pos = 2 is at x=1212.
Note, however, the glyph uni1A60 does not appear in the rendering of
the complete string!
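
The prefix trick might be sketched in Python as below.  To keep the
example self-contained, the shaper is a stand-in that returns the glyph
lists quoted in this message; fake_shape, CANNED and caret_x_by_prefix
are hypothetical names, and a real implementation would call HarfBuzz
on the prefix instead.

```python
from typing import Callable, List, Tuple

# (glyph name, cluster byte offset, advance width)
Glyph = Tuple[str, int, int]
Shaper = Callable[[bytes], List[Glyph]]


def caret_x_by_prefix(shape: Shaper, utf8: bytes, byte_offset: int) -> int:
    """Shape only the text before the boundary and return its total advance."""
    return sum(advance for _, _, advance in shape(utf8[:byte_offset]))


TEXT = "\u1A2F\u1A60\u1A45\u1A60\u1A75\u1A3F\u1A20".encode("utf-8")

# Canned shaping results copied from the message; not a real shaper.
CANNED = {
    TEXT: [("uni1A2F", 0, 1212), ("uni1A601A45", 0, 0), ("uni1A75", 0, 0),
           ("uni1A601A3F", 0, 464), ("uni1A20", 18, 1910)],
    TEXT[:6]: [("uni1A2F", 0, 1212), ("uni1A60", 0, 0)],
}


def fake_shape(utf8: bytes) -> List[Glyph]:
    return CANNED[utf8]


print(caret_x_by_prefix(fake_shape, TEXT, 6))  # 1212

# Glyphs used for the prefix that the full rendering never uses.
full_names = {name for name, _, _ in fake_shape(TEXT)}
prefix_names = {name for name, _, _ in fake_shape(TEXT[:6])}
print(prefix_names - full_names)  # {'uni1A60'}
```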

For pos=5, we repeat the trick and render the string up to pos=5.

We then get:

Cluster 0: uni1A2F(1212), uni1A601A45(0), uni1A60(0), uni25CC(1787),
uni1A75(0)

From this we may decide that the boundary at pos=5 is at x = 1212 +
1787 = 2999.  What has gone severely wrong here is the insertion of the
dreaded dashed circle.  This happens for this string with *old* versions
of HarfBuzz such as the one LibreOffice is clearly using.  My font
clears up the dashed circle when there is a consonant following U+1A60
in some canonically equivalent string of Tai Tham characters, but
leaves it in a case like this because the string is *linguistically
wrong*.

Even with up-to-date HarfBuzz, we still get a glyph for the
substring that does not appear in the full string.  However, the cursor
position would then be calculated as x = 1212, i.e. the same as for the
previous grapheme cluster boundary.  This is not unreasonable, for the
grapheme cluster merely leads to the addition of non-spacing glyphs.

Note that if one just examined the rendering of the string between pos=2
and pos=5, the glyphs uni1A2F(1212), uni1A601A45(0) would be replaced
by uni1A45(1212), yet another glyph which does not appear in the
rendering of the complete string.

I hope this helps.

Richard.