[HarfBuzz] Clarifying graphemes, HB-cluster, carets and string manipulation

Richard Wordingham richard.wordingham at ntlworld.com
Sat Jan 24 06:34:17 PST 2015


On Sat, 24 Jan 2015 13:45:37 +0100
Diederick Huijbers ☾ <diederick at apollomedia.nl> wrote:

> Thanks so much Richard, one question though .... (see below)

Please reply to the list ( HarfBuzz at lists.freedesktop.org ), not just
to me.

> > The ICU positions translate to byte offsets as:

> > Position 0 = Byte offset 0
> > Position 1 = Byte offset 3
> > Position 2 = Byte offset 6
> > Position 3 = Byte offset 9
> > Position 4 = Byte offset 10 (previous character was ASCII space)
> > Position 5 = Byte offset 13
> > Position 6 = Byte offset 16
> > Position 7 = Byte offset 19
> > Position 8 = Byte offset 20
> > Position 9 = Byte offset 23
> > Position 10 = Byte offset 26
> > Position 11 = Byte offset 29 (end of string, so no cluster, no
> > glyphs)

> > The ICU positions are 16-bit word offsets in UTF-16.  I don't know
> > if there is a UTF-8 interface; I believe ICU word segmentation that
> > needs dictionary lookup is broken for UTF-8.
 
> How did you arrive at this mapping? I'm wondering what structs hold
> this information.

If it's precomputed for you, I think that will be done by ICU rather
than by HarfBuzz.

I know the lengths of Unicode characters (by codepoint) in the UTF-8 and
UTF-16 encodings.  I also knew that the HarfBuzz cluster numbers would
be byte offsets, so I checked my workings against them.  I would
generate such a table by stepping through the string, character by
character.  Strictly, one should ensure that the UTF-8 string is
well-formed UTF-8, e.g. no CESU-8 or Latin-1 masquerading as
UTF-8.  I would treat each surrogate code unit (U+D800 to U+DFFF) in
the UTF-16 as corresponding to two UTF-8 bytes, so that a surrogate
pair accounts for the four bytes of its UTF-8 sequence.  If the string
originates as a sequence of characters in UTF-8, there will be no lone
surrogates to create trouble.
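
A minimal sketch of such a table builder, in Python, stepping through
the string one character at a time.  The function name
utf16_to_utf8_offsets and the sample string are hypothetical; the sample
is only chosen to have the same shape as the table quoted above (groups
of three 3-byte characters separated by ASCII spaces), not to reproduce
the original text.  The entry for a trailing surrogate follows the "two
bytes per surrogate" convention, so it points into the middle of a
4-byte sequence and is not a usable character boundary.

```python
def utf16_to_utf8_offsets(text):
    """Map every UTF-16 code unit offset in `text` to a UTF-8 byte offset.

    Returns a list with one entry per UTF-16 code unit, plus a final
    entry for the end of the string.
    """
    offsets = []
    byte_pos = 0
    for ch in text:  # Python iterates over a str by code point
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF:
            # A lone surrogate cannot come from well-formed UTF-8.
            raise ValueError("lone surrogate U+%04X" % cp)
        offsets.append(byte_pos)
        if cp < 0x80:
            byte_pos += 1
        elif cp < 0x800:
            byte_pos += 2
        elif cp < 0x10000:
            byte_pos += 3
        else:
            # Supplementary character: two UTF-16 code units, four UTF-8
            # bytes.  Each surrogate is counted as two bytes, so the
            # trailing surrogate maps to the middle of the sequence.
            offsets.append(byte_pos + 2)
            byte_pos += 4
    offsets.append(byte_pos)  # end of string
    return offsets


# Hypothetical sample: three 3-byte characters, a space, repeated.
sample = "\u0e01\u0e02\u0e03 \u0e01\u0e02\u0e03 \u0e01\u0e02\u0e03"
print(utf16_to_utf8_offsets(sample))
# prints [0, 3, 6, 9, 10, 13, 16, 19, 20, 23, 26, 29]
```

The list index is the UTF-16 position and the value is the byte offset,
so a position reported by ICU can be looked up directly.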

I would test the generation of this conversion table using a mixture of
1-byte, 2-byte and 4-byte characters.
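
One possible way to obtain expected values for such a test, sketched in
Python, is to let the language's own codecs do the counting instead of
hand-computing lengths.  The helper name boundary_offsets and the test
string are hypothetical.  It is slow (it re-encodes every prefix), but
it is independent of any table-building code being tested, and it only
reports positions that fall on character boundaries.

```python
def boundary_offsets(text):
    """Map UTF-16 offsets at character boundaries to UTF-8 byte offsets.

    Slow reference: encodes every prefix of `text` with Python's codecs.
    """
    table = {}
    for i in range(len(text) + 1):
        prefix = text[:i]
        utf16_units = len(prefix.encode("utf-16-le")) // 2
        utf8_bytes = len(prefix.encode("utf-8"))
        table[utf16_units] = utf8_bytes
    return table


# Mixture of 1-byte (a, b), 2-byte (U+00E9) and 4-byte (U+1F600) characters.
mixed = "a\u00e9\U0001F600b"
print(boundary_offsets(mixed))
# prints {0: 0, 1: 1, 2: 3, 4: 7, 5: 8}
```

UTF-16 position 3 is absent from the result because it lies between the
two surrogates of the 4-byte character.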

Richard.

