[HarfBuzz] Clarifying graphemes, HB-cluster, carets and string manipulation
Richard Wordingham
richard.wordingham at ntlworld.com
Sat Jan 24 06:34:17 PST 2015
On Sat, 24 Jan 2015 13:45:37 +0100
Diederick Huijbers ☾ <diederick at apollomedia.nl> wrote:
> Thanks so much Richard, one question though .... (see below)
Please reply to the list ( HarfBuzz at lists.freedesktop.org ), not just
to me.
> > The ICU positions translate to byte offsets as:
> > Position 0 = Byte offset 0
> > Position 1 = Byte offset 3
> > Position 2 = Byte offset 6
> > Position 3 = Byte offset 9
> > Position 4 = Byte offset 10 (previous character was ASCII space)
> > Position 5 = Byte offset 13
> > Position 6 = Byte offset 16
> > Position 7 = Byte offset 19
> > Position 8 = Byte offset 20
> > Position 9 = Byte offset 23
> > Position 10 = Byte offset 26
> > Position 11 = Byte offset 29 (end of string, so no cluster, no
> > glyphs)
> > The ICU positions are offsets counted in 16-bit code units of
> > UTF-16. I don't know if there is a UTF-8 interface; I believe ICU
> > word segmentation that needs dictionary lookup is broken for UTF-8.
> How did you arrive at this mapping? I'm wondering which structs hold
> this information.
If such a table is precomputed for you, I think that will be done by
ICU rather than by HarfBuzz.
I know the lengths of Unicode characters (by codepoint) in the UTF-8 and
UTF-16 encodings. I also knew that the HarfBuzz cluster numbers would
be byte offsets, so I checked my workings that way. I would
generate such a table by stepping through the string, character by
character. Strictly, one should first check that the string is
well-formed UTF-8, e.g. no CESU-8 or Latin-1 masquerading as UTF-8.
On the UTF-16 side, I would treat each surrogate code unit (U+D800 to
U+DFFF) as corresponding to two UTF-8 bytes, so that a surrogate pair
accounts for the four bytes of a supplementary character. If the
string originates as a sequence of characters in UTF-8, there will be
no lone surrogates to create trouble.
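
As a rough illustration of the stepping described above, here is a
minimal Python sketch. It is not ICU or HarfBuzz code; the function
name utf16_to_utf8_offsets is made up for this example, and it assumes
the text arrives as UTF-8 bytes. It walks the UTF-16 code units, adds
the number of UTF-8 bytes each one accounts for, and counts two bytes
per surrogate code unit.

```python
import struct


def utf16_to_utf8_offsets(utf8_bytes: bytes) -> list:
    """Return a table where table[i] is the UTF-8 byte offset that
    corresponds to UTF-16 code-unit offset i.

    Hypothetical helper for illustration only. The table has one entry
    per UTF-16 code unit plus a final entry for the end of the string.
    """
    # Strict decoding raises UnicodeDecodeError on ill-formed input,
    # which rejects CESU-8 and most Latin-1 masquerading as UTF-8.
    text = utf8_bytes.decode("utf-8")

    # Re-encode as UTF-16 and unpack the 16-bit code units.
    raw = text.encode("utf-16-le")
    units = struct.unpack("<%dH" % (len(raw) // 2), raw)

    table = []
    offset = 0
    for unit in units:
        table.append(offset)
        if unit < 0x80:
            offset += 1          # ASCII: one UTF-8 byte
        elif unit < 0x800:
            offset += 2          # two UTF-8 bytes
        elif 0xD800 <= unit <= 0xDFFF:
            offset += 2          # each half of a surrogate pair: 2 of the 4 bytes
        else:
            offset += 3          # rest of the BMP: three UTF-8 bytes
    table.append(offset)         # end-of-string position
    return table
```

For a made-up string with the same shape as the example in this thread
(three 3-byte letters, a space, three more, a space, three more),
utf16_to_utf8_offsets("กขค กขค กขค".encode("utf-8")) gives
[0, 3, 6, 9, 10, 13, 16, 19, 20, 23, 26, 29]. Note that the entry for
the second half of a surrogate pair points into the middle of the
4-byte UTF-8 sequence, so it is not a usable cluster boundary.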
I would test the generation of this conversion table using a mixture of
1-byte, 2-byte and 4-byte characters.
Richard.