[HarfBuzz] Clarifying graphemes, HB-cluster, carets and string manipulation

Diederick Huijbers diederickh at gmail.com
Mon Jan 26 05:12:51 PST 2015


Again, thanks for all the valuable feedback. I've looked into it a bit more
and things are falling into place now, though I couldn't find any concrete
information on what the value of "cluster" in HarfBuzz means. The same goes
for the values returned by ICU's BreakIterator. I'm interpreting them as
byte offsets, which I think is correct.

To get values as byte offsets I had to use a UText instead of a
UnicodeString in combination with the BreakIterator. When using a
UnicodeString, the data I pass into the constructor is converted from UTF-8
to UTF-16, so the values returned by the BreakIterator are UTF-16 offsets
and don't align with the byte offsets in the clusters of HarfBuzz
(hb_glyph_info_t).
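
To illustrate why the two kinds of offsets differ, here is a minimal sketch
that builds the UTF-16 to UTF-8 offset table by stepping through the string
character by character. It uses only the standard library, not ICU. The
function name utf16_to_utf8_offsets is made up for this example, and it
assumes the input is well-formed UTF-8. Mapping the second half of a
surrogate pair to the start of its character is my own choice, since that
position is not a valid boundary anyway.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: table[i] is the UTF-8 byte offset that corresponds
// to UTF-16 code-unit offset i. The last entry is the end of the string.
// Assumes `utf8` is well-formed UTF-8 (no CESU-8, no stray Latin-1 bytes).
std::vector<std::size_t> utf16_to_utf8_offsets(const std::string& utf8) {
    std::vector<std::size_t> table;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::size_t len = 1;             // ASCII
        if (lead >= 0xF0) len = 4;       // needs a surrogate pair in UTF-16
        else if (lead >= 0xE0) len = 3;
        else if (lead >= 0xC0) len = 2;
        assert(i + len <= utf8.size());  // sequence must not be truncated
        table.push_back(i);
        if (len == 4) {
            // Second UTF-16 code unit of the pair: not a real boundary,
            // so it is mapped to the start of the same character.
            table.push_back(i);
        }
        i += len;
    }
    table.push_back(utf8.size());
    return table;
}
```

For a string made of three 3-byte characters, a space, three 3-byte
characters, a space and three more 3-byte characters, this yields
0, 3, 6, 9, 10, 13, 16, 19, 20, 23, 26, 29, which is the same shape as the
mapping Richard listed in the quoted message below.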

My current thinking on how to calculate the caret position is as follows.
Let's say I want to position the caret just before the 2nd grapheme:

- find the byte offset of the 2nd grapheme (using BreakIterator)
- find the HB-cluster to which the grapheme belongs, based on the
  byte offset
- using the start and end byte offsets of the cluster, check how many
  graphemes are part of the HB-cluster. We divide the x_advance by this
  number so we know how much we need to move the cursor per grapheme in
  the cluster.
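
The steps above could be sketched roughly like this. This is not tested
against real HarfBuzz output. ShapedGlyph and caret_x are names made up for
the example. The struct only mirrors the cluster field of hb_glyph_info_t
and the x_advance field of hb_glyph_position_t. The sketch assumes
left-to-right text, cluster values that never decrease along the glyph run,
and a sorted list of grapheme boundaries (as byte offsets, including 0 and
the text length) obtained elsewhere, e.g. from the BreakIterator.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical mirror of the two HarfBuzz fields used here.
struct ShapedGlyph {
    std::uint32_t cluster;   // byte offset, like hb_glyph_info_t::cluster
    std::int32_t x_advance;  // like hb_glyph_position_t::x_advance
};

// Returns the x offset of a caret placed just before the grapheme that
// starts at byte offset `target`.
// Assumptions: LTR run, non-decreasing clusters, `boundaries` is sorted,
// contains `target`, and holds every grapheme start plus `text_len`.
double caret_x(const std::vector<ShapedGlyph>& glyphs,
               const std::vector<std::uint32_t>& boundaries,
               std::uint32_t text_len,
               std::uint32_t target) {
    assert(std::binary_search(boundaries.begin(), boundaries.end(), target));
    double x = 0.0;
    std::size_t i = 0;
    while (i < glyphs.size()) {
        // Collect all glyphs that share this cluster value.
        const std::uint32_t start = glyphs[i].cluster;
        double advance = 0.0;
        std::size_t j = i;
        while (j < glyphs.size() && glyphs[j].cluster == start) {
            advance += glyphs[j].x_advance;
            ++j;
        }
        // The cluster covers the bytes [start, end).
        const std::uint32_t end = (j < glyphs.size()) ? glyphs[j].cluster : text_len;
        if (target >= end) {
            x += advance;  // whole cluster lies before the caret
        } else if (target > start) {
            // Caret falls inside this cluster: split its advance evenly
            // over the graphemes that start inside [start, end).
            auto first = std::lower_bound(boundaries.begin(), boundaries.end(), start);
            auto mid = std::lower_bound(first, boundaries.end(), target);
            auto last = std::lower_bound(mid, boundaries.end(), end);
            x += advance * double(mid - first) / double(last - first);
            break;
        } else {
            break;  // caret sits exactly at the start of this cluster
        }
        i = j;
    }
    return x;
}
```

As an example, take the text "affib" where "ffi" is shaped into a single
ligature glyph: glyphs {0, 10}, {1, 30}, {4, 10} and grapheme boundaries
0, 1, 2, 3, 4, 5. The caret before byte offset 2 then ends up at
10 + 30/3 = 20, and the caret before byte offset 3 at 10 + 2*30/3 = 30.
Splitting the advance evenly is only an approximation of where the
individual letters sit inside the ligature.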


I created an image that clarifies the meaning of graphemes, glyphs,
clusters and the byte values. You can find the image here:


https://www.flickr.com/photos/diederick/15749726814/


Just wanted to share this approach and hopefully get some feedback.


Best,
D


On Sat, Jan 24, 2015 at 3:43 PM, Diederick Huijbers ☾ <
diederick at apollomedia.nl> wrote:

> Hi Richard,
>
> It seems that gmail automatically replied to your email address, not to
> the list.
>
> I'll paste my message here again:
>
> ----
>
> I've posted some test code which uses FreeType to load a font,
> HarfBuzz for shaping and ICU to get the graphemes. This is all
> experimental and I cannot verify if my code is the best/correct way.
>
> But this is a start that I'm using to calculate the caret offset for
> strings with ligatures. It does not yet contain the code to do this.
>
>          https://gist.github.com/roxlu/da3251cb2045823922fa
>
> Needs to link with ICU, Freetype and Harfbuzz.
>
> D.
>
> ---
>
> Thanks for your answer;  I see how I can arrive at the byte offsets when
> thinking about it, but not how to use ICU / Harfbuzz.
>
>
>
>
> On Sat, Jan 24, 2015 at 3:34 PM, Richard Wordingham <
> richard.wordingham at ntlworld.com> wrote:
>
>> On Sat, 24 Jan 2015 13:45:37 +0100
>> Diederick Huijbers ☾ <diederick at apollomedia.nl> wrote:
>>
>> > Thanks so much Richard, one question though .... (see below)
>>
>> Please reply to the list ( HarfBuzz at lists.freedesktop.org ), not just
>> to me.
>>
>> > > The ICU positions translate to byte offsets as:
>>
>> > > Position 0 = Byte offset 0
>> > > Position 1 = Byte offset 3
>> > > Position 2 = Byte offset 6
>> > > Position 3 = Byte offset 9
>> > > Position 4 = Byte offset 10 (previous character was ASCII space)
>> > > Position 5 = Byte offset 13
>> > > Position 6 = Byte offset 16
>> > > Position 7 = Byte offset 19
>> > > Position 8 = Byte offset 20
>> > > Position 9 = Byte offset 23
>> > > Position 10 = Byte offset 26
>> > > Position 11 = Byte offset 29 (end of string, so no cluster, no
>> > > glyphs)
>>
>> > > The ICU positions are 16-bit word offsets in UTF-16.  I don't know
>> > > if there is a UTF-8 interface; I believe ICU word segmentation that
>> > > needs dictionary lookup is broken for UTF-8.
>>
>> > How did you arrive at this mapping? I'm wondering what structs hold
>> > this information.
>>
>> If it's precomputed for you, I think that will be done by ICU rather
>> than by HarfBuzz.
>>
>> I know the lengths of Unicode characters (by codepoint) in the UTF-8 and
>> UTF-16 encodings.  I also knew that the HarfBuzz cluster numbers would
>> be byte offsets, so I checked my workings that way.  I would
>> generate such a table by stepping through the string, character by
>> character. Strictly, one should ensure that the UTF-8 string consists
>> only of UTF-8 characters, e.g. no CESU-8 or Latin-1 masquerading as
>> UTF-8.  I would treat surrogate codepoints (U+D800 to U+DFFF) as
>> corresponding to two UTF-8 bytes. If the string originates as a
>> sequence of characters in UTF-8, there will be no lone surrogates to
>> create trouble.
>>
>> I would test the generation of this conversion table using a mixture of
>> 1-byte, 2-byte and 4-byte characters.
>>
>> Richard.