<div dir="ltr"><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Thanks again for all the valuable feedback. I've looked into it a bit more </div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">and things are falling into place now, though I couldn't find any concrete </div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">information on what the "cluster" value in HarfBuzz means, or on the </div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">value returned by ICU's BreakIterator. I'm interpreting both as </div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">byte offsets, which I think is correct.</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">To get values as byte offsets I had to use a UText instead of a UnicodeString</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">in combination with the BreakIterator. When using a UnicodeString, the data I </div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">pass into the constructor is converted from UTF-8 to UTF-16, so the values </div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">returned by the BreakIterator are UTF-16 offsets and don't align with the byte offsets in the </div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">clusters of HarfBuzz (hb_glyph_info_t). 
</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">My current thinking on how to calculate the caret position is as follows.</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><i>Let's say I want to position the caret just before the 2nd grapheme: </i></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><blockquote style="margin:0px 0px 0px 40px;border:none;padding:0px"><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">- find the byte offset of the 2nd grapheme (using BreakIterator)</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">- find the HB-cluster to which the grapheme belongs, based on the byte offset</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">- using the start and end byte offsets of the cluster, count how many graphemes <br>are part of the HB-cluster. Dividing the cluster's x_advance by this number gives the distance the cursor has to move per grapheme in the cluster.<br></div></blockquote><font face="arial, helvetica, sans-serif"><div><font face="arial, helvetica, sans-serif"><br></font></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif;display:inline">I created an image that clarifies the meaning of graphemes, glyphs, clusters and the byte values. 
You can find the image here:</div></font><div><font face="arial, helvetica, sans-serif"><div class="gmail_default" style="font-family:arial,helvetica,sans-serif;display:inline"><br></div></font></div><div><font face="arial, helvetica, sans-serif"><div class="gmail_default" style="font-family:arial,helvetica,sans-serif;display:inline"> </div><a href="https://www.flickr.com/photos/diederick/15749726814/">https://www.flickr.com/photos/diederick/15749726814/</a></font></div><div><span style="font-family:arial,helvetica,sans-serif"><br></span></div><div><span style="font-family:arial,helvetica,sans-serif"></span></div><div><font face="arial, helvetica, sans-serif"><div class="gmail_default" style="font-family:arial,helvetica,sans-serif;display:inline">Just wanted to share this approach and hopefully get some feedback.</div><br></font></div><div><font face="arial, helvetica, sans-serif"><div class="gmail_default" style="font-family:arial,helvetica,sans-serif;display:inline"><br></div></font></div><div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Best</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">D</div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Sat, Jan 24, 2015 at 3:43 PM, Diederick Huijbers ☾ <span dir="ltr"><<a href="mailto:diederick@apollomedia.nl" target="_blank">diederick@apollomedia.nl</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Hi Richard, </div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">It seems that gmail automatically replied to your email address, not to the list. 
</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">I'll paste my message here again:</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">----</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><div class="gmail_default" style="color:rgb(0,0,0);font-size:12.8000001907349px">I've posted some test code which uses FreeType to load a font,</div><div class="gmail_default" style="color:rgb(0,0,0);font-size:12.8000001907349px">HarfBuzz for shaping and ICU to get the graphemes. This is all</div><div class="gmail_default" style="color:rgb(0,0,0);font-size:12.8000001907349px">experimental and I cannot verify whether my code is the best/correct way. </div><div class="gmail_default" style="color:rgb(0,0,0);font-size:12.8000001907349px"><br></div><div class="gmail_default" style="color:rgb(0,0,0);font-size:12.8000001907349px">But this is a start that I'm using to calculate the caret offset for </div><div class="gmail_default" style="color:rgb(0,0,0);font-size:12.8000001907349px">strings with ligatures. 
It does not yet contain the code to do this.</div><div class="gmail_default" style="color:rgb(0,0,0);font-size:12.8000001907349px"><br></div><div class="gmail_default" style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:12.8000001907349px"><font face="arial, helvetica, sans-serif"> <a href="https://gist.github.com/roxlu/da3251cb2045823922fa" target="_blank">https://gist.github.com/roxlu/da3251cb2045823922fa</a></font><br></div><div class="gmail_default" style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:12.8000001907349px"><font face="arial, helvetica, sans-serif"><br></font></div><div class="gmail_default" style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:12.8000001907349px"><font face="arial, helvetica, sans-serif">Needs to link with ICU, FreeType and HarfBuzz. </font></div><div class="gmail_default" style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:12.8000001907349px"><font face="arial, helvetica, sans-serif"><br></font></div><div class="gmail_default" style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:12.8000001907349px"><font face="arial, helvetica, sans-serif">D.</font></div><div class="gmail_default" style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:12.8000001907349px"><span style="font-family:arial,helvetica,sans-serif;font-size:12.8000001907349px"><br></span></div><div class="gmail_default" style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:12.8000001907349px"><span style="font-family:arial,helvetica,sans-serif;font-size:12.8000001907349px">---</span><br></div></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Thanks for your answer; I see how I can arrive at the byte offsets when</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">thinking about it, but not how to get them from ICU / HarfBuzz. 
</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><div class="gmail_default" style="color:rgb(0,0,0);font-size:12.8000001907349px"><br></div></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Sat, Jan 24, 2015 at 3:34 PM, Richard Wordingham <span dir="ltr"><<a href="mailto:richard.wordingham@ntlworld.com" target="_blank">richard.wordingham@ntlworld.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On Sat, 24 Jan 2015 13:45:37 +0100<br>
Diederick Huijbers ☾ <<a href="mailto:diederick@apollomedia.nl" target="_blank">diederick@apollomedia.nl</a>> wrote:<br>
<br>
> Thanks so much Richard, one question though .... (see below)<br>
<br>
Please reply to the list ( <a href="mailto:HarfBuzz@lists.freedesktop.org" target="_blank">HarfBuzz@lists.freedesktop.org</a> ), not just<br>
to me.<br>
<br>
> > The ICU positions translate to byte offsets as:<br>
<br>
> > Position 0 = Byte offset 0<br>
> > Position 1 = Byte offset 3<br>
> > Position 2 = Byte offset 6<br>
> > Position 3 = Byte offset 9<br>
> > Position 4 = Byte offset 10 (previous character was ASCII space)<br>
> > Position 5 = Byte offset 13<br>
> > Position 6 = Byte offset 16<br>
> > Position 7 = Byte offset 19<br>
> > Position 8 = Byte offset 20<br>
> > Position 9 = Byte offset 23<br>
> > Position 10 = Byte offset 26<br>
> > Position 11 = Byte offset 29 (end of string, so no cluster, no<br>
> > glyphs)<br>
<br>
> > The ICU positions are 16-bit word offsets in UTF-16. I don't know<br>
> > if there is a UTF-8 interface; I believe ICU word segmentation that<br>
> > needs dictionary lookup is broken for UTF-8.<br>
<br>
> How did you arrive at this mapping? I'm wondering which structs hold<br>
> this information.<br>
<br>
If it's precomputed for you, I think that will be done by ICU rather<br>
than by HarfBuzz.<br>
<br>
I know the lengths of Unicode characters (by codepoint) in the UTF-8 and<br>
UTF-16 encodings. I also knew that the HarfBuzz cluster numbers would<br>
be byte offsets, so I checked my workings that way. I would<br>
generate such a table by stepping through the string, character by<br>
character. Strictly, one should ensure that the UTF-8 string consists<br>
only of UTF-8 characters, e.g. no CESU-8 or Latin-1 masquerading as<br>
UTF-8. I would treat surrogate codepoints (U+D800 to U+DFFF) as<br>
corresponding to two UTF-8 bytes. If the string originates as a<br>
sequence of characters in UTF-8, there will be no lone surrogates to<br>
create trouble.<br>
<br>
I would test the generation of this conversion table using a mixture of<br>
1-byte, 2-byte and 4-byte characters.<br>
<br>
Richard.<br>
_______________________________________________<br>
HarfBuzz mailing list<br>
<a href="mailto:HarfBuzz@lists.freedesktop.org" target="_blank">HarfBuzz@lists.freedesktop.org</a><br>
<a href="http://lists.freedesktop.org/mailman/listinfo/harfbuzz" target="_blank">http://lists.freedesktop.org/mailman/listinfo/harfbuzz</a><br>
</blockquote></div><br><br clear="all"><div><br></div>-- <br><div><div dir="ltr"><div><font color="#000099">Apollo +++++++++ </font></div><div><font color="#000099">Interactive Media</font></div><div><font color="#000099">+++++++++++++++ </font></div><div><font color="#000099">Diederick Huijbers === </font></div><div><font color="#000099"><a href="mailto:diederick@apollomedia.nl" target="_blank">diederick@apollomedia.nl</a></font></div><div><font color="#000099">==================== </font></div><div><font color="#000099">Zeeburgerpad 74 ::::::::</font></div><div><font color="#000099">1019 AD Amsterdam </font></div><div><font color="#000099">mobile 06 - 12 44 09 22</font></div><div><font color="#000099">phone 020 - 707 78 96 </font></div><div><font color="#000099">//\\//\\//\\//\\//\\//\\//\\//\\//\\ </font></div><div><font color="#000099"><a href="http://www.apollomedia.nl" target="_blank">www.apollomedia.nl</a> +++ </font></div><div><font color="#000099">++++++++++++++++</font></div></div></div>
</div>
</blockquote></div><br></div>
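<div>The caret-positioning steps described at the top of this message could be sketched roughly as follows. This is only an illustration under stated assumptions, not tested against real HarfBuzz or ICU output: the <code>Cluster</code> record, the <code>caret_x</code> function and the sample numbers are hypothetical, cluster values are assumed to be UTF-8 byte offsets, and the grapheme boundaries are assumed to come from a BreakIterator running over a UText of the same UTF-8 bytes.</div>
<pre>
```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """One shaped cluster (hypothetical record, not a HarfBuzz type)."""
    start: int        # byte offset of the cluster's first byte
    end: int          # byte offset just past the cluster's last byte
    x_origin: float   # pen x position where the cluster begins
    x_advance: float  # sum of the x_advance values of the cluster's glyphs


def caret_x(offset, clusters, boundaries):
    """Estimate the caret x position just before the grapheme at `offset`.

    `boundaries` holds the byte offsets at which graphemes start.
    Assumes left-to-right text and clusters sorted by byte offset.
    """
    for c in clusters:
        if c.start <= offset < c.end:
            # Graphemes that begin inside this cluster.
            inside = [b for b in boundaries if c.start <= b < c.end]
            if offset not in inside:
                raise ValueError("offset is not a grapheme boundary")
            # Share the cluster's advance evenly between its graphemes.
            step = c.x_advance / len(inside)
            return c.x_origin + step * inside.index(offset)
    # Offset past the last cluster: the caret goes at the end of the text.
    last = clusters[-1]
    return last.x_origin + last.x_advance


# Made-up numbers for "office", shaped with an "ffi" ligature.
clusters = [
    Cluster(0, 1, 0.0, 10.0),   # "o"
    Cluster(1, 4, 10.0, 18.0),  # "ffi" ligature: three graphemes, one glyph
    Cluster(4, 5, 28.0, 8.0),   # "c"
    Cluster(5, 6, 36.0, 9.0),   # "e"
]
boundaries = [0, 1, 2, 3, 4, 5]

# Caret just before the second "f", which sits inside the ligature cluster.
print(caret_x(2, clusters, boundaries))
```
</pre>
<div>Dividing the advance evenly is an approximation, since the parts of a ligature need not be equally wide, and right-to-left runs would need the step applied in the opposite direction.</div>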