[HarfBuzz] Question on converting UTF-8 codepoints to complex glyphs

Thu Apr 11 21:24:34 UTC 2019

Thanks, Richard, for the pointer. I wish I had seen Jonathan's post, but it never appeared in the digest I received from the list, nor was it sent to me directly. To be fair, the following is from the HarfBuzz tutorial, on the "Why do I need a shaping engine?" page: "For example, in Tamil, when the letter "TTA" (ட) letter is followed by "U" (உ), the pair must be replaced by the single glyph "டு". The sequence of Unicode characters "டஉ" needs to be substituted with a single "டு" glyph from the font." So maybe that page needs an edit.

I changed my UTF-8 string to [0xE0, 0xAE, 0x88, 0xE0, 0xAE, 0x9F, 0xE0, 0xAF, 0x81] and finally got back the correct glyph identifiers. So thank you all for your responses. I'm sure I'll have more questions as this project evolves.
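
For anyone reading this in the archive, here is a minimal sketch of how that byte sequence can be checked. It uses only the Python standard library and does not touch HarfBuzz. The variable names (`utf8_bytes`, `codepoints`) are just for illustration. It decodes the bytes and lists each codepoint with its Unicode name, which shows that the last two codepoints are TTA followed by the vowel sign U rather than the independent letter U.

```python
import unicodedata

# The byte sequence quoted in the message above.
utf8_bytes = bytes([0xE0, 0xAE, 0x88, 0xE0, 0xAE, 0x9F, 0xE0, 0xAF, 0x81])

# Decode the UTF-8 bytes into a string of Unicode characters.
text = utf8_bytes.decode("utf-8")

# Pair each character's codepoint (as U+XXXX) with its Unicode name.
codepoints = [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

for cp, name in codepoints:
    print(cp, name)
# U+0B88 TAMIL LETTER II
# U+0B9F TAMIL LETTER TTA
# U+0BC1 TAMIL VOWEL SIGN U
```

The nine bytes are three codepoints of three bytes each, so this is the kind of check worth running before handing a buffer to the shaper.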

-----Original Message-----
From: Richard Wordingham <richard.wordingham at ntlworld.com> 
Sent: April 11, 2019 12:16 PM
To: harfbuzz at lists.freedesktop.org
Cc: Paul Daughetee <Daughetee at finaldraft.com>
Subject: Re: [HarfBuzz] Question on converting UTF-8 codepoints to complex glyphs

On Thu, 11 Apr 2019 18:03:10 +0000
Paul Daughetee <Daughetee at finaldraft.com> wrote:

>  டு  [...]
> is the ligature formed by the codepoints corresponding to the glyphs ட 
> and உ.

No!  You have already been told this by Jonathan Kew.

டு is the codepoint sequence <U+0B9F TAMIL LETTER TTA, U+0BC1 TAMIL
VOWEL SIGN U>; it is **not** the ligature of ட <U+0B9F TAMIL LETTER TTA>
and உ <U+0B89 TAMIL LETTER U>.  If you don't believe me, paste them into
Word and use Alt+X to convert the characters to their codepoints.
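
The same check can be done without Word. The following is only a sketch using the Python standard library; the helper `describe` and the variable names are made up for illustration. It prints the codepoints of both sequences and confirms that Unicode normalization does not turn the two-letter sequence into the syllable.

```python
import unicodedata

def describe(s):
    """Return 'U+XXXX NAME' for every character in s (illustrative helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in s]

syllable = "\u0B9F\u0BC1"     # TTA + VOWEL SIGN U (the sequence that renders as the syllable)
two_letters = "\u0B9F\u0B89"  # TTA + independent LETTER U (two separate letters)

print(describe(syllable))
print(describe(two_letters))

# NFC normalization leaves the two-letter sequence unchanged, so the two
# strings remain different inputs to any shaping engine.
print(unicodedata.normalize("NFC", two_letters) == syllable)  # False
```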

Richard.