[HarfBuzz] Indic shaping: Tone Marks, Visargas, and insert_dotted_circles() Considered Harmful

David M. Jones dmj at dmj.ams.org
Sat Jun 13 09:47:56 PDT 2015


Before I submit a formal bug report, can anyone comment on whether I'm
on the right track here?

Consider this output from hb-shape (edited slightly for readability):

    lt-hb-shape (HarfBuzz) 0.9.40
    Available shapers: ot,fallback

    (त॑ः)
    <U+0924,U+0951,U+0903>
    [ {"g":"dTa",      "cl":0,"dx":0,"dy":0,"ax":573,"ay":0},
      {"g":"dUdatta",  "cl":0,"dx":0,"dy":0,"ax":0,"ay":0},
      {"g":"BASE",     "cl":0,"dx":0,"dy":0,"ax":724,"ay":0},
      {"g":"dVisarga", "cl":0,"dx":0,"dy":0,"ax":262,"ay":0} ]

BASE is the glyph name for U+25CC DOTTED CIRCLE in Murty Hindi [1],
but similar results obtain with other fonts.

Clearly, the dotted circle isn't wanted here.  As far as I can tell
from poking around in the source code, it's the result of two
interacting problems.

First, it looks like the Indic shaping module has a faulty model [2]
for analyzing Indic syllables -- at the very least, it doesn't
incorporate tone marks in Devanagari syllables correctly.  Rather than
treating the above sequence as a consonant_syllable, it's treating it
as a broken_cluster (I admit I don't quite follow the code, so I might
have the details wrong).  That, in turn, causes insert_dotted_circles()
to come into play.
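
To make the suspected failure mode concrete, here is a toy sketch in
Python.  It is not HarfBuzz's actual state machine: the one-letter
classes, the two patterns, and the segment() helper are all
hypothetical, chosen only to show how a grammar that has no slot for a
tone mark before a syllable modifier would split the cluster.

```python
import re

# Hypothetical one-letter classes; not HarfBuzz's real category table.
CLASS = {
    "\u0924": "C",  # DEVANAGARI LETTER TA: consonant
    "\u093e": "M",  # DEVANAGARI VOWEL SIGN AA: matra
    "\u0903": "S",  # DEVANAGARI SIGN VISARGA: syllable modifier
    "\u0951": "A",  # DEVANAGARI STRESS SIGN UDATTA: tone mark
}

# Assumed model: a tone mark may only follow the syllable modifier.
STRICT = re.compile(r"CM?S?A?")
# Assumed fix: a tone mark may also precede the syllable modifier.
RELAXED = re.compile(r"CM?A?S?A?")


def segment(text, syllable):
    """Split text into (kind, chunk) pairs using a toy syllable pattern."""
    classes = "".join(CLASS[ch] for ch in text)
    pieces, pos = [], 0
    while pos < len(classes):
        m = syllable.match(classes, pos)
        if m:
            end = m.end()
            kind = "consonant_syllable"
        else:
            # Anything that cannot start a syllable is gathered into a
            # broken cluster that runs up to the next consonant.
            end = pos + 1
            while end < len(classes) and classes[end] != "C":
                end += 1
            kind = "broken_cluster"
        pieces.append((kind, text[pos:end]))
        pos = end
    return pieces


print(segment("\u0924\u0951\u0903", STRICT))
print(segment("\u0924\u0951\u0903", RELAXED))
```

Under the STRICT pattern the visarga is left over as a broken cluster
of its own, which is consistent with the dotted circle appearing
immediately before dVisarga in the output above.  Under the RELAXED
pattern the whole sequence is a single consonant syllable.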

So, one possible solution would be to update the
indic_syllable_machine and related code with information from the
Unicode 7.0.0 version of IndicSyllabicCategory.txt.  On the one hand,
this might well fix some other problems.  On the other hand, if there
aren't other known issues with the syllable analysis, it could be
risky to mess around too much.
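
As a rough sketch of what pulling categories out of that file could
look like, the Python below parses lines in the usual UCD
"range ; value # comment" layout.  The SAMPLE lines are written by
hand for illustration and are an assumption about the file's
contents, not a copy of it; parse_ucd_ranges() is a hypothetical
helper, not anything that exists in HarfBuzz.

```python
def parse_ucd_ranges(lines):
    """Map each code point to its property value.

    Expects lines of the form "XXXX ; Value" or "XXXX..YYYY ; Value",
    with optional "#" comments, as used by UCD data files.
    """
    table = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()
        if not data:
            continue  # blank line or comment-only line
        code_range, value = (field.strip() for field in data.split(";"))
        first, _, last = code_range.partition("..")
        for cp in range(int(first, 16), int(last or first, 16) + 1):
            table[cp] = value
    return table


# Hand-written sample in UCD layout (assumed values, for illustration).
SAMPLE = [
    "# illustrative sample, not the real file",
    "0903          ; Visarga",
    "0915..0939    ; Consonant",
    "0951..0952    ; Cantillation_Mark",
]

table = parse_ucd_ranges(SAMPLE)
for cp in (0x0924, 0x0951, 0x0903):
    print(f"U+{cp:04X}", table.get(cp, "(unlisted)"))
```

A table built this way from the real file could then be compared
against the categories the syllable machine currently assigns.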

Another approach would be to say that the real problem is
insert_dotted_circles(), which implements an aggressive form of the
"Show Hidden" fallback rendering strategy described in section 5.13 of
the Unicode 7.0 standard (pages 220-221).  I can think of two
arguments against that strategy:

1) It violates the final paragraph of section 5.13:

       In a degenerate case, a nonspacing mark occurs as the first
       character in the text or is separated from its base character
       by a line separator, paragraph separator, or other format
       character that causes a positional separation. This result is
       called a defective combining character sequence (see Section
       3.6, Combination). Defective combining character sequences
       should be rendered as if they had a no-break space as a base
       character. (See Section 7.9, Combining Marks.)

2) Even in contexts not covered by that paragraph, I'd argue that as a
   quality-of-implementation issue, the "Simple Overlap" method would
   be preferable.  After all, if I want a dotted circle, I can easily
   add it myself.
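
For comparison, the fallback described in the quoted paragraph can be
sketched at the text level in a few lines of Python.  This is only an
illustration under simplifying assumptions: add_nbsp_bases() is a
hypothetical helper, the SEPARATORS set is a deliberately small
stand-in for "characters that cause a positional separation", and a
real shaper would work on its glyph buffer rather than on a string.

```python
import unicodedata

NBSP = "\u00a0"
# Simplifying assumption: only these characters separate a mark from
# its base.  The standard's wording covers more cases than this.
SEPARATORS = {"\n", "\r", "\u2028", "\u2029"}


def add_nbsp_bases(text):
    """Insert NO-BREAK SPACE before marks that have no base character."""
    out = []
    prev = None
    for ch in text:
        # General categories Mn, Mc, and Me are the combining marks.
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and (prev is None or prev in SEPARATORS):
            out.append(NBSP)
        out.append(ch)
        prev = ch
    return "".join(out)


# The sequence from the hb-shape run has a base, so it is unchanged.
print(add_nbsp_bases("\u0924\u0951\u0903") == "\u0924\u0951\u0903")
# A mark at the start of the text gets an invisible base instead of a
# dotted circle.
print(add_nbsp_bases("\u0951") == "\u00a0\u0951")
```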

Also, note that the visarga is rather odd, being a *spacing*
combining character, so it's not clear to me that adding a dotted
circle to it is appropriate even under the "Show Hidden" policy.
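
The difference between the two marks can be checked against the
Unicode Character Database with Python's standard unicodedata module.
The describe() helper is just a convenience for this example, and the
values reported depend on the Unicode version bundled with the Python
in use.

```python
import unicodedata


def describe(text):
    """Return (code point, general category, combining class) per char."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch),
         unicodedata.combining(ch))
        for ch in text
    ]


for code, category, ccc in describe("\u0924\u0951\u0903"):
    print(code, category, ccc)
```

The udatta (U+0951) has general category Mn (nonspacing mark), while
the visarga (U+0903) has Mc (spacing combining mark), which matches
the nonzero advance reported for dVisarga in the hb-shape output.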

Cheers,
David.

NOTES

[1] http://www.murtylibrary.com/mcli-fonts.php

[2] This wouldn't be surprising, especially if the model is based on
    "Developing OpenType Fonts for Devanagari Script" (May 2008,
    http://www.microsoft.com/typography/OpenTypeDev/devanagari/intro.htm),
    which seems vague and incomplete to me, and on pre-Unicode 7.0.0
    versions of IndicSyllabicCategory.txt, which omit the udatta and
    anudatta completely.

