[HarfBuzz] dotted circle is not appearing for dependant vowel

Tue Jul 24 06:18:33 PDT 2012

On 24/7/12 12:51, Shriramana Sharma wrote:
> On Tue, Jul 24, 2012 at 3:26 PM, Pravin Satpute <psatpute at redhat.com> wrote:
>>
>>     I see the dotted circle is still not appearing with dependant vowels
>> (U+093f), Is this intentionally?
>>     Might be since you are removing test cases generating dotted circle
>> in Uniscribe before running it with harfbuzz.
>
> May I take this opportunity to record what I have long felt on the
> topic of dotted circles.
>
> I feel that dotted circles should not be displayed except when not
> doing so can cause non-canonically-equivalent encoded sequences to
> appear the same. That is, they should be displayed only to distinguish
> between such sequences. (This is to protect against phishing and
> such.)

I don't think phishing protection is the responsibility of a shaping 
engine. There are far too many completely legitimate sequences (in both 
"complex" and "simple" scripts) that can be visually confusable.

> For example, the long vowel आ does not have a decomposition to अ+ ा
> whereas it would appear the same as the latter if there is no dotted
> circle. There are many such "do not use" recommendations for
> independent vowels in the Indic Unicode chapters because of the
> absences of canonical equivalences (unfortunate IMO but well....).

Software designed for phishing protection might indeed want to guard 
against such sequences (among many other things); however, I don't think 
this is the shaping engine's job.
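
The non-equivalence mentioned above is easy to check outside the shaper. A minimal sketch, assuming Python's standard unicodedata module and comparing NFD forms; the helper name canonically_equivalent is made up for illustration:

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """Return True if both strings have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

AA = "\u0906"                # DEVANAGARI LETTER AA (no canonical decomposition)
A_PLUS_AA = "\u0905\u093e"   # DEVANAGARI LETTER A + DEVANAGARI VOWEL SIGN AA

# The pair from the quoted example: not canonically equivalent.
print(canonically_equivalent(AA, A_PLUS_AA))

# For contrast, U+0929 does decompose canonically to U+0928 U+093C.
print(canonically_equivalent("\u0929", "\u0928\u093c"))
```

A check of this kind belongs in whatever software is doing the phishing protection; it does not need any help from the shaping engine.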

> Reordrant vowels like ि are also likewise, because in the case of a
> sequence अिक mistakenly typed (or maliciously introduced) for अकि if
> there is no dotted circle the two sequences would appear the same

This isn't a particularly good example. In my email client, neither of 
them shows a dotted circle, but neither do they look the same. The first 
one displays the i-matra to the left of the full a-vowel; the second 
displays it between the a-vowel and the ka. This seems like a perfectly 
reasonable way to render the two sequences. If there are use cases (as 
has already been mentioned) for multiple vowel matras on a single base 
consonant, why shouldn't there also be use cases for vowel matras placed 
on a full vowel letter as their base?

A pair that could be more problematic would be कि / िक (U+0915 U+093F /
U+093F U+0915). These do display identically here where I'm typing
(although many systems doubtless insert a dotted circle in the second
case).
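
Because the two sequences can look the same on screen, it helps to list the underlying code points. A small sketch using only Python's standard unicodedata module; the function name describe is just an illustrative choice:

```python
import unicodedata

def describe(text: str) -> list:
    """List each character as 'U+XXXX NAME' in logical (stored) order."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

# Correctly encoded: consonant first, then the matra.
print(describe("\u0915\u093f"))

# "Visually encoded": matra typed first, then the consonant.
print(describe("\u093f\u0915"))
```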

> which is not appropriate from a security viewpoint as they are not
> canonically equivalent.
>
> My point is, there may be many reasons for unexpected combinations of
> characters in Indic. Vedic texts are one. Minority orthographies
> (which may use rare combinations of vowel signs and diacritics) are
> another. Legitimate creative use (like काााााा) for "kaaaa" (a shout)
> is yet another. Imposing a limited orthography (i.e. only recognizing
> a certain set of patterns of sequences and producing dotted circles
> for sequences that do not fit the pattern) would preclude the
> usefulness of the rendering system to users of such cases.
>
> Of course, this usability can also be achieved by first imposing a
> generic orthography (i.e. script grammar) and later adding more
> recognized sequences as per user community request. (This is also much
> easier to produce and deliver to the community in open source
> ecosystems than in proprietary ones.)
>
> This would be advisable since it may be difficult to predict which
> sequences in Indic would be confusable, especially with non-spacing
> marks. For example, तु and तुु would be confusable if there is no
> dotted circle and the second ु is overlaid upon the first.

A careful font designer can address examples like this by providing 
mark-to-mark positioning rules that will make multiple copies of the 
same mark "stack" rather than simply overprint each other.

Of course, not every font designer will be so careful. But then, not 
every Latin-script font adequately distinguishes 'I', 'l', and '1', 
either. We can't expect shaping engines to somehow make up for visual 
ambiguities in font designs.

>
> But these sequences are not self-obvious, so it appears creating
> regexs for sequences where dotted circles should *not* be produced
> might be easier than to do so where they *should* be produced and it
> would be appropriate to err on the side of caution.

IMO, "to err on the side of caution" in the matter of dotted-circle 
insertion means that we should avoid the risk of blocking a use case 
that someone might someday want, even if we can't anticipate that 
particular need. So, for example, even though we may not be aware of any 
current need for a sequence such as "अिुा", there's no compelling reason 
for a shaping engine to insert dotted circles into it and thus make it 
impossible for a user to encode and render an a-vowel with these three 
matras placed around it.

In general, I think the Indic shaper should *not* insert dotted circles.
The one exception that I think may be desirable would be the case of
left-reordrant matras when no usable base character (either consonant or
vowel letter, or other "placeholder" such as an explicit U+25CC or a
space, no-break space, etc.) can be found. In this case, inserting a
dotted circle (or a space?) to act as the base, and then reordering the
matra to the left of it, may be the best option, so that a "visually
encoded" sequence िक does not appear identical to the correctly-encoded कि.
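
The rule proposed above can be sketched roughly as follows. This is only an illustration in Python, not HarfBuzz code: the names LEFT_MATRAS, PLACEHOLDERS and insert_placeholders are hypothetical, the matra set is a deliberately tiny subset, and treating any character of general category Lo as a usable base is a simplifying assumption. It models only the insertion step, not the visual reordering that follows.

```python
import unicodedata

DOTTED_CIRCLE = "\u25cc"
LEFT_MATRAS = {"\u093f"}                    # illustrative subset: DEVANAGARI VOWEL SIGN I
PLACEHOLDERS = {DOTTED_CIRCLE, " ", "\u00a0"}

def is_base(ch: str) -> bool:
    """Assumption: explicit placeholders and any 'Lo' letter can act as a base."""
    return ch in PLACEHOLDERS or unicodedata.category(ch) == "Lo"

def has_base(out: list) -> bool:
    """Skip back over combining marks and see whether a base precedes them."""
    j = len(out) - 1
    while j >= 0 and unicodedata.category(out[j]) in ("Mn", "Mc"):
        j -= 1
    return j >= 0 and is_base(out[j])

def insert_placeholders(text: str) -> str:
    """Insert U+25CC only before a left matra that has no usable base."""
    out = []
    for ch in text:
        if ch in LEFT_MATRAS and not has_base(out):
            out.append(DOTTED_CIRCLE)
        out.append(ch)
    return "".join(out)

# The "visually encoded" sequence gets a placeholder; the others are untouched.
for seq in ("\u093f\u0915", "\u0915\u093f", "\u0905\u093f\u0941\u093e"):
    print([f"U+{ord(c):04X}" for c in insert_placeholders(seq)])
```

Under these assumptions, unusual but well-formed sequences such as multiple matras on a vowel letter pass through unchanged, and only a left matra with nothing to attach to receives a dotted circle.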

>
> I had to say this, being a scholar of Sanskrit and Vedic, which really
> puts scripts (and hence software support for them) to their limit.
> Pravin (OP on this thread) and I, we have plans for developing a Lohit
> Devanagari Vedic font, so we'll be coming back on this...
>