[HarfBuzz] adjustment to merge_clusters?

Thu Dec 10 07:53:47 PST 2015

On 10/12/15 06:51, Behdad Esfahbod wrote:
> Hi Jonathan,
>
> Sorry for the delay.  I've been thinking about this for multiple days.
> Initially my dislike for this proposal was on several principles:
>
> - Using glyph-props in hb-buffer is a layering violation,

Yes, I figured that would be distasteful.

>
> - Since we are in cluster-level=1 anyway, why include marks forward?

If we're ligating two bases in a sequence such as

   <baseA.0, baseB.1, mark.1>

and don't include marks forward, we'd end up with

   <ligAB.0, mark.1>

which splits the mark from the base to which it's applied. Including 
marks forward avoids this.
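
To make the difference concrete, here is a toy Python model of the two behaviours. It is only a sketch, not HarfBuzz's actual code: the Glyph class, merge_range, ligate and show are names invented for this illustration, and the buffer is a plain list rather than an hb_buffer_t.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    name: str
    cluster: int
    is_mark: bool = False


def merge_range(glyphs, start, end, extend_forward):
    """Give glyphs[start:end] the minimum cluster value found in that range.

    With extend_forward=True, following glyphs that share the cluster of the
    last glyph in the range are pulled into the merge as well.
    """
    cluster = min(g.cluster for g in glyphs[start:end])
    if extend_forward:
        while end < len(glyphs) and glyphs[end].cluster == glyphs[end - 1].cluster:
            end += 1
    for g in glyphs[start:end]:
        g.cluster = cluster


def ligate(glyphs, start, end, lig_name, extend_forward):
    """Merge clusters over [start, end), then replace that range by one glyph."""
    merge_range(glyphs, start, end, extend_forward)
    lig = Glyph(lig_name, glyphs[start].cluster)
    return glyphs[:start] + [lig] + glyphs[end:]


def show(glyphs):
    return "<" + ", ".join(f"{g.name}.{g.cluster}" for g in glyphs) + ">"


def sample():
    # Fresh copy each time, because merge_range modifies the glyphs in place.
    return [Glyph("baseA", 0), Glyph("baseB", 1), Glyph("mark", 1, is_mark=True)]


print(show(ligate(sample(), 0, 2, "ligAB", extend_forward=False)))  # <ligAB.0, mark.1>
print(show(ligate(sample(), 0, 2, "ligAB", extend_forward=True)))   # <ligAB.0, mark.0>
```

Without the forward extension the mark keeps cluster 1 while its base has moved to cluster 0; with it, the mark follows its base.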

>
> - Why extend backward?  One can equally easily build a font that ligates
> forward, instead of backward, and you will have the same problem,

Hmm. I'm not sure I am visualizing the problem scenario you have in mind 
here.

AFAICS, extending backward can only be relevant when reordering has 
happened (so that there's a lower cluster value somewhere within the 
start..end range than the current cluster value of the start glyph -- 
e.g. because start is a left-matra that we've just moved to the front of 
the syllable).

(Aside: maybe it would be a useful micro-optimization to distinguish two 
versions of merge_clusters; one that is used when the shaper (e.g. Indic 
or USE) has reordered things, and does the scan-for-minimum and 
extend-backwards stuff, and a simpler method for use when ligating, 
which doesn't need to do that. This version wouldn't need to do the 
start-of-buffer and continue-in-outbuf check, either.)
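
A rough Python sketch of what I mean by the two versions, operating on a bare list of cluster values. Both function names are hypothetical, and the model ignores the in-buffer/out-buffer split entirely, so treat it as an illustration of the idea rather than a proposed implementation.

```python
def merge_clusters_full(clusters, start, end):
    """General case: scan for the minimum and extend the range both ways."""
    cluster = min(clusters[start:end])
    # Extend forward over glyphs sharing the last glyph's cluster.
    while end < len(clusters) and clusters[end] == clusters[end - 1]:
        end += 1
    # Extend backward over glyphs sharing the first glyph's cluster.
    while start > 0 and clusters[start - 1] == clusters[start]:
        start -= 1
    return clusters[:start] + [cluster] * (end - start) + clusters[end:]


def merge_clusters_simple(clusters, start, end):
    """Hypothetical fast path, valid only if clusters are non-decreasing.

    In that case the minimum is simply clusters[start], and any earlier glyph
    with the same cluster already holds the right value.
    """
    cluster = clusters[start]
    while end < len(clusters) and clusters[end] == clusters[end - 1]:
        end += 1
    return clusters[:start] + [cluster] * (end - start) + clusters[end:]


# No reordering: both versions agree.
monotonic = [0, 0, 1, 2, 2, 3]
print(merge_clusters_full(monotonic, 1, 4))    # [0, 0, 0, 0, 0, 3]
print(merge_clusters_simple(monotonic, 1, 4))  # [0, 0, 0, 0, 0, 3]

# After reordering, only the full version finds the lower cluster value.
reordered = [2, 2, 0, 1]
print(merge_clusters_full(reordered, 1, 4))    # [0, 0, 0, 0]
print(merge_clusters_simple(reordered, 1, 4))  # [2, 2, 2, 2]
```

The second pair of results shows why the simple version would only be safe on paths where the shaper has not reordered anything.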

>
> - So this becomes more about not merging clusters at all, which is indeed
> cluster-level=3.  The problem is, if we do that, it's not clear to me, or I
> suppose to anyone, what the cluster values mean anymore.
>
> Currently, there's a systematic description for what the cluster values mean:
> "these glyphs represent those characters and we don't know anything more
> granular."  With the suggested patch, the cluster values don't mean anything
> anymore, because a glyph from one cluster has leaked into another cluster
> and we're not telling the client that.

Yeah, I agree this makes the meaning of "cluster" less well-defined. 
Though it's not clear to me how far this is really a problem...

>
> BTW, I see Uniscribe returns a different result (equally "wrong" as HarfBuzz's):
>
> $ hb-unicode-encode 20,633,627,644 | hb-shape.exe JNN.ttf
> [lam.l=3+1107|blank=3+1|sa.l=0+1094|space=0+1]
>
> $ hb-unicode-encode 20,633,627,644 | hb-shape.exe JNN.ttf --shaper=uniscribe
> [lam.l=3+1107|blank=3+1|sa.l=2+1094|space=0+1]
>
> I suppose when the ligature for sa.l multiplied, Uniscribe assumed that it has
> decomposed to its original components.
>
>
> Anyway, today I found a use-case that will definitely go wrong with your
> suggested patch.  Imagine another Nastaliq font, that initially decomposes
> each letter to a body, and a connector, in that order.  In a following lookup,
> the connectors might ligate with the body glyph after them.  With your
> suggested patch, we end up allocating the letter bodies to the cluster of
> their previous letter, which is clearly wrong and will result in incorrect
> cursoring.

So the scenario runs something like

   <letterA.0, letterB.1>

   <bodyA.0, connectorA.0, bodyB.1, connectorB.1>

   <bodyA.0, joinedBodyB.0, connectorB.1>

Yes, that's not ideal. :( Though whether it'll result in worse user 
experience than the current

   <bodyA.0, joinedBodyB.0, connectorB.0>

may be hard to say.
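
For anyone who wants to play with these cases, here is a small Python model that reproduces both outcomes. It is a sketch under my own assumptions: Glyph, ligate and show are made-up names, and marks_only=True stands for my reading of the suggested patch (extend forward only over marks), not the patch itself.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    name: str
    cluster: int
    is_mark: bool = False


def ligate(glyphs, start, end, lig_name, marks_only):
    """Replace glyphs[start:end] by one ligature and merge cluster values.

    marks_only=False models the current rule: following glyphs that share the
    cluster of the last ligated glyph are merged too.
    marks_only=True models the suggested change: only following marks are.
    """
    cluster = min(g.cluster for g in glyphs[start:end])
    last = glyphs[end - 1].cluster
    stop = end
    while (stop < len(glyphs) and glyphs[stop].cluster == last
           and (glyphs[stop].is_mark or not marks_only)):
        stop += 1
    tail = [Glyph(g.name, cluster, g.is_mark) for g in glyphs[end:stop]]
    return glyphs[:start] + [Glyph(lig_name, cluster)] + tail + glyphs[stop:]


def show(glyphs):
    return "<" + ", ".join(f"{g.name}.{g.cluster}" for g in glyphs) + ">"


# Body/connector font: connectorA ligates with the following bodyB.
nastaliq = [Glyph("bodyA", 0), Glyph("connectorA", 0),
            Glyph("bodyB", 1), Glyph("connectorB", 1)]
print(show(ligate(nastaliq, 1, 3, "joinedBodyB", marks_only=True)))
# <bodyA.0, joinedBodyB.0, connectorB.1>
print(show(ligate(nastaliq, 1, 3, "joinedBodyB", marks_only=False)))
# <bodyA.0, joinedBodyB.0, connectorB.0>

# Original scenario: glyphB already expanded to glyphB1 + glyphB2.
expanded = [Glyph("glyphA", 0), Glyph("glyphB1", 1), Glyph("glyphB2", 1)]
print(show(ligate(expanded, 0, 2, "glyphAB1", marks_only=True)))
# <glyphAB1.0, glyphB2.1>
print(show(ligate(expanded, 0, 2, "glyphAB1", marks_only=False)))
# <glyphAB1.0, glyphB2.0>
```

The same rule produces the result I wanted in the glyphB2 case and the questionable one in the body/connector case, which is exactly why the lookups alone can't tell the two apart.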

>
> I think we might need to look outside the lookups for clues as to what's
> actually going on, so we can distinguish these two legitimate cases from
> each other.  E.g. when a glyph multiplies into many, we want to know which
> components represent the main body of the letters and which are "side
> components".  Looking at glyph advance widths is a good heuristic, but is
> undesirable during substitution.  How about, we look at the GDEF class of
> Component=4?  That's currently not used for anything AFAIK.  I don't know what
> the implementation will look like, but it's definitely possible to tell, e.g.,
> JNN developers, to give the blank glyph a GDEF class of Component...
>
> WDYT?

Without having tried to think it through in detail, that sounds like a 
promising idea. Worth hacking up an implementation to test, maybe?

JK

> behdad
>
> On 15-11-30 08:30 AM, Jonathan Kew wrote:
>> Hey Behdad,
>>
>> I'm wondering if we can make merge_clusters a little more conservative....?
>>
>> Here's the scenario:
>>
>> Assume we start with two independent base glyphs with distinct cluster numbers:
>>
>>    <glyphA.0, glyphB.1>
>>
>> Then a MultipleSubst lookup expands glyphB to two parts, which both inherit
>> glyphB's cluster value:
>>
>>    <glyphA.0, glyphB1.1, glyphB2.1>
>>
>> Next, a LigatureSubst lookup combines glyphA with glyphB1. Currently, because
>> merge_clusters extends its target range to include any following glyphs that
>> share the same cluster value as the last one in the range, we'll get:
>>
>>    <glyphAB1.0, glyphB2.0>
>>
>> which ISTM is less than ideal. It's not clear to me that there's any totally
>> "right" result here, but what would seem more useful to me, at least, would be
>> to leave glyphB2's cluster untouched:
>>
>>    <glyphAB1.0, glyphB2.1>
>>
>> (In particular, this would resolve
>> https://bugzilla.mozilla.org/show_bug.cgi?id=1212668.)
>>
>> I assume we'd still want to extend the end in merge_clusters when the
>> following glyph(s) are marks, so could we do something like the attached?
>>
>> JK
>