[HarfBuzz] adjustment to merge_clusters?

Behdad Esfahbod behdad at behdad.org
Thu Dec 10 08:26:01 PST 2015


On 15-12-10 04:53 PM, Jonathan Kew wrote:
> On 10/12/15 06:51, Behdad Esfahbod wrote:
>>
>> - Since we are in cluster-level=1 anyway, why include marks forward?
> 
> If we're ligating two bases in a sequence such as
> 
>   <baseA.0, baseB.1, mark.1>

But at cluster-level=1 we'd be at:

   <baseA.0, baseB.1, mark.2>

ie, the mark as its own cluster.  If it had ended up as mark.1, it's because
it ligated and then separated from its base, at which point why differentiate
this from a base that ligated and separated from its previous base?

I think what I'm saying is that if we take your approach, we should only do
the mark-expansion if cluster-level=0.
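
To make the difference between the two levels concrete, here is a minimal
sketch of initial cluster assignment.  It assumes that cluster values start
out as character indices and that "mark" simply means a Unicode general
category beginning with M; the real grapheme logic in HarfBuzz is more
involved, and initial_clusters is a hypothetical helper, not a HarfBuzz
function.

```python
import unicodedata


def initial_clusters(text, cluster_level):
    """Return one initial cluster value per character of `text`.

    Simplified model (an assumption, not HarfBuzz's actual code):
    at level 0 a mark joins the cluster of the preceding character,
    at level 1 every character starts out as its own cluster.
    """
    clusters = []
    for index, ch in enumerate(text):
        is_mark = unicodedata.category(ch).startswith("M")
        if cluster_level == 0 and is_mark and clusters:
            clusters.append(clusters[-1])  # mark shares its base's cluster
        else:
            clusters.append(index)         # character is its own cluster
    return clusters


# baseA, baseB, combining acute accent (a mark)
sample = "ab\u0301"
level0 = initial_clusters(sample, 0)  # base and mark grouped
level1 = initial_clusters(sample, 1)  # mark kept separate
```

With this model, level0 corresponds to <baseA.0, baseB.1, mark.1> and level1
to <baseA.0, baseB.1, mark.2>.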


> and don't include marks forward, we'd end up with
> 
>   <ligAB.0, mark.1>
> 
> which splits the mark from the base to which it's applied. Including marks
> forward avoids this.
> 
>>
>> - Why extend backward?  One can equally easily build a font that ligates
>> forward, instead of backward, and you will have the same problem,
> 
> Hmm. I'm not sure I am visualizing the problem scenario you have in mind here.
> 
> AFAICS, extending backward can only be relevant when reordering has happened
> (so that there's a lower cluster value somewhere within the start..end range
> than the current cluster value of the start glyph -- e.g. because start is a
> left-matra that we've just moved to the front of the syllable).

No, it doesn't have to do with reordering at all.  In the original example, it
was something like:

   <A.0, B.1>
-> <A.0, Bx.1, By.1>
-> <ABx.0, By.0>

whereas you want:

-> <ABx.0, By.1>

What I'm saying is that the multiplication can happen at the first glyph and
ligation with the second:

   <A.0, B.1>
-> <Ax.0, Ay.0, B.1>
-> <Ax.0, AyB.0>

whereas, with the same logic, this would be desired:

-> <Ax.0, AyB.1>
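
The two cases can be played through with a small model of the merging logic.
This is only a sketch under simplifying assumptions: glyphs are [name,
cluster] pairs in a single list (no separate out-buffer, no cluster levels),
and the extend_forward / extend_backward switches and the ligate helper are
hypothetical, added here purely to compare behaviours.

```python
def merge_clusters(glyphs, start, end, extend_forward=True, extend_backward=True):
    """Give glyphs[start:end] the minimum cluster value found in that range.

    Simplified model: optionally widen the range to neighbouring glyphs
    that share a cluster value with the range's last or first glyph.
    """
    if end - start < 2:
        return
    cluster = min(g[1] for g in glyphs[start:end])
    if extend_forward:
        # Pull in following glyphs that share the last glyph's cluster.
        while end < len(glyphs) and glyphs[end][1] == glyphs[end - 1][1]:
            end += 1
    if extend_backward:
        # Pull in preceding glyphs that share the first glyph's cluster.
        while start > 0 and glyphs[start - 1][1] == glyphs[start][1]:
            start -= 1
    for g in glyphs[start:end]:
        g[1] = cluster


def ligate(glyphs, start, end, name, **merge_options):
    """Merge clusters over glyphs[start:end], then replace them by one glyph."""
    merge_clusters(glyphs, start, end, **merge_options)
    glyphs[start:end] = [[name, glyphs[start][1]]]
    return glyphs


# Multiplication at the second glyph, ligation with the first.
forward_on = ligate([["A", 0], ["Bx", 1], ["By", 1]], 0, 2, "ABx")
forward_off = ligate([["A", 0], ["Bx", 1], ["By", 1]], 0, 2, "ABx",
                     extend_forward=False)

# Multiplication at the first glyph, ligation with the second.
backward_on = ligate([["Ax", 0], ["Ay", 0], ["B", 1]], 1, 3, "AyB")
```

In this model forward_on is <ABx.0, By.0> and forward_off is <ABx.0, By.1>,
while backward_on collapses to a single cluster because the merged glyphs
take the minimum cluster value and the range is extended back over Ax.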


> (Aside: maybe it would be a useful micro-optimization to distinguish two
> versions of merge_clusters; one that is used when the shaper (e.g. Indic or
> USE) has reordered things, and does the scan-for-minimum and extend-backwards
> stuff, and a simpler method for use when ligating, which doesn't need to do
> that. This version wouldn't need to do the start-of-buffer and
> continue-in-outbuf check, either.)

It's easier to reason about the one version we have...  Even when ligating we
need to look back in outbuf, indeed, in a case similar to what I showed above
(ie, the first glyph in the ligature is not the first glyph in its cluster.)


>> - So this becomes more about not merging clusters at all, which is indeed
>> cluster-level=2.  The problem is, if we do that, it's not clear to me, or I
>> suppose to anyone, what the cluster values mean anymore.
>>
>> Currently, there's a systematic description for what the cluster values mean:
>> "these glyphs represent those characters and we don't know anything more
>> granular."  With the suggested patch, the cluster values don't mean anything
>> anymore.  Indeed, because a glyph from one cluster leaked into another cluster
>> and we're not telling that to the client.
> 
> Yeah, I agree this makes the meaning of "cluster" less well-defined. Though
> it's not clear to me how far this is really a problem...
> 
>> BTW, I see Uniscribe returns a different result (equally "wrong" as
>> HarfBuzz's):
>>
>> $ hb-unicode-encode 20,633,627,644 | hb-shape.exe JNN.ttf
>> [lam.l=3+1107|blank=3+1|sa.l=0+1094|space=0+1]
>>
>> $ hb-unicode-encode 20,633,627,644 | hb-shape.exe JNN.ttf --shaper=uniscribe
>> [lam.l=3+1107|blank=3+1|sa.l=2+1094|space=0+1]
>>
>> I suppose when the ligature for sa.l multiplied, Uniscribe assumed that it has
>> decomposed to its original components.
>>
>> Anyway, today I found a use-case that will definitely go wrong with your
>> suggested patch.  Imagine another Nastaliq font, that initially decomposes
>> each letter to a body, and a connector, in that order.  In a following lookup,
>> the connectors might ligate with the body glyph after them.  With your
>> suggested patch, we end up allocating the letter bodies to the cluster of
>> their previous letter, which is clearly wrong and will result in incorrect
>> cursoring.
> 
> So the scenario runs something like
> 
>   <letterA.0, letterB.1>
> 
>   <bodyA.0, connectorA.0, bodyB.1, connectorB.1>
> 
>   <bodyA.0, joinedBodyB.0, connectorB.1>
> 
> Yes, that's not ideal. :( Though whether it'll result in worse user experience
> than the current
> 
>   <bodyA.0, joinedBodyB.0, connectorB.0>
> 
> may be hard to say.

In the current code, at least we get the total width of the two letters split
between them.  In the proposed code, the first letter will consume the width
of the second letter's body completely and leave only the stem for the second
letter.  Agreed, neither is ideal.
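
A rough illustration of the width argument, as a sketch only.  The advance
values are made up, the text is assumed to be left-to-right with ascending
cluster values, and the client is assumed to divide a cluster's width evenly
among the characters it covers; per_character_widths is a hypothetical
helper.

```python
def per_character_widths(glyphs, num_chars):
    """Distribute glyph advances over characters via cluster values.

    `glyphs` is a list of (name, cluster, advance) tuples.  Each cluster
    covers the characters from its own value up to the next cluster's
    value, and its total advance is split evenly among them.
    """
    totals = {}
    for _name, cluster, advance in glyphs:
        totals[cluster] = totals.get(cluster, 0) + advance

    starts = sorted(totals)
    widths = [0.0] * num_chars
    for i, start in enumerate(starts):
        stop = starts[i + 1] if i + 1 < len(starts) else num_chars
        share = totals[start] / (stop - start)
        for ch in range(start, stop):
            widths[ch] = share
    return widths


# Hypothetical advances for the three output glyphs.
current = [("bodyA", 0, 500), ("joinedBodyB", 0, 600), ("connectorB", 0, 100)]
proposed = [("bodyA", 0, 500), ("joinedBodyB", 0, 600), ("connectorB", 1, 100)]

current_widths = per_character_widths(current, 2)    # [600.0, 600.0]
proposed_widths = per_character_widths(proposed, 2)  # [1100.0, 100.0]
```

With these numbers the current code gives each letter half of the total,
whereas the proposed code leaves the second letter only the connector's
advance.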


>> I think we might need to look outside the lookups for clues as to what's
>> actually going on, so we can distinguish these two legitimate cases from
>> each other.  Eg. when a glyph multiplies into many, we want to know which
>> components represent the main body of the letters and which are "side
>> components".  Looking at glyph advance widths is a good heuristic, but is
>> undesirable during substitution.  How about, we look at the GDEF class of
>> Component=4?  That's currently not used for anything AFAIK.  I don't know what
>> the implementation will look like, but it's definitely possible to tell, eg,
>> JNN developers, to give the blank glyph a GDEF class of Component...
>>
>> WDYT?
> 
> Without having tried to think it through in detail, that sounds like a
> promising idea. Worth hacking up an implementation to test, maybe?

Let me think about it.

One problem that makes cluster manipulations tricky, and indeed the reason we
go to such lengths when merging, is that if the minimum cluster number (ie,
the start of the text's) disappears from the output glyph buffer, clients will
be confused and probably crash.  Ie, if input is:

  <A.0,B.1,C.2>

output like:

  <X.1,Y.2>

will be bad.  That's pretty much the only guarantee we are making.  That's why
we do not allow deleting glyphs currently, and always merge and extend...
What CoreText does, if the first cluster is lost, is insert a dummy glyph
there, something like:

  <gid65535.0,X.1,Y.2>

We can probably do something similar, and that would make it much easier on
the cluster logic without crashing clients.
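
A minimal sketch of that idea, assuming left-to-right output with ascending
cluster values and glyphs modelled as (glyph, cluster) pairs.  The name
ensure_min_cluster and the PLACEHOLDER_GID constant are hypothetical; the
value merely mirrors the gid65535 in the example above.

```python
PLACEHOLDER_GID = 0xFFFF  # hypothetical dummy glyph id, as in gid65535 above


def ensure_min_cluster(glyphs, min_cluster=0):
    """Return a glyph list in which `min_cluster` is guaranteed to appear.

    If no output glyph carries the minimum cluster value, a placeholder
    glyph with that cluster is prepended; otherwise the list is returned
    as an unchanged copy.
    """
    present = {cluster for _glyph, cluster in glyphs}
    if min_cluster in present:
        return list(glyphs)
    return [(PLACEHOLDER_GID, min_cluster)] + list(glyphs)


lost_first = ensure_min_cluster([("X", 1), ("Y", 2)])
kept_first = ensure_min_cluster([("X", 0), ("Y", 2)])
```

Here lost_first gains a leading (65535, 0) entry, while kept_first is left as
it was because cluster 0 is still represented.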

> JK
> 
>> behdad
>>
>> On 15-11-30 08:30 AM, Jonathan Kew wrote:
>>> Hey Behdad,
>>>
>>> I'm wondering if we can make merge_clusters a little more conservative....?
>>>
>>> Here's the scenario:
>>>
>>> Assume we start with two independent base glyphs with distinct cluster
>>> numbers:
>>>
>>>    <glyphA.0, glyphB.1>
>>>
>>> Then a MultipleSubst lookup expands glyphB to two parts, which both inherit
>>> glyphB's cluster value:
>>>
>>>    <glyphA.0, glyphB1.1, glyphB2.1>
>>>
>>> Next, a LigatureSubst lookup combines glyphA with glyphB1. Currently, because
>>> merge_clusters extends its target range to include any following glyphs that
>>> share the same cluster value as the last one in the range, we'll get:
>>>
>>>    <glyphAB1.0, glyphB2.0>
>>>
>>> which ISTM is less than ideal. It's not clear to me that there's any totally
>>> "right" result here, but what would seem more useful to me, at least, would be
>>> to leave glyphB2's cluster untouched:
>>>
>>>    <glyphAB1.0, glyphB2.1>
>>>
>>> (In particular, this would resolve
>>> https://bugzilla.mozilla.org/show_bug.cgi?id=1212668.)
>>>
>>> I assume we'd still want to extend the end in merge_clusters when the
>>> following glyph(s) are marks, so could we do something like the attached?
>>>
>>> JK
>>
> 
> 

-- 
behdad
http://behdad.org/

