[HarfBuzz] adjustment to merge_clusters?
Jonathan Kew
jfkthame at gmail.com
Thu Dec 10 07:53:47 PST 2015
On 10/12/15 06:51, Behdad Esfahbod wrote:
> Hi Jonathan,
>
> Sorry for the delay. I've been thinking about this for multiple days.
> Initially my dislike for this proposal was on several principles:
>
> - Using glyph-props in hb-buffer is a layering violation,
Yes, I figured that would be distasteful.
>
> - Since we are in cluster-level=1 anyway, why include marks forward?
If we're ligating two bases in a sequence such as
<baseA.0, baseB.1, mark.1>
and don't include marks forward, we'd end up with
<ligAB.0, mark.1>
which splits the mark from the base to which it's applied. Including
marks forward avoids this.
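
To make that concrete, here is a minimal toy model in Python. It is not HarfBuzz code: the Glyph type and the ligate and show helpers are names made up for illustration, and the only thing modelled is the cluster bookkeeping around a single ligature substitution.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    name: str
    cluster: int
    is_mark: bool = False


def ligate(glyphs, start, end, lig_name, include_marks_forward):
    """Replace glyphs[start:end] with one ligature glyph (toy model only)."""
    cluster = min(g.cluster for g in glyphs[start:end])
    last_cluster = glyphs[end - 1].cluster
    out = glyphs[:start] + [Glyph(lig_name, cluster)]
    rest = glyphs[end:]
    i = 0
    if include_marks_forward:
        # Pull following marks that shared the last component's cluster
        # into the ligature's cluster, so they stay with their base.
        while i < len(rest) and rest[i].is_mark and rest[i].cluster == last_cluster:
            out.append(Glyph(rest[i].name, cluster, True))
            i += 1
    out.extend(rest[i:])
    return out


def show(glyphs):
    return "<" + ", ".join(f"{g.name}.{g.cluster}" for g in glyphs) + ">"


seq = [Glyph("baseA", 0), Glyph("baseB", 1), Glyph("mark", 1, True)]
without_marks = show(ligate(seq, 0, 2, "ligAB", include_marks_forward=False))
with_marks = show(ligate(seq, 0, 2, "ligAB", include_marks_forward=True))
print(without_marks)  # mark keeps cluster 1, split from its base
print(with_marks)     # mark follows the ligature into cluster 0
```

In the first result the mark is left behind in cluster 1 even though its base is now part of the cluster-0 ligature; in the second it travels with the base.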
>
> - Why extend backward? One can equally easily build a font that ligates
> forward, instead of backward, and you will have the same problem,
Hmm. I'm not sure I am visualizing the problem scenario you have in mind
here.
AFAICS, extending backward can only be relevant when reordering has
happened (so that there's a lower cluster value somewhere within the
start..end range than the current cluster value of the start glyph --
e.g. because start is a left-matra that we've just moved to the front of
the syllable).
(Aside: maybe it would be a useful micro-optimization to distinguish two
versions of merge_clusters; one that is used when the shaper (e.g. Indic
or USE) has reordered things, and does the scan-for-minimum and
extend-backwards stuff, and a simpler method for use when ligating,
which doesn't need to do that. This version wouldn't need to do the
start-of-buffer and continue-in-outbuf check, either.)
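
A rough sketch of what I mean by the two variants, in Python rather than the real C++, operating on a bare list of cluster values. The function names are hypothetical, the out-buffer handling is left out entirely, and the simple variant assumes cluster values are non-decreasing along the buffer (i.e. nothing has been reordered).

```python
def merge_clusters_full(clusters, start, end):
    """Merge [start, end) when reordering may have happened.

    Scans for the minimum cluster in the range, then extends the range in
    both directions over neighbours that share a cluster with its edges.
    """
    if end - start < 2:
        return
    cluster = min(clusters[start:end])
    # Extend end over following glyphs in the same cluster as the last one.
    while end < len(clusters) and clusters[end - 1] == clusters[end]:
        end += 1
    # Extend start over preceding glyphs in the same cluster as the first one.
    while start > 0 and clusters[start - 1] == clusters[start]:
        start -= 1
    clusters[start:end] = [cluster] * (end - start)


def merge_clusters_simple(clusters, start, end):
    """Merge [start, end) assuming clusters are non-decreasing.

    The first glyph already holds the minimum, and any preceding glyph with
    the same cluster already has the right value, so no scan and no
    backward extension are needed.
    """
    if end - start < 2:
        return
    cluster = clusters[start]
    while end < len(clusters) and clusters[end - 1] == clusters[end]:
        end += 1
    clusters[start:end] = [cluster] * (end - start)


# No reordering: both variants agree.
a, b = [0, 1, 1, 2], [0, 1, 1, 2]
merge_clusters_full(a, 0, 2)
merge_clusters_simple(b, 0, 2)

# Reordered input: a lower cluster value sits inside the range, so only
# the full variant gives a consistent result.
c, d = [1, 1, 0, 2], [1, 1, 0, 2]
merge_clusters_full(c, 1, 3)
merge_clusters_simple(d, 1, 3)
```

On the reordered input the simple variant leaves the glyph before the range unchanged and assigns the wrong value to the range itself, which is why it would only be safe on the ligating path.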
>
> - So this becomes more about not merging clusters at all, which is indeed
> cluster-level=3. The problem is, if we do that, it's not clear to me, or I
> suppose to anyone, what the cluster values mean anymore.
>
> Currently, there's a systematic description for what the cluster values mean:
> "these glyphs represent those characters and we don't know anything more
> granular." With the suggested patch, the cluster values don't mean anything
> anymore, because a glyph from one cluster has leaked into another cluster
> and we're not telling the client about it.
Yeah, I agree this makes the meaning of "cluster" less well-defined.
Though it's not clear to me how far this is really a problem...
>
> BTW, I see Uniscribe returns a different result (equally "wrong" as HarfBuzz's):
>
> $ hb-unicode-encode 20,633,627,644 | hb-shape.exe JNN.ttf
> [lam.l=3+1107|blank=3+1|sa.l=0+1094|space=0+1]
>
> $ hb-unicode-encode 20,633,627,644 | hb-shape.exe JNN.ttf --shaper=uniscribe
> [lam.l=3+1107|blank=3+1|sa.l=2+1094|space=0+1]
>
> I suppose when the ligature for sa.l multiplied, Uniscribe assumed that it has
> decomposed to its original components.
>
>
> Anyway, today I found a use-case that will definitely go wrong with your
> suggested patch. Imagine another Nastaliq font, that initially decomposes
> each letter to a body, and a connector, in that order. In a following lookup,
> the connectors might ligate with the body glyph after them. With your
> suggested patch, we end up allocating the letter bodies to the cluster of
> their previous letter, which is clearly wrong and will result in incorrect
> cursoring.
So the scenario runs something like
<letterA.0, letterB.1>
<bodyA.0, connectorA.0, bodyB.1, connectorB.1>
<bodyA.0, joinedBodyB.0, connectorB.1>
Yes, that's not ideal. :( Though whether it'll result in worse user
experience than the current
<bodyA.0, joinedBodyB.0, connectorB.0>
may be hard to say.
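
For comparison, here is a small Python toy (again not HarfBuzz code; ligate and glyphs_per_cluster are hypothetical helpers, and marks are not modelled at all) that runs the connector+body ligature both ways and then groups the output glyphs by cluster, which is roughly the view a client has when it works out cursor positions.

```python
from collections import defaultdict


def ligate(glyphs, start, end, name, extend_forward):
    """Ligate glyphs[start:end]; glyphs is a list of (name, cluster) tuples."""
    cluster = min(c for _, c in glyphs[start:end])
    last = glyphs[end - 1][1]
    tail = list(glyphs[end:])
    if extend_forward:
        # Current behaviour: following glyphs that shared the last
        # component's cluster are merged into the ligature's cluster.
        i = 0
        while i < len(tail) and tail[i][1] == last:
            tail[i] = (tail[i][0], cluster)
            i += 1
    return glyphs[:start] + [(name, cluster)] + tail


def glyphs_per_cluster(glyphs):
    groups = defaultdict(list)
    for glyph_name, cluster in glyphs:
        groups[cluster].append(glyph_name)
    return dict(groups)


decomposed = [("bodyA", 0), ("connectorA", 0), ("bodyB", 1), ("connectorB", 1)]

current = glyphs_per_cluster(ligate(decomposed, 1, 3, "joinedBodyB", True))
patched = glyphs_per_cluster(ligate(decomposed, 1, 3, "joinedBodyB", False))
print(current)
print(patched)
```

With the current behaviour everything collapses into cluster 0 and letterB has no glyph of its own; with the patch letterB is left with only its connector, while its body is attributed to letterA.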
>
> I think we might need to look outside the lookups for clues as to what's
> actually going on, so we can distinguish these two legitimate cases from
> each other. E.g. when a glyph multiplies into many, we want to know which
> components represent the main body of the letters and which are "side
> components". Looking at glyph advance widths is a good heuristic, but is
> undesirable during substitution. How about, we look at the GDEF class of
> Component=4? That's currently not used for anything AFAIK. I don't know what
> the implementation will look like, but it's definitely possible to tell, e.g.,
> JNN developers, to give the blank glyph a GDEF class of Component...
>
> WDYT?
Without having tried to think it through in detail, that sounds like a
promising idea. Worth hacking up an implementation to test, maybe?
JK
> behdad
>
> On 15-11-30 08:30 AM, Jonathan Kew wrote:
>> Hey Behdad,
>>
>> I'm wondering if we can make merge_clusters a little more conservative....?
>>
>> Here's the scenario:
>>
>> Assume we start with two independent base glyphs with distinct cluster numbers:
>>
>> <glyphA.0, glyphB.1>
>>
>> Then a MultipleSubst lookup expands glyphB to two parts, which both inherit
>> glyphB's cluster value:
>>
>> <glyphA.0, glyphB1.1, glyphB2.1>
>>
>> Next, a LigatureSubst lookup combines glyphA with glyphB1. Currently, because
>> merge_clusters extends its target range to include any following glyphs that
>> share the same cluster value as the last one in the range, we'll get:
>>
>> <glyphAB1.0, glyphB2.0>
>>
>> which ISTM is less than ideal. It's not clear to me that there's any totally
>> "right" result here, but what would seem more useful to me, at least, would be
>> to leave glyphB2's cluster untouched:
>>
>> <glyphAB1.0, glyphB2.1>
>>
>> (In particular, this would resolve
>> https://bugzilla.mozilla.org/show_bug.cgi?id=1212668.)
>>
>> I assume we'd still want to extend the end in merge_clusters when the
>> following glyph(s) are marks, so could we do something like the attached?
>>
>> JK
>