[HarfBuzz] adjustment to merge_clusters?

Thu Dec 10 03:51:22 PST 2015

Hi Jonathan,

Sorry for the delay.  I've been thinking about this for multiple days.
Initially my dislike for this proposal was on several principles:

- Using glyph-props in hb-buffer is a layering violation,

- Since we are in cluster-level=1 anyway, why include marks forward?

- Why extend backward?  One can equally easily build a font that ligates
forward, instead of backward, and you will have the same problem,

- So this becomes more about not merging clusters at all, which is indeed
cluster-level=3.  The problem is, if we do that, it's not clear to me, or I
suppose to anyone, what the cluster values mean anymore.

Currently, there's a systematic description for what the cluster values mean:
"these glyphs represent those characters and we don't know anything more
granular."  With the suggested patch, the cluster values don't mean anything
anymore, because a glyph from one cluster has leaked into another cluster and
we're not telling the client.

BTW, I see Uniscribe returns a different result (equally "wrong" as HarfBuzz's):

$ hb-unicode-encode 20,633,627,644 | hb-shape.exe JNN.ttf
[lam.l=3+1107|blank=3+1|sa.l=0+1094|space=0+1]

$ hb-unicode-encode 20,633,627,644 | hb-shape.exe JNN.ttf --shaper=uniscribe
[lam.l=3+1107|blank=3+1|sa.l=2+1094|space=0+1]

I suppose when the ligature for sa.l multiplied, Uniscribe assumed that it has
decomposed to its original components.
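
To make the difference between the two runs easier to spot, here is a small
sketch that parses the bracketed hb-shape output and lists the glyphs whose
cluster values disagree.  The helper names (parse_hb_shape, cluster_diff) are
made up for illustration and are not part of any HarfBuzz tool; the parser
only looks at the name=cluster part of each item and ignores the rest.

```python
import re

# Each item in hb-shape's default output starts with glyphname=cluster;
# whatever follows (advance, optional offsets) is ignored in this sketch.
ITEM_RE = re.compile(r"([^=]+)=(\d+)")


def parse_hb_shape(line):
    """Return a list of (glyph_name, cluster) pairs from one output line."""
    body = line.strip().strip("[]")
    pairs = []
    for item in body.split("|"):
        m = ITEM_RE.match(item)
        if m:
            pairs.append((m.group(1), int(m.group(2))))
    return pairs


def cluster_diff(a, b):
    """List (index, glyph, cluster_a, cluster_b) where the same glyph at the
    same position carries a different cluster value in the two runs."""
    return [
        (i, name_a, ca, cb)
        for i, ((name_a, ca), (name_b, cb)) in enumerate(zip(a, b))
        if name_a == name_b and ca != cb
    ]


harfbuzz = parse_hb_shape("[lam.l=3+1107|blank=3+1|sa.l=0+1094|space=0+1]")
uniscribe = parse_hb_shape("[lam.l=3+1107|blank=3+1|sa.l=2+1094|space=0+1]")
print(cluster_diff(harfbuzz, uniscribe))
```

For the two lines above, the only disagreement is sa.l, which is in cluster 0
with the default shaper and cluster 2 with Uniscribe.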

Anyway, today I found a use-case that will definitely go wrong with your
suggested patch.  Imagine another Nastaliq font, that initially decomposes
each letter to a body, and a connector, in that order.  In a following lookup,
the connectors might ligate with the body glyph after them.  With your
suggested patch, we end up allocating the letter bodies to the cluster of
their previous letter, which is clearly wrong and will result in incorrect
cursoring.
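
The body/connector case can be played through with a toy model.  This is only
a sketch under assumptions, not the real hb-buffer code: glyphs are plain
[name, cluster] pairs, the glyph names are invented, extend_end=True stands in
for the current behaviour (grow the range over neighbours that share a cluster
value, in both directions), and extend_end=False stands in for the suggested
patch, leaving out its special case for marks.

```python
def merge_clusters(glyphs, start, end, extend_end=True):
    """Give glyphs[start:end] their minimum cluster value (toy model).

    The range always grows backward over glyphs sharing the first glyph's
    cluster.  It grows forward over glyphs sharing the last glyph's cluster
    only when extend_end is True.
    """
    cluster = min(c for _, c in glyphs[start:end])
    if extend_end:
        while end < len(glyphs) and glyphs[end][1] == glyphs[end - 1][1]:
            end += 1
    while start > 0 and glyphs[start - 1][1] == glyphs[start][1]:
        start -= 1
    for glyph in glyphs[start:end]:
        glyph[1] = cluster


def ligate(glyphs, start, end, name, extend_end=True):
    """Replace glyphs[start:end] by one ligature glyph after merging clusters."""
    merge_clusters(glyphs, start, end, extend_end)
    glyphs[start:end] = [[name, glyphs[start][1]]]


def shape(extend_end):
    # Two letters X (cluster 0) and Y (cluster 1), each already decomposed
    # into a body followed by a connector.
    glyphs = [["X.body", 0], ["X.conn", 0], ["Y.body", 1], ["Y.conn", 1]]
    # A later lookup ligates X's connector with the body glyph after it.
    ligate(glyphs, 1, 3, "X.conn_Y.body", extend_end)
    return glyphs


print(shape(True))   # current behaviour in this model
print(shape(False))  # suggested patch in this model
```

In this model the current behaviour puts all three glyphs in cluster 0, which
is coarse but consistent.  Without the forward extension, Y's body sits in
cluster 0 inside the ligature while Y's connector stays in cluster 1, so the
letter is split across two clusters.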

I think we might need to look outside the lookups for clues as to what's
actually going on, so we can distinguish these two legitimate cases from
each other.  Eg. when a glyph multiplies into many, we want to know which
components represent the main body of the letters and which are "side
components".  Looking at glyph advance widths is a good heuristic, but is
undesirable during substitution.  How about we look at the GDEF class of
Component=4?  That's currently not used for anything AFAIK.  I don't know what
the implementation will look like, but it's definitely possible to tell, eg,
JNN developers, to give the blank glyph a GDEF class of Component...

WDYT?
behdad

On 15-11-30 08:30 AM, Jonathan Kew wrote:
> Hey Behdad,
> 
> I'm wondering if we can make merge_clusters a little more conservative....?
> 
> Here's the scenario:
> 
> Assume we start with two independent base glyphs with distinct cluster numbers:
> 
>   <glyphA.0, glyphB.1>
> 
> Then a MultipleSubst lookup expands glyphB to two parts, which both inherit
> glyphB's cluster value:
> 
>   <glyphA.0, glyphB1.1, glyphB2.1>
> 
> Next, a LigatureSubst lookup combines glyphA with glyphB1. Currently, because
> merge_clusters extends its target range to include any following glyphs that
> share the same cluster value as the last one in the range, we'll get:
> 
>   <glyphAB1.0, glyphB2.0>
> 
> which ISTM is less than ideal. It's not clear to me that there's any totally
> "right" result here, but what would seem more useful to me, at least, would be
> to leave glyphB2's cluster untouched:
> 
>   <glyphAB1.0, glyphB2.1>
> 
> (In particular, this would resolve
> https://bugzilla.mozilla.org/show_bug.cgi?id=1212668.)
> 
> I assume we'd still want to extend the end in merge_clusters when the
> following glyph(s) are marks, so could we do something like the attached?
> 
> JK

-- 
behdad
http://behdad.org/