[HarfBuzz] Beginner question: What are cluster levels?

Fri Jan 8 08:47:14 PST 2016

Ok, let me attack this head on.  Hopefully someone will lift the
information here and put it in the documentation.

When you add text to a HB buffer, each character is associated with a
cluster value.  This is an arbitrary number as far as HB is concerned.  Most
clients will use UTF-8, UTF-16, or UTF-32 indices, but the actual number
does not matter.  Moreover, it is not required that the cluster values be
monotonically increasing; pretty much all our testing is with such cluster
numbers, but there is no such assumption in the code itself.  With that in
mind, I'm going to explain what happens with cluster numbers during shaping
under each cluster level.
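
To make the "UTF-8, UTF-16, or UTF-32 indices" remark concrete, here is a
minimal sketch of how a client might derive initial cluster values from a
string.  It does not call HarfBuzz at all; the helper name cluster_values is
made up for this illustration.

```python
def cluster_values(text, encoding="utf-8"):
    """Return one cluster value per character: its code-unit offset in `encoding`."""
    # Size of one code unit in bytes, and the matching BOM-free Python codec.
    unit = {"utf-8": 1, "utf-16": 2, "utf-32": 4}[encoding]
    codec = {"utf-8": "utf-8", "utf-16": "utf-16-le", "utf-32": "utf-32-le"}[encoding]
    clusters, offset = [], 0
    for ch in text:
        clusters.append(offset)
        offset += len(ch.encode(codec)) // unit
    return clusters

text = "a\u00e9\U0001F600b"   # 'a', e-acute, an emoji outside the BMP, 'b'
print(cluster_values(text, "utf-8"))    # [0, 1, 3, 7]
print(cluster_values(text, "utf-16"))   # [0, 1, 2, 4]
print(cluster_values(text, "utf-32"))   # [0, 1, 2, 3]
```

All three numberings are monotonically increasing and distinct, which is the
case the rest of this explanation assumes.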

The conceptual model for what the clusters mean in the monotone modes
(levels 0 and 1) is this: cluster values always remain monotone.  They
represent a number of clusters, to each of which belong one or more glyphs
and one or more characters.  Assuming that the initial cluster numbers were
monotonically increasing and distinct, all (adjacent) glyphs having the
same cluster number belong to that cluster, and each character belongs to
the cluster that has the highest number not larger than its initial cluster
number.  This will become clear with an example:

Let's say we start with the following character sequence and cluster values:

  A,B,C,D,E
  0,1,2,3,4

We then map chars to glyphs and get these glyphs (looks same as the chars):

  A,B,C,D,E
  0,1,2,3,4

Now if, for example, B and C ligate, then the clusters to which they belong
"merge".  The merged cluster gets the minimum of the cluster numbers of the
clusters that went in.  In this case, we get:

  A,BC,D,E
  0,1 ,3,4
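
The merge rule can be sketched in a few lines.  This is a toy model, not
HarfBuzz code: glyphs are (name, cluster) pairs and the function name ligate
is invented for the example.

```python
def ligate(glyphs, start, end, name):
    """Replace glyphs[start:end] with one glyph whose cluster is the minimum of the inputs."""
    merged = min(cluster for _, cluster in glyphs[start:end])
    return glyphs[:start] + [(name, merged)] + glyphs[end:]

glyphs = [("A", 0), ("B", 1), ("C", 2), ("D", 3), ("E", 4)]
print(ligate(glyphs, 1, 3, "BC"))
# [('A', 0), ('BC', 1), ('D', 3), ('E', 4)]
```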

Now let's assume that the BC glyph decomposes into three components, and D
also decomposes into two.  The components all inherit the cluster value of
their parent:

  A,BC0,BC1,BC2,D0,D1,E
  0,1  ,1  ,1  ,3 ,3 ,4
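
Decomposition is the simplest case to model, since each component just
copies its parent's cluster value.  Again a toy sketch with (name, cluster)
pairs; decompose is a hypothetical helper, not a HarfBuzz function.

```python
def decompose(glyphs, index, parts):
    """Replace glyphs[index] with `parts`, each inheriting the parent's cluster."""
    _, cluster = glyphs[index]
    return glyphs[:index] + [(part, cluster) for part in parts] + glyphs[index + 1:]

glyphs = [("A", 0), ("BC", 1), ("D", 3), ("E", 4)]
glyphs = decompose(glyphs, 2, ["D0", "D1"])           # D first, so index 1 stays valid
glyphs = decompose(glyphs, 1, ["BC0", "BC1", "BC2"])
print([cluster for _, cluster in glyphs])
# [0, 1, 1, 1, 3, 3, 4]
```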

Now if BC2 and D0 ligate, then their clusters (1 and 3) merge into
min(1,3)=1, and D1, which shared D0's cluster, comes along:

  A,BC0,BC1,BC2D0,D1,E
  0,1  ,1  ,1    ,1 ,4

What the cluster 1 means at this point is this: character sequence "BCD" is
represented by glyphs "BC0,BC1,BC2D0,D1" and I can't break it down any
further.
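
A client can recover that grouping from the numbers alone by applying the
rule "each character belongs to the cluster with the highest number not
larger than its initial cluster number".  A minimal sketch, assuming
increasing, distinct initial values and that the smallest initial value
survives; clusters_to_chars is a made-up name.

```python
from bisect import bisect_right

def clusters_to_chars(initial, glyph_clusters):
    """Group characters by the highest surviving cluster value <= their initial value."""
    surviving = sorted(set(glyph_clusters))
    groups = {cluster: [] for cluster in surviving}
    for ch, value in initial:
        # Assumes surviving[0] <= value, i.e. the first cluster was kept.
        owner = surviving[bisect_right(surviving, value) - 1]
        groups[owner].append(ch)
    return groups

initial = [("A", 0), ("B", 1), ("C", 2), ("D", 3), ("E", 4)]
glyph_clusters = [0, 1, 1, 1, 1, 4]   # A, BC0, BC1, BC2D0, D1, E
print(clusters_to_chars(initial, glyph_clusters))
# {0: ['A'], 1: ['B', 'C', 'D'], 4: ['E']}
```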

Another common operation in the more complex shapers is reordering.  In
those cases, to maintain monotone clusters, HB merges the clusters of
everything in the reordered sequence.  Eg:

  A,B,C,D,E
  0,1,2,3,4

if D is reordered before B, then we get:

  A,D,B,C,E
  0,1,1,1,4

This is clearly not ideal, but it is the only sensible way to maintain
monotone indices and the true relationship between glyphs and characters.
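
In the same toy model, a backward move under levels 0 and 1 can be sketched
as "move, then give everything in the affected range the minimum cluster".
The helper reorder_monotone is hypothetical and only handles a glyph moving
toward the start.

```python
def reorder_monotone(glyphs, src, dst):
    """Move glyphs[src] to position dst (dst < src) and merge clusters over the range."""
    assert dst < src
    merged = min(cluster for _, cluster in glyphs[dst:src + 1])
    span = [glyphs[src]] + glyphs[dst:src]   # moved glyph, then the glyphs it jumped over
    return glyphs[:dst] + [(name, merged) for name, _ in span] + glyphs[src + 1:]

glyphs = [("A", 0), ("B", 1), ("C", 2), ("D", 3), ("E", 4)]
print(reorder_monotone(glyphs, 3, 1))
# [('A', 0), ('D', 1), ('B', 1), ('C', 1), ('E', 4)]
```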

So, the above is pretty much what cluster levels 0 and 1 do.  The only
difference between the two is that in level 0, at the very beginning of
shaping, we also merge the clusters of base characters and all Unicode
marks (combining or not) following them.  Eg.:

  A,acute,B
  0,1,2

will become:

  A,acute,B
  0,0,2

This is the default behavior.  We do it because Windows did it and old
HarfBuzz did it, so it remained the default.  But it makes it impossible to
color diacritic marks differently from their base characters, which is why
level 1 doesn't do this.  For clients, level 0 is more convenient if they
rely on HarfBuzz clusters for cursor positioning.  But that's wrong anyway:
cursor positions should be determined based on Unicode grapheme boundaries,
NOT shaping clusters.  As such, level 1 clusters are preferred.
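
The extra level 0 step can be approximated outside HarfBuzz with Python's
unicodedata module.  This sketch treats "mark" as Unicode general category
Mn, Mc or Me, which is an assumption about what the shaper checks; the
function name is invented.

```python
import unicodedata

def merge_marks_with_bases(text, clusters):
    """Give every Unicode mark (general category M*) the cluster of the character before it."""
    out = list(clusters)
    for i in range(1, len(text)):
        if unicodedata.category(text[i]).startswith("M"):
            out[i] = out[i - 1]
    return out

text = "A\u0301B"   # A, COMBINING ACUTE ACCENT, B
print(merge_marks_with_bases(text, [0, 1, 2]))
# [0, 0, 2]
```

Under level 1 this pass is simply skipped, so the acute keeps cluster 1 and
can be styled separately from the A.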

One last note about levels 0 and 1.  We currently don't allow a
MultipleSubst lookup to replace a glyph with zero glyphs (ie, delete a
glyph).  But in some other situations, glyphs can be deleted.  In those
cases, we make sure to merge the cluster with a neighboring cluster if the
glyph being deleted is the last glyph of its cluster.  This is mainly to
make sure that the starting cluster of the text always has the cluster
index pointing to the start of the text for the run; more than one client
I know relies on this.  Incidentally, CoreText does something else to
maintain the same promise: it inserts a glyph with id 65535 at the
beginning of the glyph string if the glyph corresponding to the first
character in the run was deleted.  We might do something similar in the
future.
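
One way to picture the deletion rule, for increasing cluster values only.
Which neighbor absorbs the cluster is an assumption of this sketch: it merges
forward only when the deleted glyph opened the run, and otherwise relies on
the "highest number not larger" rule to fold the characters into the
preceding cluster.  delete_glyph here is a toy function, not the HarfBuzz
one.

```python
def delete_glyph(glyphs, index):
    """Delete glyphs[index] while keeping the run's first cluster value alive."""
    _, cluster = glyphs[index]
    rest = glyphs[:index] + glyphs[index + 1:]
    if any(c == cluster for _, c in rest):
        return rest          # another glyph still carries this cluster
    if index > 0 or not rest:
        return rest          # its characters now fall into the preceding cluster
    following = rest[0][1]
    # The first glyph of the run was deleted: the next cluster takes over its value.
    return [(name, min(c, cluster) if c == following else c) for name, c in rest]

print(delete_glyph([("A", 0), ("B", 1), ("C", 2)], 0))
# [('B', 0), ('C', 2)]
```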

Level 2 is a different beast; simple to describe, hard to make sense of.
It simply doesn't do any cluster merging whatsoever.  When things ligate or
otherwise multiple glyphs turn into one, the cluster number of the first
one is retained. Here are a few examples of why processing numbers produced
at this level might be tricky:

- Ligature with combining marks:  Imagine capital letters are bases and
lower case letters are combining marks.  With input sequence like this:

  A,a,B,b,C,c
  0,1,2,3,4,5

if A,B,C ligate, then here's what cluster numbers one would get under the
various levels:

level 0:

  ABC,a,b,c
  0  ,0,0,0

level 1:

  ABC,a,b,c
  0  ,0,0,5

level 2:

  ABC,a,b,c
  0  ,1,3,5

Making sense of the last one is hardest for a client, because there's
nothing in the cluster numbers suggesting that B and C ligated with A.
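
The three results can be reproduced with a small simulation that hard-codes
this one example.  It is only a model of the behavior described here, not
the shaper's algorithm; ligate_bases and its internals are hypothetical.

```python
def ligate_bases(level):
    """Cluster values after A,B,C ligate in A,a,B,b,C,c under the given level."""
    names = ["A", "a", "B", "b", "C", "c"]   # upper case: bases, lower case: marks
    clusters = [0, 1, 2, 3, 4, 5]
    if level == 0:
        # Level 0 first folds each mark into its base's cluster.
        for i, name in enumerate(names):
            if name.islower():
                clusters[i] = clusters[i - 1]
    first, last = 0, 4                        # span covered by the A,B,C ligature
    if level in (0, 1):
        merged = min(clusters[first:last + 1])
        absorbed = set(clusters[first:last + 1])
        # Merge the span, plus any glyph already sharing a cluster with it.
        clusters = [merged if c in absorbed else c for c in clusters]
    marks = [(n, c) for n, c in zip(names, clusters) if n.islower()]
    return [("ABC", clusters[first])] + marks

for level in (0, 1, 2):
    print(level, [c for _, c in ligate_bases(level)])
# 0 [0, 0, 0, 0]
# 1 [0, 0, 0, 5]
# 2 [0, 1, 3, 5]
```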

- Another tricky case is when things reorder.  Under level 2:

  A,B,C,D,E
  0,1,2,3,4

imagine D moves before B:

  A,D,B,C,E
  0,3,1,2,4

now if D ligates with B, we get:

  A,DB,C,E
  0,3 ,2,4

in a different scenario, A and B could have ligated before D reordered;
that would have resulted in:

  AB,D,C,E
  0 ,3,2,4

There's no way to differentiate between these two scenarios based on the
cluster numbers alone.
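
The ambiguity is easy to check in a toy model of level 2, where moves keep
cluster values as they are and a ligature keeps the first glyph's value.
The helpers move and ligate_first are invented for this sketch.

```python
def move(glyphs, src, dst):
    """Move glyphs[src] to position dst without touching any cluster value."""
    glyphs = list(glyphs)
    glyphs.insert(dst, glyphs.pop(src))
    return glyphs

def ligate_first(glyphs, i):
    """Fuse glyphs[i] and glyphs[i+1], keeping the first glyph's cluster."""
    (a, cluster), (b, _) = glyphs[i], glyphs[i + 1]
    return glyphs[:i] + [(a + b, cluster)] + glyphs[i + 2:]

start = [("A", 0), ("B", 1), ("C", 2), ("D", 3), ("E", 4)]
one = ligate_first(move(start, 3, 1), 1)   # D moves before B, then D and B ligate
two = move(ligate_first(start, 0), 2, 1)   # A and B ligate, then D moves before C

print([c for _, c in one])   # [0, 3, 2, 4]
print([c for _, c in two])   # [0, 3, 2, 4]
```

The glyph names differ (DB versus AB), but a client that looks only at the
cluster values sees the same list in both cases.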

Another problem happens with ligatures under level 2 if the direction of
the text is forced to the opposite of its natural direction (eg
left-to-right Arabic).  But that's too much of a corner case.

That's how things work right now.  Level 1 is the most useful; level 2 is
hard to use.  If we were to improve the cluster mapping, we would need to
go beyond using single numbers.  I'll propose one way to do that in a
followup message.

Cheers,
behdad
On Jan 1, 2016 3:53 AM, "Deepak Jois" <deepak.jois at gmail.com> wrote:

> In an off-list discussion with Khaled Hosny regarding a specific case
> of shaping text with Harfbuzz, he suggested setting cluster levels to
> a certain value to improve handling of values returned by Harfbuzz.
>
> I can see that the code has three possible values for cluster levels:
>
> *  HB_BUFFER_CLUSTER_LEVEL_MONOTONE_GRAPHEMES = 0
> *  HB_BUFFER_CLUSTER_LEVEL_MONOTONE_CHARACTERS = 1
> * HB_BUFFER_CLUSTER_LEVEL_CHARACTERS = 2
>
> I looked at the docs, and there doesn’t seem to be any kind of explanation.
>
> So my question is:
>
> 1. What exactly are cluster levels?
> 2. What do the above values mean?
>
> Thanks
> Deepak