[HarfBuzz] Fwd: Harfbuzz with linebreaking

Kelvin Ma kelvinsthirteen at gmail.com
Tue Jun 14 17:20:31 UTC 2016


Okay, so…supposing we had safe-to-break, would this system work?:

The input string is this (where 'REVERSED TEXT' is a piece of right to left
text):

      'Tree Paine’s primary office is in Nashville (REVERSED TEXT). She
works as a publicist.'

This gets split into same-direction segments

    → 'Tree Paine’s primary office is in Nashville (',
    ← 'REVERSED TEXT',
    → '). She works as a publicist.'

Then hb.shape() is called on each segment to get glyphs:

    → [T] [r] [e] [e] [ ] [P] [a] [i] [n] [e] [’] [s] [ ] [p] [r] [i] [m]
[a] [r] [y] [ ] [o] [ffi] [c] [e] [ ] [i] [s] [ ] [i] [n] [ ] [N] [a] [s]
[h] [v] [i] [l] [l] [e] [ ] [(]

      [T] [X] [E] [T] [ ] [D] [E] [S] [R] [E] [V] [E] [R] ←

    → [)] [.] [ ] [S] [h] [e] [ ] [w] [o] [r] [k] [s] [ ] [a] [s] [ ] [a] [
] [p] [u] [b] [l] [i] [c] [i] [st] [.]

My first line is 50 points wide, and I find that the first glyph that
exceeds that is the [e] at the end of

    → [T] [r] [e] [e] [ ] [P] [a] [i] [n] [e] [’] [s] [ ] [p] [r] [i] [m]
[a] [r] [y] [ ] [o] [ffi] [c] [e]

I know from clustering that [e] corresponds to index 26 in the original
string. So I look for the breakpoint with the highest index less than 26,
i.e. one that’s within the string 'Tree’s primary office'.
That breakpoint is the hyphenation breakpoint at index 22 ('Tree Paine’s
primary of' + '-')

Then I reshape the string 'Tree Paine’s primary of-', using safe-to-break
to reuse old glyphs when possible, to get the line

      <line 1> : { [T] [r] [e] [e] [ ] [P] [a] [i] [n] [e] [’] [s] [ ] [p]
[r] [i] [m] [a] [r] [y] [ ] [o] [f] [-] }

By the same process, I use the remaining string to get the remaining glyphs
(note how the ligature [ffi] got broken into [f] and [fi])

    → [fi] [c] [e] [ ] [i] [s] [ ] [i] [n] [ ] [N] [a] [s] [h] [v] [i] [l]
[l] [e] [ ] [(]

      [T] [X] [E] [T] [ ] [D] [E] [S] [R] [E] [V] [E] [R] ←

    → [)] [.] [ ] [S] [h] [e] [ ] [w] [o] [r] [k] [s] [ ] [a] [s] [ ] [a] [
] [p] [u] [b] [l] [i] [c] [i] [st] [.]

I build my second line starting from the [fi] glyph, but I find that the
last glyph in this run, the [(] , still leaves extra space at the end of
the line. The line

    → [fi] [c] [e] [ ] [i] [s] [ ] [i] [n] [ ] [N] [a] [s] [h] [v] [i] [l]
[l] [e] [ ] [(]

is only 38 points long, leaving 12 points of space. So I go on to the next
run, the RTL segment. I find that 12 points of space is only enough to fit
the glyphs

      [E] [T] [ ] [D] [E] [S] [R] [E] [V] [E] [R] ←

Again, clustering tells me this corresponds to index 55 in the original
string ('Tree Paine’s primary office is in Nashville (REVERSED TE'), and
the last breakpoint less than index 55 is the whitespace breakpoint at
index 53. So I shape the RTL string 'REVERSED ' and add it to the 'fice is
in Nashville (' I had from before to get line 2:

      <line 2> :  { [fi] [c] [e] [ ] [i] [s] [ ] [i] [n] [ ] [N] [a] [s]
[h] [v] [i] [l] [l] [e] [ ] [(] }  { [D] [E] [S] [R] [E] [V] [E] [R] } { [
] }

(The braces represent reversals in text direction) Then, the remainder,
shaped from the string 'TEXT', '). She works as a publicist.' leaves

      [T] [X] [E] [T] ←

    → [)] [.] [ ] [S] [h] [e] [ ] [w] [o] [r] [k] [s] [ ] [a] [s] [ ] [a] [
] [p] [u] [b] [l] [i] [c] [i] [st] [.]

The glyphs [T] [X] [E] [T] take up just 7 points of space, leaving 43
points to fit the last LTR run. So in the last run, I find that the [i] is
the first glyph to overrun the 43 points of space, which leads us to divide
the string into '). She works as a pub-', and 'licist.', and shape the two
accordingly. (Remember that at no time did we literally break at the glyphs
[b] and [l], we broke the original string using the cluster value of the [i].)
The shaped glyphs created from the first string get added to the [T] [X]
[E] [T] glyphs we already had in the line, and the shaped glyphs from the
second string go into line 4. So the end result is:


      <line 1> : { [T] [r] [e] [e] [ ] [P] [a] [i] [n] [e] [’] [s] [ ] [p]
[r] [i] [m] [a] [r] [y] [ ] [o] [f] [-] }

      <line 2> : { [fi] [c] [e] [ ] [i] [s] [ ] [i] [n] [ ] [N] [a] [s] [h]
[v] [i] [l] [l] [e] [ ] [(] }  { [D] [E] [S] [R] [E] [V] [E] [R] } { [ ] }

      <line 3> : { [T] [X] [E] [T] }  { [)] [.] [ ] [S] [h] [e] [ ] [w] [o]
[r] [k] [s] [ ] [a] [s] [ ] [a] [ ] [p] [u] [b] [-] }

      <line 4> : { [l] [i] [c] [i] [st] [.] }

Does this make sense?

On Tue, Jun 14, 2016 at 2:09 AM, <kelvinsthirteen at gmail.com> wrote:

>
>
> > On Jun 14, 2016, at 1:29 AM, Simon Cozens <simon at simon-cozens.org>
> wrote:
> >
> >> On 14/06/2016 14:23, kelvinsthirteen at gmail.com wrote:
> >> each word has at least
> >> one (often many) breakpoints, but only one of them gets used per
> >> line.
> >
> > Right.
> >
> >> And the only way to know which one to use is to shape.
> >
> > Well, no. You shaped already; that was the first thing you did. As Adam
> > told you, the only way to know which breakpoint to *use* is to run a
> > justification algorithm, which you need to code. You're currently
> > thinking about a simple first-fit algorithm, which chops the glyphs into
> > lines once they get to be longer than the target line length; that's
> > fine, although you may find that a best-fit algorithm which performs
> > dynamic programming over all possible breakpoints gives you a neater
> > paragraph.
>
> I think here lies the confusion between us, we’re talking about completely
> different layout systems. SILE uses algorithmic layout, which is fine and
> good for a *compiled* "here’s the text source, do your magic on it i don’t
> care" layout engine, but is bad for visual feedback because editing gets
> extremely chaotic when done live.
>
> Knockout’s emphasis is on predictability and a layout model that the user
> can understand—text hits a wall, the last word gets broken into pieces
> (hyphenation) and it broken-off chunk falls onto the next line. The text is
> never allowed to "exceed" the limits or "borrow space" like in quantum
> physics or something. And when you edit it shouldn't affect anything before
> the cursor. Because when you insert a letter "s" somewhere in your
> paragraph, you should be able to have a good idea in your head of what’s
> going to happen.
>
> Now with ligatures and contextual substitution it’s not 100% possible,
> there will be some small back-ripples, but the principle still applies. You
> shouldn’t scramble the whole paragraph because the algorithm changed its
> mind.
>
> Of course you can argue algorithmic layout is better than physical layout,
> but algorithmic layout’s advantage is really with compiled documents, and
> we already have a nice compiled typesetter—SILE. When you edit live,
> physical layout “pisses you off” less than algorithmic layout.
>
> >
> > Now, shaping determines the glyph widths for you (which is the input to
> > your line breaking algorithm), but it is your code which is responsible
> > for finding the *possible* breakpoints in the text, at the language
> > level, and your code which is responsible for determining the *actual*
> > breakpoints at the shaped-glyph level.
> >
> > Here we go then. If you want to use Harfbuzz to shape lines into
> > paragraphs, here is what you need to do:
> >
> > * Perform the first stage of the bidi algorithm to organise the text
> > into same-direction runs. (Really, don't leave this bit out, and don't
> > think "I'll add RTL/complex script support later", because that never
> > works out well and because we already have enough typesetters that only
> > handle Latin.) ICU does this.
> >
>
> I have almost zero knowledge of any scripts besides latin-greek-cyrillic
> other than that they are written "backwards" so you gonna have to explain
> how all this works to me & exactly what a bidi algorithm does.
>
> > * Shape the paragraph, keeping note of the connection between the glyphs
> > and the text. Harfbuzz does this.
> >
> > * Find the breakpoints in the text, using language analysis on the
> > characters. ICU does this.
> >
> > * Create a data structure representing the runs of Harfbuzz output
> > between the breakpoints - TeX and SILE call these runs "nnodes" - and
> > the potential breakpoints themselves - "penalty nodes" (for breaking
> > inside a "word") and "glue nodes" (for whitespace between "words").
> > Assign widths to the nnodes by summing the widths of the shaped glyphs
> > inside them. You can put each glyph into its own nnode instead of
> > consolidating each run into an nnode if it's conceptually easier, but it
> > just slows your justification algorithm down.
>
> Again, algorithmic layout model. In the physical layout model, kerning
> with space glyphs matters, and justification is done by dividing up all the
> remaining space equally & padding each space glyph. You can’t just replace
> them with "glue nodes"
>
> >
> > Here's what my data structure looks like at this stage:
> >
> >
> N<19.71pt>(Take)G<2.6pt>N<22.06pt>(these)G<2.6pt>N<15.37pt>(five)G<2.6pt>N<40.42pt>(sentences)G<2.6pt>N<25.17pt>(which)G<2.6pt>N<2.97pt>(I)G<2.6pt>N<19.95pt>(need)G<2.6pt>N<8.47pt>(to)G<2.6pt>N<23.24pt>(break)G<2.6pt>N<16.69pt>(into)G<2.6pt>N<4.58pt>(a)G<2.6pt>N<42.68pt>(paragraph)N<2.29pt>(.)
> >
> > (Each nnode also contains a list of glyph IDs and widths.) Each of the
> > glue nodes are potential break points; these were obtained by checking
> > the Unicode line break status of each character. The space character
> > 0x20 is breakable, so it gets turned into a glue node.
> >
> > * Run your justification algorithm to determine which breakpoints should
> > be used. Your code does this.
> >
> > * If the algorithm does not produce a tight enough paragraph, break open
> > the nnodes by hyphenating the text, reshaping them into new nnodes, and
> > putting a discretionary breakpoint in the middle.
> >
> > Now it looks like this:
> >
> >
> N<19.71pt>(Take)G<2.64pt>N<22.06pt>(these)G<2.64pt>N<15.37pt>(five)G<2.64pt>N<13.99pt>(sen)D(N<3.36pt>(-)||)N<26.43pt>(tences)G<2.64pt>N<25.17pt>(which)G<2.64pt>N<2.97pt>(I)G<2.64pt>N<19.95pt>(need)G<2.64pt>N<8.47pt>(to)G<2.64pt>N<23.24pt>(break)G<2.64pt>N<16.69pt>(into)G<2.64pt>N<4.58pt>(a)G<2.64pt>N<18.43pt>(para)D(N<3.36pt>(-)||)N<24.24pt>(graph)N<2.29pt>(.)
>
> You are inserting a hyphen glyph. You have to reshape and retest for
> overflow. (actually you’d have to do this no matter what because if you’re
> considering breaking the line, you'd have to reshape the broken text in two
> separate pieces instead of as one string)
>
> >
> > * Run your justification algorithm again on this new node list.
> >
> > On a 100pt column, my algorithm determined that the line breaks are at
> > position 10 and position 22 of the node list array.
>
> Hyphens only exist at linebreaks. So how did you know to hyphenate the
> words at position 10 and 22 before you knew the linebreaks? Or did you
> insert *all* the possible hyphens and then remove all but those two? Either
> way you have to reshape, and if you reshape, your line break computations
> aren’t valid anymore and we are back to where we started.
>
> >
> > * Organise your node list into a list of lines, based on the breakpoints
> > that were fired.
> >
> > I split my node list at positions 10 and 22, so my lines are:
> >
> >
> N<19.71pt>(Take)G<2.64pt>N<22.06pt>(these)G<2.64pt>N<15.37pt>(five)G<2.64pt>N<13.99pt>(sen)D(N<3.36pt>(-)||)N<26.43pt>(tences)G<2.6pt>
> >
> >
> N<25.17pt>(which)G<2.6pt>N<2.97pt>(I)G<2.6pt>N<19.95pt>(need)G<2.6pt>N<8.47pt>(to)G<2.6pt>N<23.24pt>(break)G<2.6pt>N<16.69pt>(into)
> >
> >
> N<4.58pt>(a)G<2.6pt>N<18.43pt>(para)D(N<3.36pt>(-)||)N<24.24pt>(graph)N<2.29pt>(.)
> >
> > * For each line in the paragraph, apply the second part of the bidi
> > algorithm (ICU does this) and reshape where necessary. This splits and
> > recombines ligatures correctly. (I promise; we have a test case to prove
> > this.)
> >
> > You only need to determine line breaks once, and you only need to
> > reshape once per line maximum. I'm not going to argue about whether it
> > works or not, because you can check out the code and the test suite for
> > yourself: https://github.com/simoncozens/sile
> >
> >> In fact I don’t see any other way to do it
> >
> > You need to put aside the idea that there is a connection between
> > shaping and determining which breakpoints to use. There isn't one, and
> > this is the mental block which is stopping you from seeing solutions to
> > your problem.
> >
> > Simon
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.freedesktop.org/archives/harfbuzz/attachments/20160614/8f821683/attachment-0001.html>


More information about the HarfBuzz mailing list