[HarfBuzz] Fwd: Harfbuzz with linebreaking

Tue Jun 14 06:09:40 UTC 2016

> On Jun 14, 2016, at 1:29 AM, Simon Cozens <simon at simon-cozens.org> wrote:
> 
>> On 14/06/2016 14:23, kelvinsthirteen at gmail.com wrote:
>> each word has at least
>> one (often many) breakpoints, but only one of them gets used per
>> line.
> 
> Right.
> 
>> And the only way to know which one to use is to shape.
> 
> Well, no. You shaped already; that was the first thing you did. As Adam
> told you, the only way to know which breakpoint to *use* is to run a
> justification algorithm, which you need to code. You're currently
> thinking about a simple first-fit algorithm, which chops the glyphs into
> lines once they get to be longer than the target line length; that's
> fine, although you may find that a best-fit algorithm which performs
> dynamic programming over all possible breakpoints gives you a neater
> paragraph.

I think here lies the confusion between us, we’re talking about completely different layout systems. SILE uses algorithmic layout, which is fine and good for a *compiled* "here’s the text source, do your magic on it i don’t care" layout engine, but is bad for visual feedback because editing gets extremely chaotic when done live.

Knockout’s emphasis is on predictability and a layout model that the user can understand—text hits a wall, the last word gets broken into pieces (hyphenation) and it broken-off chunk falls onto the next line. The text is never allowed to "exceed" the limits or "borrow space" like in quantum physics or something. And when you edit it shouldn't affect anything before the cursor. Because when you insert a letter "s" somewhere in your paragraph, you should be able to have a good idea in your head of what’s going to happen. 

Now with ligatures and contextual substitution it’s not 100% possible, there will be some small back-ripples, but the principle still applies. You shouldn’t scramble the whole paragraph because the algorithm changed its mind.

Of course you can argue algorithmic layout is better than physical layout, but algorithmic layout’s advantage is really with compiled documents, and we already have a nice compiled typesetter—SILE. When you edit live, physical layout “pisses you off” less than algorithmic layout.

> 
> Now, shaping determines the glyph widths for you (which is the input to
> your line breaking algorithm), but it is your code which is responsible
> for finding the *possible* breakpoints in the text, at the language
> level, and your code which is responsible for determining the *actual*
> breakpoints at the shaped-glyph level.
> 
> Here we go then. If you want to use Harfbuzz to shape lines into
> paragraphs, here is what you need to do:
> 
> * Perform the first stage of the bidi algorithm to organise the text
> into same-direction runs. (Really, don't leave this bit out, and don't
> think "I'll add RTL/complex script support later", because that never
> works out well and because we already have enough typesetters that only
> handle Latin.) ICU does this.
> 

I have almost zero knowledge of any scripts besides latin-greek-cyrillic other than that they are written "backwards" so you gonna have to explain how all this works to me & exactly what a bidi algorithm does.

> * Shape the paragraph, keeping note of the connection between the glyphs
> and the text. Harfbuzz does this.
> 
> * Find the breakpoints in the text, using language analysis on the
> characters. ICU does this.
> 
> * Create a data structure representing the runs of Harfbuzz output
> between the breakpoints - TeX and SILE call these runs "nnodes" - and
> the potential breakpoints themselves - "penalty nodes" (for breaking
> inside a "word") and "glue nodes" (for whitespace between "words").
> Assign widths to the nnodes by summing the widths of the shaped glyphs
> inside them. You can put each glyph into its own nnode instead of
> consolidating each run into an nnode if it's conceptually easier, but it
> just slows your justification algorithm down.

Again, algorithmic layout model. In the physical layout model, kerning with space glyphs matters, and justification is done by dividing up all the remaining space equally & padding each space glyph. You can’t just replace them with "glue nodes"

> 
> Here's what my data structure looks like at this stage:
> 
> N<19.71pt>(Take)G<2.6pt>N<22.06pt>(these)G<2.6pt>N<15.37pt>(five)G<2.6pt>N<40.42pt>(sentences)G<2.6pt>N<25.17pt>(which)G<2.6pt>N<2.97pt>(I)G<2.6pt>N<19.95pt>(need)G<2.6pt>N<8.47pt>(to)G<2.6pt>N<23.24pt>(break)G<2.6pt>N<16.69pt>(into)G<2.6pt>N<4.58pt>(a)G<2.6pt>N<42.68pt>(paragraph)N<2.29pt>(.)
> 
> (Each nnode also contains a list of glyph IDs and widths.) Each of the
> glue nodes are potential break points; these were obtained by checking
> the Unicode line break status of each character. The space character
> 0x20 is breakable, so it gets turned into a glue node.
> 
> * Run your justification algorithm to determine which breakpoints should
> be used. Your code does this.
> 
> * If the algorithm does not produce a tight enough paragraph, break open
> the nnodes by hyphenating the text, reshaping them into new nnodes, and
> putting a discretionary breakpoint in the middle.
> 
> Now it looks like this:
> 
> N<19.71pt>(Take)G<2.64pt>N<22.06pt>(these)G<2.64pt>N<15.37pt>(five)G<2.64pt>N<13.99pt>(sen)D(N<3.36pt>(-)||)N<26.43pt>(tences)G<2.64pt>N<25.17pt>(which)G<2.64pt>N<2.97pt>(I)G<2.64pt>N<19.95pt>(need)G<2.64pt>N<8.47pt>(to)G<2.64pt>N<23.24pt>(break)G<2.64pt>N<16.69pt>(into)G<2.64pt>N<4.58pt>(a)G<2.64pt>N<18.43pt>(para)D(N<3.36pt>(-)||)N<24.24pt>(graph)N<2.29pt>(.)

You are inserting a hyphen glyph. You have to reshape and retest for overflow. (actually you’d have to do this no matter what because if you’re considering breaking the line, you'd have to reshape the broken text in two separate pieces instead of as one string)

> 
> * Run your justification algorithm again on this new node list.
> 
> On a 100pt column, my algorithm determined that the line breaks are at
> position 10 and position 22 of the node list array.

Hyphens only exist at linebreaks. So how did you know to hyphenate the words at position 10 and 22 before you knew the linebreaks? Or did you insert *all* the possible hyphens and then remove all but those two? Either way you have to reshape, and if you reshape, your line break computations aren’t valid anymore and we are back to where we started.

> 
> * Organise your node list into a list of lines, based on the breakpoints
> that were fired.
> 
> I split my node list at positions 10 and 22, so my lines are:
> 
> N<19.71pt>(Take)G<2.64pt>N<22.06pt>(these)G<2.64pt>N<15.37pt>(five)G<2.64pt>N<13.99pt>(sen)D(N<3.36pt>(-)||)N<26.43pt>(tences)G<2.6pt>
> 
> N<25.17pt>(which)G<2.6pt>N<2.97pt>(I)G<2.6pt>N<19.95pt>(need)G<2.6pt>N<8.47pt>(to)G<2.6pt>N<23.24pt>(break)G<2.6pt>N<16.69pt>(into)
> 
> N<4.58pt>(a)G<2.6pt>N<18.43pt>(para)D(N<3.36pt>(-)||)N<24.24pt>(graph)N<2.29pt>(.)
> 
> * For each line in the paragraph, apply the second part of the bidi
> algorithm (ICU does this) and reshape where necessary. This splits and
> recombines ligatures correctly. (I promise; we have a test case to prove
> this.)
> 
> You only need to determine line breaks once, and you only need to
> reshape once per line maximum. I'm not going to argue about whether it
> works or not, because you can check out the code and the test suite for
> yourself: https://github.com/simoncozens/sile
> 
>> In fact I don’t see any other way to do it
> 
> You need to put aside the idea that there is a connection between
> shaping and determining which breakpoints to use. There isn't one, and
> this is the mental block which is stopping you from seeing solutions to
> your problem.
> 
> Simon