<div dir="ltr"><div><div><div><div><div><div>Okay, so…supposing we had safe-to-break, would this system work?:<br><br></div><div>The input string is this (where 'REVERSED TEXT' is a piece of right to left text):<br><br></div><span style="font-family:monospace,monospace"> 'Tree </span><span style="font-family:monospace,monospace"><span style="font-family:monospace,monospace">Paine</span>’s primary office is in Nashville (REVERSED TEXT). She works as a publicist.'<br><br></span></div><div><span style="font-family:arial,helvetica,sans-serif">This gets split into same-direction segments</span><span style="font-family:monospace,monospace"><br></span><br><span style="font-family:monospace,monospace"><span style="font-family:monospace,monospace"> → </span>'Tree Paine’s primary office is in Nashville (',<br></span><span style="font-family:monospace,monospace"><span style="font-family:monospace,monospace"> ← </span>'REVERSED TEXT',<br></span><span style="font-family:monospace,monospace"><span style="font-family:monospace,monospace"><span style="font-family:monospace,monospace"> → </span></span>'). She works as a publicist.'<br></span></div><div><span style="font-family:monospace,monospace"><br></span></div><span style="font-family:arial,helvetica,sans-serif">Then <span style="font-family:monospace,monospace">hb.shape()</span> is called on each segment to get glyphs:</span><span style="font-family:monospace,monospace"><br><br></span></div><span style="font-family:monospace,monospace"> → [T] [r] [e] [e] [ ] [P] [a] [i] [n] [e] [’] [s] [ ] [p] [r] [i] [m] [a] [r] [y] [ ] [o] [ffi] [c] [e] [ ] [i] [s] [ ] [i] [n] [ ] [N] [a] [s] [h] [v] [i] [l] [l] [e] [ ] [(]<br><br></span></div><span style="font-family:monospace,monospace"> [T] [X] [E] [T] [ ] [D] [E] [S] [R] [E] [V] [E] [R] ←<br><br></span></div><span style="font-family:monospace,monospace"> → [)] [.] 
[ ] [S] [h] [e] [ ] [w] [o] [r] [k] [s] [ ] [a] [s] [ ] [a] [ ] [p] [u] [b] [l] [i] [c] [i] [st] [.]<br><br></span></div><span style="font-family:monospace,monospace"><font face="arial,helvetica,sans-serif">My first line is 50 points wide, and I find that the first glyph that exceeds that is the </font>[e]</span><span style="font-family:arial,helvetica,sans-serif"> at the end of </span><span style="font-family:monospace,monospace"></span><br><span style="font-family:monospace,monospace"></span><span style="font-family:monospace,monospace"></span><div><br><span style="font-family:monospace,monospace"><span style="font-family:monospace,monospace"> → [T] [r] [e] [e]</span></span><span style="font-family:monospace,monospace"><span style="font-family:monospace,monospace"><span style="font-family:monospace,monospace"> [ ] [P] [a] [i] [n] [e]</span> [’] [s] [ ] [p] [r] [i] [m] [a] [r] [y] [ ] [o] [ffi] [c] [e]</span><br><br></span></div><div><span style="font-family:monospace,monospace"><font face="arial,helvetica,sans-serif">I know from clustering that [e] corresponds to index 26 in the original string. So I look for the breakpoi</font></span><span style="font-family:arial,helvetica,sans-serif">nt with the highest index less than 26, i.e. 
one that’s within the string </span><span style="font-family:monospace,monospace"></span><span style="font-family:monospace,monospace">'Tree Paine’s primary office'</span><span style="font-family:arial,helvetica,sans-serif">.</span><span style="font-family:monospace,monospace"><br></span></div><div><span style="font-family:arial,helvetica,sans-serif">That breakpoint is the hyphenation breakpoint at index 22 (</span><span style="font-family:monospace,monospace"><span style="font-family:monospace,monospace">'Tree Paine’s primary of</span>' + '-'</span><span style="font-family:arial,helvetica,sans-serif">)</span><span style="font-family:monospace,monospace"><br><br></span></div><div><span style="font-family:arial,helvetica,sans-serif">Then I reshape the string </span><span style="font-family:arial,helvetica,sans-serif"><span style="font-family:monospace,monospace">'Tree Paine’s primary of-'</span>, using safe-to-break to reuse old glyphs when possible, to get the line<br><br></span><span style="font-family:monospace,monospace"> <line 1> : { [T] [r] [e] [e] </span><span style="font-family:monospace,monospace"><span style="font-family:monospace,monospace">[ ] [P] [a] [i] [n] [e]</span> [’] [s] [ ] [p] [r] [i] [m] [a] [r] [y] [ ] [o] [f] [-] }<br><br></span></div><div><span style="font-family:arial,helvetica,sans-serif">By the same process, I use the remaining string to get the remaining glyphs (note how the ligature <span style="font-family:monospace,monospace">[ffi]</span> got broken into <span style="font-family:monospace,monospace">[f]</span> and <span style="font-family:monospace,monospace">[fi]</span>)</span><span style="font-family:monospace,monospace"><br><br></span><span style="font-family:monospace,monospace"> → [fi] [c] [e] [ ]
[i] [s] [ ] [i] [n] [ ] [N] [a] [s] [h] [v] [i] [l] [l] [e] [ ] [(]<br><br></span><span style="font-family:monospace,monospace"> [T] [X] [E] [T] [ ] [D] [E] [S] [R] [E] [V] [E] [R] ←<br><br></span><span style="font-family:monospace,monospace"> → [)] [.] [ ] [S] [h] [e] [ ] [w] [o] [r] [k] [s] [ ] [a] [s] [ ] [a] [ ] [p] [u] [b] [l] [i] [c] [i] [st] [.]<br></span><br></div><div>I build my second line starting from the <span style="font-family:monospace,monospace">[fi]</span> glyph, but I find that the last glyph in this run, the <span style="font-family:monospace,monospace">[(]</span> , still leaves extra space at the end of the line. The line<br><br><span style="font-family:monospace,monospace"><span style="font-family:monospace,monospace"> → </span>[fi] [c] [e] [ ]
[i] [s] [ ] [i] [n] [ ] [N] [a] [s] [h] [v] [i] [l] [l] [e] [ ] [(]<br><br></span></div><div><span style="font-family:monospace,monospace"><font face="arial,helvetica,sans-serif">is only 38 points long, leaving 12 points of space. So I go on to the next run, the RTL segment</font></span><span style="font-family:arial,helvetica,sans-serif">. I find that 12 points of space is only enough to fit the glyphs<br></span><br><span style="font-family:monospace,monospace"> [E] [T] [ ] [D] [E] [S] [R] [E] [V] [E] [R] ←</span></div><div><br></div><div>Again, clustering tells me this corresponds to index 55 in the original string (<span style="font-family:monospace,monospace">'Tree </span><span style="font-family:monospace,monospace"><span style="font-family:monospace,monospace">Paine</span>’s primary office is in Nashville (REVERSED TE'</span><span style="font-family:arial,helvetica,sans-serif">), and the last breakpoint less than index 55 is the whitespace breakpoint at index 53. So I shape the RTL string <span style="font-family:monospace,monospace">'REVERSED '</span> and add it to the </span><span style="font-family:arial,helvetica,sans-serif"><span style="font-family:arial,helvetica,sans-serif"><span style="font-family:monospace,monospace">'fice is in Nashville ('</span></span> I had from before to get line 2:<br><br></span></div><div><span style="font-family:arial,helvetica,sans-serif"><span style="font-family:monospace,monospace"> <line 2></span> : { </span><span style="font-family:arial,helvetica,sans-serif"><span style="font-family:monospace,monospace">[fi] [c] [e] [ ]
[i] [s] [ ] [i] [n] [ ] [N] [a] [s] [h] [v] [i] [l] [l] [e] [ ] [(] } { </span></span><span style="font-family:arial,helvetica,sans-serif"><span style="font-family:monospace,monospace">[D] [E] [S] [R] [E] [V] [E] [R] } { [ ] }<br><br></span></span></div><div><span style="font-family:arial,helvetica,sans-serif">(The braces represent reversals in text direction) Then, the remainder, shaped from the string <span style="font-family:monospace,monospace">'</span></span><span style="font-family:monospace,monospace">TEXT', '). She works as a publicist.'</span><span style="font-family:arial,helvetica,sans-serif"> leaves</span><span style="font-family:monospace,monospace"></span><span style="font-family:arial,helvetica,sans-serif"><br></span><br><span style="font-family:monospace,monospace"> [T] [X] [E] [T] ←<br><br></span><span style="font-family:monospace,monospace"> → [)] [.] [ ] [S] [h] [e] [ ] [w] [o] [r] [k] [s] [ ] [a] [s] [ ] [a] [ ] [p] [u] [b] [l] [i] [c] [i] [st] [.]<br><br></span></div><div><span style="font-family:arial,helvetica,sans-serif">The glyphs </span><span style="font-family:monospace,monospace">[T] [X] [E] [T]</span><span style="font-family:arial,helvetica,sans-serif"> take up just 7 points of space, leaving 43 points to fit the last LTR run. So in the last run, I find that the <span style="font-family:monospace,monospace">[i]</span> is the first glyph to overrun the 43 points of space, which leads us to divide the string into <span style="font-family:monospace,monospace">'). She works as a pub-'</span>, and <span style="font-family:monospace,monospace">'licist.'</span>, and shape the two accordingly. (Remember that at no time did we literally break at the glyphs <span style="font-family:monospace,monospace">[b]</span> and <span style="font-family:monospace,monospace">[l]</span>, we broke the original string using the cluster value of the <span style="font-family:monospace,monospace">[i]</span>.) 
The shaped glyphs created from the first string get added to the </span><span style="font-family:monospace,monospace">[T] [X] [E] [T]</span><span style="font-family:arial,helvetica,sans-serif"> glyphs we already had in the line, and the shaped glyphs from the second string go into line 4. So the end result is:</span><span style="font-family:monospace,monospace"><br><br><br> <line 1> : { [T] [r] [e] [e] [ ] [P] [a] [i] [n] [e] [’] [s] [ ] [p] [r] [i] [m] [a] [r] [y] [ ] [o] [f] [-] }<br><br> <line 2> : { [fi] [c] [e] [ ]
[i] [s] [ ] [i] [n] [ ] [N] [a] [s] [h] [v] [i] [l] [l] [e] [ ] [(] } { [D] [E] [S] [R] [E] [V] [E] [R] } { [ ] }<br><br> <line 3> : { [T] [X] [E] [T] } { [)] [.] [ ] [S] [h] [e] [ ] [w] [o] [r] [k] [s] [ ] [a] [s] [ ] [a] [ ] [p] [u] [b] [-] }<br><br></span></div><div><span style="font-family:monospace,monospace"> <line 4> : { [l] [i] [c] [i] [st] [.] }<br><br></span></div><div><span style="font-family:arial,helvetica,sans-serif">Does this make sense?</span><span style="font-family:monospace,monospace"><br></span></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Jun 14, 2016 at 2:09 AM, <span dir="ltr"><<a href="mailto:kelvinsthirteen@gmail.com" target="_blank">kelvinsthirteen@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class=""><br>
<br>
> On Jun 14, 2016, at 1:29 AM, Simon Cozens <<a href="mailto:simon@simon-cozens.org">simon@simon-cozens.org</a>> wrote:<br>
><br>
>> On 14/06/2016 14:23, <a href="mailto:kelvinsthirteen@gmail.com">kelvinsthirteen@gmail.com</a> wrote:<br>
>> each word has at least<br>
>> one (often many) breakpoints, but only one of them gets used per<br>
>> line.<br>
><br>
> Right.<br>
><br>
>> And the only way to know which one to use is to shape.<br>
><br>
> Well, no. You shaped already; that was the first thing you did. As Adam<br>
> told you, the only way to know which breakpoint to *use* is to run a<br>
> justification algorithm, which you need to code. You're currently<br>
> thinking about a simple first-fit algorithm, which chops the glyphs into<br>
> lines once they get to be longer than the target line length; that's<br>
> fine, although you may find that a best-fit algorithm which performs<br>
> dynamic programming over all possible breakpoints gives you a neater<br>
> paragraph.<br>
<br>
</span>I think this is where the confusion between us lies: we’re talking about completely different layout systems. SILE uses algorithmic layout, which is fine and good for a *compiled* "here’s the text source, do your magic on it, I don’t care" layout engine, but is bad for visual feedback because editing gets extremely chaotic when done live.<br>
<br>
Knockout’s emphasis is on predictability and a layout model that the user can understand: text hits a wall, the last word gets broken into pieces (hyphenation), and the broken-off chunk falls onto the next line. The text is never allowed to "exceed" the limits or "borrow space" like in quantum physics or something. And when you edit, it shouldn't affect anything before the cursor. Because when you insert a letter "s" somewhere in your paragraph, you should be able to have a good idea in your head of what’s going to happen.<br>
<br>
Now, with ligatures and contextual substitution that’s not 100% possible; there will be some small back-ripples, but the principle still applies. You shouldn’t scramble the whole paragraph because the algorithm changed its mind.<br>
<br>
Of course you can argue that algorithmic layout is better than physical layout, but algorithmic layout’s advantage is really with compiled documents, and we already have a nice compiled typesetter in SILE. When you edit live, physical layout “pisses you off” less than algorithmic layout.<br>
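<br>
A minimal sketch of the "text hits a wall" model described above, written as a first-fit loop. Everything here is an assumption made for illustration: <span style="font-family:monospace,monospace">first_fit</span>, the <span style="font-family:monospace,monospace">(index, hyphenate)</span> breakpoint format and the <span style="font-family:monospace,monospace">measure</span> callback (a stand-in for shaping a string and summing its glyph advances) are hypothetical names, not Knockout or HarfBuzz API.<br>
<br>
```python
from typing import Callable, List, Sequence, Tuple


def first_fit(
    text: str,
    breakpoints: Sequence[Tuple[int, bool]],
    line_width: float,
    measure: Callable[[str], float],
) -> List[str]:
    """Greedy first-fit line breaking.

    breakpoints: (index, hyphenate) pairs; a break is allowed just before
    text[index], and hyphenate=True means a '-' is appended to the line.
    measure: hypothetical stand-in for "shape this string and sum advances".
    """
    lines: List[str] = []
    points = sorted(breakpoints)
    start = 0
    while start < len(text):
        rest = text[start:]
        # If everything left fits, it becomes the last line.
        if measure(rest.rstrip()) <= line_width:
            lines.append(rest)
            break
        best = None
        for index, hyphenate in points:
            if index <= start:
                continue
            # Each candidate is measured from scratch, hyphen included.
            candidate = text[start:index] + ("-" if hyphenate else "")
            if measure(candidate.rstrip()) <= line_width:
                best = (index, candidate)  # fits; keep looking for a later one
            else:
                if best is None:
                    best = (index, candidate)  # nothing fits: accept overflow
                break
        if best is None:
            lines.append(rest)  # no breakpoint left at all
            break
        lines.append(best[1])
        start = best[0]
    return lines
```
<br>
Lines that are already finished are never revisited, so an edit can only affect the line it is on and the lines after it. The early exit in the inner loop assumes that longer candidates are never narrower, which ligatures and kerning could violate in rare cases.<br>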
<span class=""><br>
><br>
> Now, shaping determines the glyph widths for you (which is the input to<br>
> your line breaking algorithm), but it is your code which is responsible<br>
> for finding the *possible* breakpoints in the text, at the language<br>
> level, and your code which is responsible for determining the *actual*<br>
> breakpoints at the shaped-glyph level.<br>
><br>
> Here we go then. If you want to use Harfbuzz to shape lines into<br>
> paragraphs, here is what you need to do:<br>
><br>
> * Perform the first stage of the bidi algorithm to organise the text<br>
> into same-direction runs. (Really, don't leave this bit out, and don't<br>
> think "I'll add RTL/complex script support later", because that never<br>
> works out well and because we already have enough typesetters that only<br>
> handle Latin.) ICU does this.<br>
><br>
<br>
</span>I have almost zero knowledge of any scripts besides Latin, Greek, and Cyrillic, other than that they are written "backwards", so you’re going to have to explain to me how all this works and exactly what a bidi algorithm does.<br>
<span class=""><br>
> * Shape the paragraph, keeping note of the connection between the glyphs<br>
> and the text. Harfbuzz does this.<br>
><br>
> * Find the breakpoints in the text, using language analysis on the<br>
> characters. ICU does this.<br>
><br>
> * Create a data structure representing the runs of Harfbuzz output<br>
> between the breakpoints - TeX and SILE call these runs "nnodes" - and<br>
> the potential breakpoints themselves - "penalty nodes" (for breaking<br>
> inside a "word") and "glue nodes" (for whitespace between "words").<br>
> Assign widths to the nnodes by summing the widths of the shaped glyphs<br>
> inside them. You can put each glyph into its own nnode instead of<br>
> consolidating each run into an nnode if it's conceptually easier, but it<br>
> just slows your justification algorithm down.<br>
<br>
</span>Again, that’s the algorithmic layout model. In the physical layout model, kerning with space glyphs matters, and justification is done by dividing up all the remaining space equally and padding each space glyph. You can’t just replace them with "glue nodes".<br>
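<br>
As a rough illustration of that padding rule: the sketch below takes a line of already-shaped glyphs (so any kerning against the space glyphs is assumed to be baked into the advances) and spreads the leftover width equally over the spaces. The function name and the <span style="font-family:monospace,monospace">(glyph, advance)</span> tuples are made up for this example.<br>
<br>
```python
from typing import List, Tuple


def justify_by_padding(
    glyphs: List[Tuple[str, float]], line_width: float
) -> List[Tuple[str, float]]:
    """Pad every space glyph by an equal share of the leftover width.

    glyphs: (name, advance) pairs for one shaped line; hypothetical format.
    """
    natural = sum(advance for _, advance in glyphs)
    space_count = sum(1 for name, _ in glyphs if name == " ")
    remaining = line_width - natural
    if space_count == 0 or remaining <= 0:
        return list(glyphs)  # nothing to distribute
    pad = remaining / space_count
    return [
        (name, advance + pad) if name == " " else (name, advance)
        for name, advance in glyphs
    ]
```
<br>
The space glyphs stay in the glyph stream; only their advances change.<br>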
<span class=""><br>
><br>
> Here's what my data structure looks like at this stage:<br>
><br>
> N<19.71pt>(Take)G<2.6pt>N<22.06pt>(these)G<2.6pt>N<15.37pt>(five)G<2.6pt>N<40.42pt>(sentences)G<2.6pt>N<25.17pt>(which)G<2.6pt>N<2.97pt>(I)G<2.6pt>N<19.95pt>(need)G<2.6pt>N<8.47pt>(to)G<2.6pt>N<23.24pt>(break)G<2.6pt>N<16.69pt>(into)G<2.6pt>N<4.58pt>(a)G<2.6pt>N<42.68pt>(paragraph)N<2.29pt>(.)<br>
><br>
> (Each nnode also contains a list of glyph IDs and widths.) Each of the<br>
> glue nodes is a potential break point; these were obtained by checking<br>
> the Unicode line break status of each character. The space character<br>
> 0x20 is breakable, so it gets turned into a glue node.<br>
><br>
> * Run your justification algorithm to determine which breakpoints should<br>
> be used. Your code does this.<br>
><br>
> * If the algorithm does not produce a tight enough paragraph, break open<br>
> the nnodes by hyphenating the text, reshaping them into new nnodes, and<br>
> putting a discretionary breakpoint in the middle.<br>
><br>
> Now it looks like this:<br>
><br>
> N<19.71pt>(Take)G<2.64pt>N<22.06pt>(these)G<2.64pt>N<15.37pt>(five)G<2.64pt>N<13.99pt>(sen)D(N<3.36pt>(-)||)N<26.43pt>(tences)G<2.64pt>N<25.17pt>(which)G<2.64pt>N<2.97pt>(I)G<2.64pt>N<19.95pt>(need)G<2.64pt>N<8.47pt>(to)G<2.64pt>N<23.24pt>(break)G<2.64pt>N<16.69pt>(into)G<2.64pt>N<4.58pt>(a)G<2.64pt>N<18.43pt>(para)D(N<3.36pt>(-)||)N<24.24pt>(graph)N<2.29pt>(.)<br>
<br>
</span>You are inserting a hyphen glyph. You have to reshape and retest for overflow. (Actually, you’d have to do this no matter what, because if you’re considering breaking the line, you'd have to reshape the broken text as two separate pieces instead of as one string.)<br>
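<br>
A toy example of why the two pieces have to be reshaped. This is not HarfBuzz: <span style="font-family:monospace,monospace">toy_shape</span> is a made-up greedy ligature matcher, and the advances in <span style="font-family:monospace,monospace">TOY_ADVANCES</span> are invented numbers, chosen only to show that the widths do not simply add up.<br>
<br>
```python
from typing import List

# Toy ligature set, longest first so "ffi" wins over "fi".
LIGATURES = ("ffi", "fi", "st")

# Invented advances; every other glyph is assumed to be 1.0 wide.
TOY_ADVANCES = {"ffi": 2.5, "fi": 1.8, "st": 1.8, "-": 0.6}


def toy_shape(text: str) -> List[str]:
    """Greedy stand-in for a shaper: substitute ligatures left to right."""
    glyphs: List[str] = []
    i = 0
    while i < len(text):
        for ligature in LIGATURES:
            if text.startswith(ligature, i):
                glyphs.append(ligature)
                i += len(ligature)
                break
        else:
            glyphs.append(text[i])
            i += 1
    return glyphs


def toy_width(glyphs: List[str]) -> float:
    """Sum the toy advances of a glyph list."""
    return sum(TOY_ADVANCES.get(glyph, 1.0) for glyph in glyphs)


whole = toy_shape("office")  # o, ffi, c, e
head = toy_shape("of-")      # o, f, -
tail = toy_shape("fice")     # fi, c, e
```
<br>
With these toy numbers, the two broken pieces together are wider than the unbroken word, so a width measured before the break cannot be reused after it.<br>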
<span class=""><br>
><br>
> * Run your justification algorithm again on this new node list.<br>
><br>
> On a 100pt column, my algorithm determined that the line breaks are at<br>
> position 10 and position 22 of the node list array.<br>
<br>
</span>Hyphens only exist at linebreaks. So how did you know to hyphenate the words at position 10 and 22 before you knew the linebreaks? Or did you insert *all* the possible hyphens and then remove all but those two? Either way you have to reshape, and if you reshape, your line break computations aren’t valid anymore and we are back to where we started.<br>
<div class="HOEnZb"><div class="h5"><br>
><br>
> * Organise your node list into a list of lines, based on the breakpoints<br>
> that were fired.<br>
><br>
> I split my node list at positions 10 and 22, so my lines are:<br>
><br>
> N<19.71pt>(Take)G<2.64pt>N<22.06pt>(these)G<2.64pt>N<15.37pt>(five)G<2.64pt>N<13.99pt>(sen)D(N<3.36pt>(-)||)N<26.43pt>(tences)G<2.6pt><br>
><br>
> N<25.17pt>(which)G<2.6pt>N<2.97pt>(I)G<2.6pt>N<19.95pt>(need)G<2.6pt>N<8.47pt>(to)G<2.6pt>N<23.24pt>(break)G<2.6pt>N<16.69pt>(into)<br>
><br>
> N<4.58pt>(a)G<2.6pt>N<18.43pt>(para)D(N<3.36pt>(-)||)N<24.24pt>(graph)N<2.29pt>(.)<br>
><br>
> * For each line in the paragraph, apply the second part of the bidi<br>
> algorithm (ICU does this) and reshape where necessary. This splits and<br>
> recombines ligatures correctly. (I promise; we have a test case to prove<br>
> this.)<br>
><br>
> You only need to determine line breaks once, and you only need to<br>
> reshape once per line maximum. I'm not going to argue about whether it<br>
> works or not, because you can check out the code and the test suite for<br>
> yourself: <a href="https://github.com/simoncozens/sile" rel="noreferrer" target="_blank">https://github.com/simoncozens/sile</a><br>
><br>
>> In fact I don’t see any other way to do it<br>
><br>
> You need to put aside the idea that there is a connection between<br>
> shaping and determining which breakpoints to use. There isn't one, and<br>
> this is the mental block which is stopping you from seeing solutions to<br>
> your problem.<br>
><br>
> Simon<br>
</div></div></blockquote></div><br></div>
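<br>
For the first step of the walkthrough at the top of this message (splitting the input into same-direction segments), here is a heavily simplified sketch. It is not the Unicode bidirectional algorithm: it only looks at strong characters, gives neutrals the direction of their neighbours when both sides agree and the base direction otherwise, and ignores digits, bracket pairing and embedding levels. <span style="font-family:monospace,monospace">direction_runs</span> is a hypothetical name; a real implementation would use a library such as ICU.<br>
<br>
```python
import unicodedata
from itertools import groupby
from typing import List, Optional, Tuple


def direction_runs(text: str, base: str = "L") -> List[Tuple[str, str]]:
    """Split text into (direction, substring) runs; simplified, see above."""
    # Strong left-to-right is "L"; strong right-to-left is "R" or "AL".
    classes: List[Optional[str]] = []
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "L":
            classes.append("L")
        elif bidi in ("R", "AL"):
            classes.append("R")
        else:
            classes.append(None)  # treated as neutral in this sketch

    resolved = list(classes)
    i, n = 0, len(classes)
    while i < n:
        if classes[i] is not None:
            i += 1
            continue
        j = i
        while j < n and classes[j] is None:
            j += 1
        before = classes[i - 1] if i > 0 else base
        after = classes[j] if j < n else base
        fill = before if before == after else base
        for k in range(i, j):
            resolved[k] = fill
        i = j

    runs: List[Tuple[str, str]] = []
    pos = 0
    for direction, group in groupby(resolved):
        length = len(list(group))
        runs.append((direction, text[pos:pos + length]))
        pos += length
    return runs
```
<br>
On a string shaped like the example, with real right-to-left letters in place of 'REVERSED TEXT', this keeps the opening parenthesis with the preceding left-to-right run and the closing parenthesis with the following one, matching the three segments in the walkthrough.<br>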