[HarfBuzz] Fwd: Harfbuzz with linebreaking

Tue Jun 14 03:19:38 UTC 2016

Correction: *I will have to shape the entire paragraph* (not “I will have
to shape the entire sentence”).

On Mon, Jun 13, 2016 at 11:16 PM, Kelvin Ma <kelvinsthirteen at gmail.com>
wrote:

>
> On Mon, Jun 13, 2016 at 10:53 PM, Simon Cozens <simon at simon-cozens.org>
> wrote:
>
>> On 14/06/2016 12:42, Kelvin Ma wrote:
>> > What I need is something to bridge that gap between the 1-line of
>> > unbroken text that harfbuzz generates, and the fragments I need to be
>> > able to assemble a multi-line paragraph.
>>
>> Right. You need that, but it's not Harfbuzz's job. Write some code. :-)
>>
>> > The only way to get these
>> > pieces is to find the spots in the shaped text where the whole line can
>> > be shaped in two pieces with an identical result.
>>
>> Wrong. What you need to find is the potential line breaks. That's not a
>> shaping issue specifically; it's a text issue, and needs to be dealt
>> with at the text level.
>
>
> No, this is also a shaping issue and I’ll explain why.
>
> *Take these five sentences which I need to break into a paragraph. The
> shaper is always going to be involved in this. Did you only count two?*
>
> It has potential breakpoints here:
>
> |Take |these |five |sen-|ten-|ces |which |I |need |to |break |into |a
> |para-|graph. |The |sha-|per |is |al-|ways |go-|ing |to |be |in-|vol-|ved
> |in |this. |Did |you |on-|ly |count |two?|
>
> The problem is, I have no idea where, in terms of x-coordinate, any of
> these breakpoints are going to be until I shape them. So I will have to
> shape the entire sentence.
>
> Then I find that the first glyph that overruns the width of the line is
> the ‘e’ in “sentences”:
>
> *Take these five sente*
>
> Now I know that I can cut this down to a correct line break by just
> shaping the text “*Take these five sen-*” and testing to see if that fits
> (with the “safe-to-break” thing, I can probably just keep the old “Take
> these five se” glyphs and append a newly shaped “n-”.)
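
The step described above (find the last breakpoint whose text still fits once the hyphen is added) might be sketched roughly as follows. This is only a toy: `last_fitting_break` and its `measure` callback are hypothetical names, and `len` stands in for a real shaping call, so every character is assumed to be one unit wide.

```python
from typing import Callable, Dict, Optional, Tuple


def last_fitting_break(
    text: str,
    breaks: Dict[int, bool],
    width: int,
    measure: Callable[[str], int] = len,
) -> Optional[Tuple[int, str]]:
    """Return (break index, line text) for the last break that fits, or None.

    breaks maps a text index to True when breaking there needs a hyphen.
    measure is a stand-in for shaping the candidate line and summing advances.
    """
    best = None
    for index in sorted(breaks):
        # Drop the trailing space at a word break, add a hyphen at a syllable break.
        candidate = text[:index].rstrip() + ("-" if breaks[index] else "")
        if measure(candidate) > width:
            break  # assumes later candidates are never narrower
        best = (index, candidate)
    return best


text = "Take these five sentences which"
breaks = {5: False, 11: False, 16: False, 19: True, 22: True, 26: False}
print(last_fitting_break(text, breaks, 20))  # (19, 'Take these five sen-')
```

With a real shaper, each call to `measure` is a shaping run, which is why reusing already-shaped glyphs matters.
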
>
> The problem comes with what to do with the text that comes after the
> breakpoint. Without “safe-to-break” I have to reshape the *entire*
> remainder of the paragraph, the whole text “*tences which I need to break
> into a paragraph. The shaper is always going to be involved in this. Did
> you only count two?*”. If the paragraph is long, this can be a very long
> string. If I had the “safe-to-break” thing, I could keep that portion of
> the originally shaped line, or at worst reshape a “te” or something and
> append the old “*nces which I need to break into a paragraph. The shaper
> is always going to be involved in this. Did you only count two?*” to it.
>
> Without “safe-to-break”, the amount of text that has to be shaped is the
> entire length of the paragraph, PLUS *roughly half the length of the
> paragraph times the number of lines*, because every line break forces a
> reshape of everything after it. That last part is crucial. With
> “safe-to-break” it’s just the length of the paragraph, plus a few bits and
> pieces of fractured text here and there.
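
To put rough numbers on that estimate, here is a small counting sketch. It is a toy model, not a measurement: `reshape_cost` is a hypothetical helper, every line is assumed to hold exactly `line_len` characters, and the "2 characters reshaped per break" figure for the safe-to-break case is an arbitrary assumption.

```python
from typing import Tuple


def reshape_cost(paragraph_len: int, line_len: int) -> Tuple[int, int]:
    """Count characters handed to the shaper, as (without reuse, with reuse)."""
    without_reuse = paragraph_len  # first pass shapes the whole paragraph
    with_reuse = paragraph_len     # same first pass
    remaining = paragraph_len - line_len
    while remaining > 0:
        without_reuse += remaining  # reshape everything after this break
        with_reuse += 2             # assumed: only a couple of characters near the break
        remaining -= line_len
    return without_reuse, with_reuse


print(reshape_cost(1000, 100))  # (5500, 1018)
```

For a 1000-character paragraph set in 100-character lines, the model gives 1000 + 4500 characters without reuse, close to the "paragraph plus half the paragraph times the number of lines" figure (1000 + 5000), against about 1018 with reuse.
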
>
>
>> Taking the example of a ligature, it *is* allowable to break (with
>> hyphenation) in the middle of a ligature like "fi". Indeed, your
>> justification engine might decide, for the good of the rest of the lines
>> in the paragraph, that this is the best place to break. If all you are
>> dealing with is the glyph output from Harfbuzz, you won't be able to
>> spot that breakpoint.
>>
>> Once you get into non-Latin scripts, things get worse. Finding
>> breakpoints is a matter that depends entirely on the rules of the script
>> or language that your text is written in. Right now I'm fighting with
>> Javanese, where line breaks are permissible at the end of syllables. You
>> need to parse the text, not the glyphs, to determine the appropriate
>> breaks. Like others have said: use ICU or similar.
>>
>> And so you need to deal with two sets of information at the same time:
>> the text-level information about breaks, and the shaper-level
>> information about glyphs. This is why Harfbuzz returns you an index into
>> your text string, so that you can keep those two sets of information in
>> sync. The hard part of writing a typesetting system is dealing with the
>> interplay between those two representations of a text.
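
As an illustration of keeping the two sets of information in sync, the sketch below maps a text-level breakpoint to an x-position using per-glyph cluster indices. It does not call HarfBuzz: the `Glyph` record, the `breakpoint_x` helper and the advance values are made up for the example, and it assumes left-to-right text with non-decreasing cluster values.

```python
from typing import List, NamedTuple, Optional


class Glyph(NamedTuple):
    cluster: int  # index of the first text character this glyph covers
    advance: int  # horizontal advance, in arbitrary units


def breakpoint_x(glyphs: List[Glyph], text_len: int, index: int) -> Optional[int]:
    """x-position of a break before text[index], or None if it is inside a cluster."""
    cluster_starts = {g.cluster for g in glyphs} | {text_len}
    if index not in cluster_starts:
        return None  # e.g. the middle of an "fi" ligature: reshaping is needed
    return sum(g.advance for g in glyphs if g.cluster < index)


# "fit" shaped with an "fi" ligature (cluster 0) followed by "t" (cluster 2).
glyphs = [Glyph(cluster=0, advance=550), Glyph(cluster=2, advance=300)]
print(breakpoint_x(glyphs, 3, 2))  # 550
print(breakpoint_x(glyphs, 3, 1))  # None
```

The `None` case is the ligature situation described above: the text says a break is allowed there, but the glyph output alone cannot say where it lands.
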
>>
>
> You are right. But I hope I explained why the shaping information has to
> come before the textual-breakpoint information, because without shaping,
> you don’t know *where* the breakpoints lie, and if you don’t know where
> they lie, they don’t function as breakpoints anymore.
>
>
>>
>> It took me quite a while to get my head around this, and a lot of help
>> from others. You can see the record of me banging my head against this
>> particular wall at https://github.com/simoncozens/sile/issues/179 ,
>> which has a nice explanation of the issues involved.
>>
>
>