[HarfBuzz] Harfbuzz with linebreaking

Kelvin Ma kelvinsthirteen at gmail.com
Wed Jun 15 16:12:15 UTC 2016


[re-sending bc idk why the indents went away]

@ martin

> I don't see how the final space needs to be in its own run here. It's part
> of a single direction RTL run and can stay part of it. There is no need to
> rerun any bidi at this stage of the proceedings. Having said that, space
> generally needs special handling at the end of a line (in effect, cut it
> out, it isn't part of the line being broken or part of the following line
> either). This is true whether you had bidi going on or not. And there are
> lots of spaces in Unicode, as I'm sure you are aware.
>

I always kept the space in the glyph list because I use it to generate
cursor positions, and you need the space to make the last cursor position

> Yes. Notice that you only had to reshape twice per line. In the bad case
> that inserting a hyphen made the shaping result longer than a line, then
> you would need to back up and try again, which is in effect, the cost of
> another line. The costly bit is if you have a long paragraph, the reshaping
> of the 'rest of the paragraph' for each line is costly.



Yeah, that's the problem with this system. In terms of that article
Alexander sent, it’s O(1/2 * n^2). It’d be fine if I could reuse the shaped
glyphs, but without safe-to-break you don't know how far out you have to
recalculate.
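
A rough sketch of where that quadratic cost comes from, under simplifying
assumptions that are not in the original: every line holds the same number
of glyphs, and each line break triggers exactly one reshape of everything
that is left in the paragraph. The function name is made up for
illustration.

```python
def glyphs_shaped(paragraph_len: int, line_len: int) -> int:
    """Total glyphs handed to the shaper when every line break
    reshapes the whole remaining paragraph (simplified model)."""
    total = 0
    remaining = paragraph_len
    while remaining > 0:
        total += remaining     # reshape everything that is left
        remaining -= line_len  # one line's worth of glyphs is consumed
    return total


# 1000 glyphs at 100 glyphs per line: 1000 + 900 + ... + 100
print(glyphs_shaped(1000, 100))  # 5500
```

With a line length of 1 the total is n(n + 1)/2, which is the 1/2 * n^2
growth mentioned above; longer lines divide that by roughly the line
length, but the growth in n stays quadratic.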

> I would suggest that you don't need to reshape if the start of the next
> line is in a different cluster to the end of the previous line. There are
> cases where you may need to do some positional tidying (deciding where the
> new 0 is in the line), but you can't ligate across a cluster boundary (by
> definition in OT). Equally, you should be able to save reshaping for the
> end of a line if there is no text added and you break on a cluster
> boundary. These are important optimisations (which I will probably get
> yelled at for suggesting, but it would be interesting to hear the use cases
> where my presuppositions fall down), because you really don't want to have
> to reshape a long paragraph n times, especially when most of the time you
> will break at a space.
>

I don’t think this works because of contextual substitution, which is very,
very common in cursive fonts. If you separate a cursive pair, you have to
change both glyphs to their separate forms. I’ve also created “ordinary”
serif text fonts (like Noctilucenta
<https://github.com/kelvin13/noctilucenta>) which use chained contextual
substitutions, so that a glyph’s form can be controlled by the presence of a
character hundreds of indices before it. (This is done to provide access to
certain glyph sets like old style numerals or small caps.) That’s why I'm
kind of on your case about safe-to-break, because that would allow us to
detect a crazy font like that and avoid having to reshape out that far if
it’s not needed.
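
A minimal sketch of how a safe-to-break signal could bound the reshaping,
assuming the shaper marks each glyph with a flag saying whether breaking
immediately before it could change the shaping result. The `Glyph` type,
the `unsafe_to_break` field and `reusable_from` are hypothetical names for
illustration, not an existing API.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Glyph:
    cluster: int           # index into the original text
    unsafe_to_break: bool  # hypothetical: True if a break before this
                           # glyph could change how it is shaped


def reusable_from(glyphs: list[Glyph], break_index: int) -> int:
    """Index of the first glyph at or after the break whose cached
    shaping can be reused. Glyphs in [break_index, result) would have
    to be reshaped for the new line."""
    j = break_index
    while j < len(glyphs) and glyphs[j].unsafe_to_break:
        j += 1
    return j


flags = [False, False, True, True, False, False]
glyphs = [Glyph(cluster=i, unsafe_to_break=f) for i, f in enumerate(flags)]
print(reusable_from(glyphs, 2))  # 4: glyphs 2 and 3 need reshaping
print(reusable_from(glyphs, 1))  # 1: nothing needs reshaping
```

For a font with long-range chained contextual substitutions, the flagged
run would simply be long, and the function would report that honestly
instead of the caller having to guess.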

> Of course this all presumes you have a supporting engine that tells you
> line break opportunities for all the languages of the world, including
> hyphenation dictionaries. ICU may be sufficient for your needs, but I do
> encourage you, and everyone, to allow the addition of extra languages to
> your application beyond those you compile for.
>

Right now it doesn’t precalculate line break opportunities; it just checks
whether the overrun landed on a whitespace character, and if it didn’t, it
extracts that one word and runs it through the hyphenation dictionary.
This probably won’t work for languages that don’t hyphenate words (or
languages with no spaces), but it avoids having to run *every* word in the
paragraph through the hyphenation engine, which can take a while.
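
A minimal sketch of the word-extraction step described above, assuming the
text is a plain string and the overrun position is a character index. The
function name is hypothetical, and the real code works on glyph clusters
rather than raw characters.

```python
from __future__ import annotations


def overrun_word(text: str, overrun: int) -> tuple[int, int] | None:
    """Return (start, end) of the word containing text[overrun], or
    None if the overrun landed on whitespace (break there, no
    hyphenation lookup needed)."""
    if text[overrun].isspace():
        return None
    start = overrun
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    end = overrun
    while end < len(text) and not text[end].isspace():
        end += 1
    return start, end


text = "the quick hyphenation test"
span = overrun_word(text, 13)
print(text[span[0]:span[1]])  # hyphenation
```

Only the returned slice would be passed to the hyphenation dictionary, so
the cost per line is one lookup instead of one per word.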

Opening it up to every language would probably be as simple as creating a
way for a foreign breakpoint engine to access the original text string and
supplying the cluster index of the glyph that overran the line limit. And
probably a path backwards for it to return a separator character, like a
hyphen ('-').
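
One possible shape for such an interface, as a sketch only: the engine
receives the original text and the cluster index that overran, and returns
where the next line should start plus a separator to append to the current
line. `BreakEngine`, `find_break` and `WhitespaceEngine` are hypothetical
names, not part of Knockout or HarfBuzz.

```python
from __future__ import annotations

from typing import Protocol


class BreakEngine(Protocol):
    def find_break(self, text: str, overrun_cluster: int) -> tuple[int, str]:
        """Return (cluster where the next line starts, separator to
        append to the current line, e.g. '-' or '')."""
        ...


class WhitespaceEngine:
    """Trivial engine: break after the last space at or before the
    overrun; never inserts a separator."""

    def find_break(self, text: str, overrun_cluster: int) -> tuple[int, str]:
        i = text.rfind(" ", 0, overrun_cluster + 1)
        if i == -1:
            # No break opportunity found; fall back to the overrun itself.
            return overrun_cluster, ""
        return i + 1, ""


engine: BreakEngine = WhitespaceEngine()
print(engine.find_break("hello world", 8))  # (6, '')
```

A hyphenating engine for a particular language would implement the same
method and return '-' (or whatever separator that language uses) along
with the break position.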

> I notice you say you want a very clear, to the user, line breaking
> algorithm and so are going purely line by line, earliest break first. I
> would suggest that for greatest clarity that you not do hyphenation. All
> systems try to avoid hyphenation unless they have to (can't find a break
> within a certain distance of the end of line), otherwise you may find you
> are hyphenating every line. I would give the user the option of turning
> hyphenation on and off and giving a hyphenation zone (or maximum
> raggedness). This doesn't impinge on your single line breaking algorithm,
> it just tries to reduce the likelihood of hyphens turning up. And, as you
> have shown, hyphenation is costly in terms of reshaping.
>

Hyphenation is a styling attribute in Knockout. It’s off by default and can
be activated on a per-class or paragraph-by-paragraph basis. But I’ve found
it’s fine to turn it on for every paragraph because hyphenation only occurs
when it's possible to hyphenate (the part of the word before the line limit
is long enough to contain a hyphenation point), and when hyphenation is not
needed, it’s generally impossible anyway. The only issue I ever had was
that sometimes the hyphenator would get confused and hyphenate on an '-’s'
or a '-ly', which looks bad, but that’s probably a problem with the
dictionary.