[Fribidi-discuss] Bug in wrapping in command-line fribidi?

Tue Mar 25 06:00:17 EST 2003

Dov Grobgeld wrote on 2003-03-25:

> On Tue, Mar 25, 2003 at 12:24:11AM +0200, Beni Cherniavsky wrote:
>
> > For processing huge lines (possible use case: consider an editor
> > working on a logical line that is so long that it doesn't fit the
> > screen) we only need to split step 1.  Perhaps this can be solved by
> > exposing a state structure (probably a stack of embedding levels?)
> > that can be passed from one line segment to the next.  How much
> > lookahead do we need to resolve the levels?  I'm afraid this can't
> > work at least because we need to guess the base direction.
>
> I just finished writing such a patch for the gtk text widget. Please
> see:
>
>     http://bugzilla.gnome.org/show_bug.cgi?id=70451
>
> I'm indeed passing the base direction back and forth between the
> paragraph. Btw, in order to do this, I had to add a new function to
> fribidi, that returns the base direction of a paragraph.
>
Took a look but I'm quite gtk-illiterate :-).  Do you mean that you
process each line separately but pass the base direction info between
lines?

BTW, is there any need to consider incremental bidi computation (e.g.
we know only this part of the string changed)?  I think fribidi was
found to not be a bottleneck even for mlterm, which can have fast
updates.

> > Which reminds me, there should be a separate function exposed for just
> > auto-detecting the base direction.  This will allow the application to
> > intervene and supplement this auto-detection (currently I think it
> > sometimes involves re-running the whole log2vis process).  Custom
> > algorithms might benefit from a separate step that just detects the
> > character categories...
>
> See above.
>
> > In general I think as many separate processing stages should be
> > exposed as possible.  For example a program might understand some
> > "higher-level protocol" and wants to express this knowledge to
> > fribidi.  I think that can be done by supplying initial levels for the
> > characters (and fribidi can increase them according to implicit rules
> > and explicit marks...).
>
> I don't think it makes sense to carry the bidi level between paragraphs.
> But fribidi does already allow using a "high level protocol" for
> determining the base direction.

Here I meant "higher level protocols" for internal structure of the
text, e.g. the program knows that a given part of the text is
logically embedded...  So I thought it could provide initial levels
for every character - and then fribidi would be only increasing these
levels.  On a second thought, the program could just insert
LRE/RLE/PDF marks.

The problem is that frequently a program knows logical embedding
structure but has no idea whether each segment should be an RLE or
LRE.  How do such programs handle it anyway?  E.g. for a
paragraph with spans of nested text, does mozilla apply bidi to
the whole paragraph, or is this handled by gecko, which only applies
implicit bidi to each innermost span?  I know mixing html with bidi
marks is not recommended - is it a good idea to try to support it?  I
think RLM/LRM should be supported but the explicit ones should not.

Use case: imagine a syntax-highlighting editor, e.g. emacs.  Imagine a
program in some language that contains utf-8 strings (there is trouble
if you express bidi marks or any other character in the string using
the language's \escapes; assume for now they are written as-is).  Now
arguably the most sensible thing is to treat all code as level 0 LTR,
while treating the strings (but not the "" string delimiters) as
embedded spans, whose direction is guessed from their content.  This
also allows english-in-hebrew inside the string to be shown properly.
Comments on lines that have code should probably be handled similarly,
while RTL comments that are alone on the line should make the whole
line right-aligned, probably including the comment delimiter (?).
For a final touch, RTL comments to end-of-line might be better off
right-aligned to the screen width.

$ cat PROG | fribidi -t --caprtl -w 80
// bidi-example - EXAMPLE CODE      => // bidi-example - EDOC ELPMAXE
/*                                  => /*
 * WRITTEN BY                       =>                       YB NETTIRW *
 * ME <email at example.com>           =>           <example.com at email> EM *
 *                                  =>  *
 * paragraphs in comments...        =>  * paragraphs in comments...
 */                                 =>  */
int main() // FUNCTION              => int main() // NOITCNUF
{                                   => {
    int x /* ALIGNMENT  */          =>     int x /* TNEMNGILA  */
     = 3; /* TEST       */          =>          /*       TSET */ ;3 =
                                    =>
    printf("ABC def GHI\n");        =>     printf("CBA def IHG\n");
                                    =>
    /* TEST */ if(1) {              =>              } (if(1 /* TSET */
        // indented                 =>         // indented
        return 0;                   =>         return 0;
        // INDENTED                 =>                 DETNEDNI //
    }                               =>     }
    // INDENTED LESS                =>                SSEL DETNEDNI //
}                                   => }

$ cat PROG-MARKED-UP | fribidi -t --caprtl -w 80
// bidi-example - EXAMPLE CODE      => // bidi-example - EDOC ELPMAXE
/*                                  => /*
 * WRITTEN BY                       =>                       YB NETTIRW *
 * ME <email at example.com>           =>           <example.com at email> EM *
 *                                  =>  *
 * paragraphs in comments...        =>  * paragraphs in comments...
 */                                 =>  */
_>int main() //_r FUNCTION_o        => int main() //NOITCNUF
_>{                                 => {
_>    int x /*_r ALIGNMENT  _o*/    =>     int x /*  TNEMNGILA */
_>     = 3; /*_r TEST       _o*/    =>      = 3; /*       TSET */
                                    =>
_>    printf("_rABC def GHI\n_o");  =>     printf("IHG def CBA\n");
                                    =>
_>    /*_r TEST _o*/ if(1) {        =>     /* TSET */ if(1) {
        // indented                 =>         // indented
_>        return 0;                 =>         return 0;
        // INDENTED                 =>                 DETNEDNI //
_>    }                             =>     }
    // INDENTED LESS                =>                SSEL DETNEDNI //
_>}                                 => }

As you see, using bidi marks to communicate the knowledge to fribidi
would work; the main inconvenience is that the direction of each
embedded span must be guessed first - a fribidi function to do just
that would be acceptable.
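
A minimal sketch of how an application might generate such marks
itself, assuming Python's standard unicodedata module stands in for
the missing fribidi function.  The names guess_direction, embed and
mark_up_code_line are hypothetical, and the string-literal regex is
deliberately naive (no escaped quotes, no comments):

```python
import re
import unicodedata

LRM, LRE, RLE, PDF = "\u200e", "\u202a", "\u202b", "\u202c"


def guess_direction(text, default="L"):
    """Return "R" or "L" from the first strong character, else default."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):
            return "R"
        if bidi_class == "L":
            return "L"
    return default


def embed(span):
    """Wrap a span in RLE...PDF or LRE...PDF by its guessed direction."""
    opener = RLE if guess_direction(span) == "R" else LRE
    return opener + span + PDF


# Hypothetical tokenizer: double-quoted strings without escaped quotes.
STRING_LITERAL = re.compile(r'"([^"]*)"')


def mark_up_code_line(line):
    """Force an LTR base with a leading LRM and embed string contents."""
    marked = STRING_LITERAL.sub(
        lambda m: '"' + embed(m.group(1)) + '"', line)
    return LRM + marked
```

The quote characters stay outside the embedding, matching the markup
used in PROG-MARKED-UP above, so only the string contents are
reordered.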

Pay special attention to the if(1) and = 3; lines.  Their base
direction was autodetected as RTL - that's why I added LRMs (_>) on all
lines containing any code.  I'm starting to think unicode is wrong in
following the first character:
  http://www.unicode.org/unicode/reports/tr9/#P2

Rather I think that all content of RLE/LRE...PDF should be skipped
completely, while RLO/LRO...PDF should probably be taken as a strongly
directional character terminating the search.  This way the base
direction is guessed only according to the characters at the top
embedding level.  In particular, the ``= 3; /* TEST */`` line would be
neutral (falling back to LTR, since we are in C mode).  This would
make automatic generation of bidi markers based on logical embedding
knowledge much easier.
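
The proposed rule could be sketched roughly as follows.  This is an
illustration of the suggestion above, not of what TR9 or fribidi
specify; the function name top_level_base_direction is hypothetical,
and Python's unicodedata module is assumed for character classes:

```python
import unicodedata

LRE, RLE, PDF = "\u202a", "\u202b", "\u202c"
LRO, RLO = "\u202d", "\u202e"


def top_level_base_direction(text, default="L"):
    """Guess the base direction from top-level characters only.

    Contents of LRE/RLE...PDF are skipped entirely.  A top-level
    LRO/RLO counts as a strong character and ends the search.
    """
    depth = 0  # nesting depth of explicit embeddings/overrides
    for ch in text:
        if ch in (LRE, RLE):
            depth += 1
        elif ch in (LRO, RLO):
            if depth == 0:
                return "L" if ch == LRO else "R"
            depth += 1  # override nested inside a skipped embedding
        elif ch == PDF:
            if depth > 0:
                depth -= 1
        elif depth == 0:
            bidi_class = unicodedata.bidirectional(ch)
            if bidi_class == "L":
                return "L"
            if bidi_class in ("R", "AL"):
                return "R"
    return default
```

With this rule a line such as ``= 3; /*`` RLE ... PDF ``*/`` has no
strong character at the top level, so it falls back to the default
passed by the caller (LTR for an editor in C mode).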

Perhaps if unicode specified a single neutral embedding mark, with
direction derived implicitly at each embedding level (reliably
overridable by putting LRM/RLM right after the embedding start),
things would be easier.  Maybe fribidi should go ahead and interpret
some PUA character as an "Implicit Logical Embedding"; the specific
code should probably be a parameter defined by the program (defaulting
to none) to avoid clashes.  I'm not sure it's a good idea.

> [snip]
>       1. Find first strong character in whole text. That becomes
>          the bidi dir of the first paragraph.
>       2. Loop over each paragraph:
>            1. Find bidi dir of paragraph. Input *pbase_dir is the
>               weak version of the current bidi dir.

Exactly what I was thinking.

>            2. Layout according to bidi dir.
>
> The result is that weak paragraphs inherit the direction of the previous
> paragraph.
>
> I actually think it is unfortunate that the Unicode bidi dir doesn't
> define this behaviour, but only speaks of a "higher level protocol".
>
They probably wanted complete independence between paragraphs but the
result is an unfortunate inconvenience.  They are now updating TR9
(http://www.unicode.org/unicode/reports/ - there is a "Proposed
Update" version) - is there a chance to convince them to at least
recommend such behaviour for plain text?  Currently some would even
consider using the previous paragraph's direction non-conforming,
although we can always claim this to be a "higher level" knowledge...
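
The quoted loop could be sketched like this, assuming "strong" means
bidi class L, R or AL as reported by Python's unicodedata module.
The function names and the fallback parameter are hypothetical and
are not part of fribidi or of the gtk patch:

```python
import unicodedata


def first_strong(text):
    """Return "L" or "R" for the first strong character, else None."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "L"
        if bidi_class in ("R", "AL"):
            return "R"
    return None


def paragraph_directions(paragraphs, fallback="L"):
    """Give each paragraph a base direction; weak ones inherit."""
    # Step 1: the first strong character in the whole text seeds the
    # direction, so leading weak paragraphs get a sensible value too.
    current = first_strong("".join(paragraphs)) or fallback
    directions = []
    # Step 2: a paragraph with a strong character sets the direction;
    # a weak paragraph keeps the direction of the previous one.
    for paragraph in paragraphs:
        current = first_strong(paragraph) or current
        directions.append(current)
    return directions
```

Here `current` plays the role of the weak *pbase_dir input in the
quoted description: it is only a hint that any strong character in
the paragraph overrides.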

> > Wait a moment!  The fribidi API was derived from a document describing
> > Mozilla's requirements, wasn't it?  Now mozilla does line splitting,
> > with varying fonts and perhaps kerning complications, higher-level
> > protocol (span dir=...), arabic shaping, etc. - a complete nightmare.
> > How do they do it?
>
> Not with fribidi... But really all you have to do is what you said:
>
>   1. resolve levels
>   2. split lines
>   3. reorder all odd bidi levels
>
Indeed, just noticed that TR9 defines just that (search for "Basic
Display Algorithm").
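
A rough sketch of steps 2 and 3, assuming the embedding levels from
step 1 are already available as a list of integers (one per
character).  The line splitting here is a naive fixed-width cut with
no fonts or word boundaries, and both function names are
hypothetical:

```python
def reorder_line(chars, levels):
    """Reverse runs from the highest level down to the lowest odd one.

    For each such level, every maximal run of characters at that
    level or higher is reversed (rule L2 of TR9).
    """
    chars = list(chars)
    odd_levels = [level for level in levels if level % 2]
    if not odd_levels:
        return "".join(chars)  # pure LTR line, nothing to reverse
    for level in range(max(levels), min(odd_levels) - 1, -1):
        i = 0
        while i < len(levels):
            if levels[i] >= level:
                j = i
                while j < len(levels) and levels[j] >= level:
                    j += 1
                chars[i:j] = reversed(chars[i:j])
                i = j
            else:
                i += 1
    return "".join(chars)


def display_lines(text, levels, width):
    """Split the logical text into lines first, then reorder each."""
    lines = []
    for start in range(0, len(text), width):
        end = start + width
        lines.append(reorder_line(text[start:end], levels[start:end]))
    return lines
```

Using the CapRTL convention, display_lines("ABCDEF", [1] * 6, 3)
gives ["CBA", "FED"]: the logically first characters stay on the
first line, which would not hold if the whole paragraph were
reversed before splitting.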

> > [On varying fonts: I think that in some corner cases the points of
> > line-wrapping can't be determined until you know the visual order,
> > because of font/char-dependent spacing at segment boundaries;
> > TeX-style global line breaking with hyphenation probably introduces
> > even more complications...  I wonder whether anybody ever tackled this
> > and whether fribidi should take this into consideration.  I guess any
> > programmer would give up such precision in return for his sanity. :-]
>
> Actually I think that the order above solves that. What makes things
> really insane is kerning at bidi directional boundaries. But that
> may safely be ignored I believe. 8-)
>
Yes, that's what I referred to - such kerning done precisely might(?)
influence the line splitting points, which must be known before we reach
the kerning stage - a cycle...  I agree that it's OK to just do it
imprecisely :-).

* * *

P.S. The current global-state functions (fribidi_set_mirroring(), etc.)
are a bad idea for a library; in the next API version there should be
an options struct that can be set up and passed to all functions.
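
To illustrate the shape of such an API (in Python rather than C, and
with entirely hypothetical names; this is not an existing or planned
fribidi interface), the options travel with each call instead of
living in library-wide state:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BidiOptions:
    """Hypothetical per-call options replacing global setters."""
    mirroring: bool = True


# Small illustrative subset of mirrored pairs, not the full table.
MIRRORED = {"(": ")", ")": "(", "[": "]", "]": "[", "<": ">", ">": "<"}


def reverse_rtl_run(run, options=BidiOptions()):
    """Reverse an all-RTL run, mirroring brackets if requested."""
    out = reversed(run)
    if options.mirroring:
        out = (MIRRORED.get(ch, ch) for ch in out)
    return "".join(out)
```

Two callers in the same process can then use different settings,
e.g. reverse_rtl_run(s) and reverse_rtl_run(s, BidiOptions(mirroring=False)),
without affecting each other.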

-- 
Beni Cherniavsky <cben at tx.technion.ac.il>,
whose 12x CD burner works at 24x with cdrecord in linux - sheer magic!