[Fribidi-discuss] Bug in wrapping in command-line fribidi?

Beni Cherniavsky cben at techunix.technion.ac.il
Sun Mar 23 18:13:55 EST 2003


[Should I cross-mail to ivrix-discuss?  It's not fribidi-specific any
longer - but all people interested in bidi seem to be on
fribidi-discuss anyway :-]

Nadav Har'El wrote on 2003-03-23:

> On Sun, Mar 23, 2003, Beni Cherniavsky wrote about
> "[Fribidi-discuss] Bug in wrapping in command-line fribidi?":
> > The --ltr / --rtl are needed to force the base direction of the whole
> > paragraph to be the same, otherwise different parts of the same line
> > would be auto-detected with different directions.
> >
> > A quick look at the source seems to confirm that wrapping is done on
> > the visual result.  The fix would be a bit involved.  In particular,
>
> The "bidiv" utility I wrote (which is basically a small C utility using
> the fribidi library) does both of the things you wanted: it correctly
> folds lines, and it determines the main direction by paragraph (but you
> also have an option to do it by logical line). It also has a few other
> features that you might like when viewing Hebrew (or Hebrew/English)
> text, like right justification and automatic ISO-8859-8 / UTF-8 recognition.
>
Yeah, I know and I use it.  In any case it doesn't make sense for the
fribidi test program to behave stupidly on rtl text :-).

Bidiv is almost perfect but fribidi is more flexible (more options
:-); in particular capRTL and the various base direction detection
modes are handy.  OTOH, bidiv has the handy ability to give paragraphs
that contain no strong characters at all the direction of the previous
paragraph - and two modes for paragraph detection.


The Geresh editor inspired me to formulate the following symmetric
scale of base direction detection algorithms.  Only 1 and 5 always
give a decisive answer; all the others, to be precise, must be
followed by some fallback algorithm.  Such a chain will be written as
a dotted sequence, e.g. 3.1.

The numbering has the property that any paragraph classified as RTL by
algorithm X will be classified so also by any algorithm Y > X (and
vice versa), where the partial orderings 1 < 2 < 3 < 4 < 5, 1 < B < 5
and 1 < F < 5 hold and fallback sequences are compared
lexicographically.

1. The paragraph is LTR.

2. If the paragraph contains any strong LTR character, it's LTR.

3. The first strong character determines the direction (if the
   paragraph contains no strong characters, the direction is
   undecided).

4. If the paragraph contains any strong RTL character, it's RTL.

5. The paragraph is RTL.

B. The paragraph has the same direction as the previous paragraph.

F. The paragraph has the same direction as the following paragraph.
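
To make the per-paragraph steps 1-5 concrete, here is a minimal sketch
in Python.  It is an illustration only, not code from fribidi, bidiv
or Geresh; the names strong_direction and detect are made up for this
example.  It assumes that "strong" means Unicode bidi class L (LTR) or
R/AL (RTL), as reported by the standard unicodedata module, and it
returns None where the list above says "undecided".  B and F are left
out because they need the neighbouring paragraphs.

```python
import unicodedata


def strong_direction(ch):
    """Return 'L' or 'R' for a strong character, None for anything else."""
    bidi = unicodedata.bidirectional(ch)
    if bidi == "L":
        return "L"
    if bidi in ("R", "AL"):  # Hebrew-type and Arabic-type letters
        return "R"
    return None


def detect(algorithm, paragraph):
    """Apply one of the steps '1'..'5' to a single paragraph.

    Returns 'LTR', 'RTL', or None when the step cannot decide and a
    fallback is needed.
    """
    strong = [d for d in map(strong_direction, paragraph) if d]
    if algorithm == "1":
        return "LTR"
    if algorithm == "5":
        return "RTL"
    if algorithm == "2":
        return "LTR" if "L" in strong else None
    if algorithm == "4":
        return "RTL" if "R" in strong else None
    if algorithm == "3":
        if not strong:
            return None
        return "LTR" if strong[0] == "L" else "RTL"
    raise ValueError("unknown algorithm: %r" % (algorithm,))
```

For a paragraph such as "abc" followed by a Hebrew letter, step 3 sees
the Latin letter first and says LTR, while step 4 says RTL because an
RTL letter is present somewhere.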

There are other algorithms possible; if you have seen any others
actually implemented, please say so, I'd like to include them.  (I'm
only including here algorithms for determining the base paragraph
direction that otherwise do implicit bidi according to the Unicode
algorithm; all the "poor man's bidi" hacks are out of scope.)

Here is an (incomplete) list of the algorithm combinations used (more
than one option means that the program lets the user switch between
them manually):

all-too-many dumb programs: 1
rare pro-Israeli but equally dumb programs ;-): 5
Windows: 1, 5
Gnome: 1
Unicode UAX 9, KDE 3, mlterm: 3.1
Yudit: 3.1
  Yudit has a philosophy of explicit bidi: it converts any input text
  to explicit nested RLE/RLO/LRE/LRO-PDF spans.
fribidi (command-line): 1, 5, 3.1, 3.5
  A program using libfribidi can override of course, by requesting 1
  or 5 according to its own logic.  Here is an example:
bidiv: 3.B.1
Geresh: 4.2.B.F.1 (default), 1, 5, 3.1

4 and 2 have the benefit that they are less fragile than Unicode's 3;
their serious drawback is that they can't be overridden in both ways
(e.g. Geresh's manual gives the tip that you can right-align an
English paragraph with RLM; it doesn't mention that under the 4...
algorithm you can't left-align a paragraph with Hebrew in it).

Note that Geresh's B.F.1 part effectively means that the "document's
initial direction" is determined by the first strongly directional
paragraph.  Sounds familiar, doesn't it :-).  Note that it's not a
complete analogy: later strongly directional paragraphs override
this.

3 is the most useful group and also the one that Unicode suggests
(particularly 3.1).  3.1 is most frequently implemented, perhaps with
the ability to switch to 3.5.  However 3.B is really the most useful.
In fact I think it is always superior to 3.1 and 3.5 and the only
excuse for falling back on them is not having access to the previous
paragraph or cases where strict paragraph independence is required.

Would anybody object to declaring 3.B the Right Thing to do?  So that
one can underline a Hebrew heading with a line of minuses without
using LRM and without a sense of guilt :-).  The look-ahead part (F)
should not be required because it's too inconvenient to implement in
many cases.  The final fallback (1) doesn't really matter here, one
LRM per document is easy enough...


There can also be a classification of algorithms for discerning
paragraph boundaries.  Besides the basic question of how to treat \n
line breaks, there are numerous subtleties, mostly dealing with
indentation.  I need to sleep so I won't devise a classification of
them now; do ``info fmt`` and ``M-x describe-variable
paragraph-start`` :-)...

But I think I have a nice plan to tackle these: all utilities like
`fmt` that depend on paragraph boundaries will become dumb and only
understand PS (the Unicode paragraph separator).  Separately from
them, a single (keep dreaming ;) utility will incorporate all
algorithms for finding paragraph boundaries and convert all \n on
input to PS/LS appropriately.  Some utilities that have "better
insight", like `fold`, can be converted to emit PS/LS intelligently
too.  Some backward-compatibility game will have to be played of
course...  Comments?
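
A rough Python sketch of the split proposed above, to show the shape
of the idea rather than a real tool.  The boundary rule is an
assumption chosen only because it is the simplest one: runs of blank
lines end a paragraph, every other \n is a line break inside a
paragraph.  The function names are made up; the real utility would
carry the indentation subtleties mentioned earlier.

```python
import re

PS = "\u2029"  # Unicode PARAGRAPH SEPARATOR
LS = "\u2028"  # Unicode LINE SEPARATOR


def mark_paragraphs(text):
    """The "smart" side: turn blank-line runs into PS, other newlines into LS."""
    text = text.strip("\n")
    # A newline followed by one or more whitespace-only lines ends a paragraph.
    text = re.sub(r"\n(?:[ \t]*\n)+", PS, text)
    return text.replace("\n", LS)


def split_paragraphs(marked):
    """The "dumb" side: a consumer that only understands PS."""
    return marked.split(PS)
```

For example, mark_paragraphs("a\nb\n\nc\n") yields "a", LS, "b", PS,
"c", and split_paragraphs on that gives two paragraphs.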


BTW, this suggests a similar approach for base paragraph direction:
automatic conversion is possible between inputs to the various
algorithms (by inserting/removing various bidi marks, so as to
preserve the meaning of the text).  The problem that renders it pretty
useless is that you rarely know with which algorithm in mind a given
text was written until you look at it with your own eyes...
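
One direction of such a conversion is easy to sketch in Python, under
the (big) assumption that the intended direction of each paragraph is
already known: make the paragraph come out right under algorithm 3 by
prepending LRM or RLM only when the first strong character would give
the wrong answer.  The names first_strong and pin_direction are
hypothetical, and strong characters are taken to be bidi classes L and
R/AL.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK


def first_strong(paragraph):
    """Algorithm 3: direction of the first strong character, or None."""
    for ch in paragraph:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "L":
            return "LTR"
        if bidi in ("R", "AL"):
            return "RTL"
    return None


def pin_direction(paragraph, intended):
    """Return the paragraph, prefixed with a mark if algorithm 3 needs one."""
    if intended not in ("LTR", "RTL"):
        raise ValueError("intended must be 'LTR' or 'RTL'")
    if first_strong(paragraph) == intended:
        return paragraph  # already unambiguous, leave the text alone
    return (LRM if intended == "LTR" else RLM) + paragraph
```

Since the marks are invisible and are themselves strong characters,
the visible text is unchanged and any 3.x reader now resolves the
paragraph as intended.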

-- 
Beni Cherniavsky <cben at tx.technion.ac.il>,
whose 12x CD burner works at 24x with cdrecord in linux - sheer magic!



