[Fribidi-discuss] Bug in wrapping in command-line fribidi?

Nadav Har'El nyh at math.technion.ac.il
Mon Mar 24 13:04:03 EST 2003


On Mon, Mar 24, 2003, Beni Cherniavsky wrote about "Re: [Fribidi-discuss] Bug in wrapping in command-line fribidi?":
> windows: 1, 5
>...
> bidiv: 3.B.1
>...
> ability to switch to 3.5.  However 3.B is really the most useful.  In
>...
> Would anybody object to declaring 3.B the Right Thing to do?  So that
>...

When I wrote bidiv I made this "3.B.1" (as you call it) the default
because I concluded that it's the only option that makes sense for mixed
English-Hebrew text, where the most important type of text file that I was
thinking of was email.

For email, the Windows "1,5" makes sense only if the email is either 100%
Hebrew or 100% English. In that situation, my "3.B.1" is equivalent to that
"1,5" (if I understand all your numbers and letters correctly :)).

But the typical email I was seeing had English headers
(in which the default direction should be LTR), Hebrew text (RTL), and
many times an English signature, mailing-list marker, etc. (LTR). Sometimes
you'd also have some English paragraphs in that email.

But choosing the base direction independently for each output line (bidiv -l)
wasn't satisfactory, because in rare (but not very rare) cases you'd get an
English word accidentally starting a line in the middle of a paragraph, and
that line would come out wrong. That is when I decided that I should choose
the direction per paragraph (where a paragraph is delimited by an empty line,
as is common practice in plain text files). And when I started getting
annoying cases of separation lines, number headings, etc., ending up on the
wrong side, I decided that if a new paragraph is of as yet unknown direction,
the most sensible thing to do is to use the previous paragraph's direction,
hoping that the text continues in the same language it used previously.
All these heuristics come out to your "3.B.1", I guess.
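
To make the rule above concrete, here is a rough sketch of it in Python. It is
not bidiv's actual code: the function names are made up for illustration, it
works on Unicode strings rather than ISO-8859-8 bytes, and it assumes that a
paragraph's direction is taken from its first strong character and that the
very last resort (no strong character seen yet) is LTR.

```python
import re
import unicodedata


def strong_direction(text):
    """Return 'RTL' or 'LTR' for the first strong character, else None."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):
            return "RTL"
        if bidi_class == "L":
            return "LTR"
    # Only digits, punctuation, whitespace, etc.: direction is unknown.
    return None


def paragraph_directions(text, fallback="LTR"):
    """Pair each empty-line-separated paragraph with a base direction.

    A paragraph with no strong character inherits the direction of the
    previous paragraph; `fallback` (an assumption here) is used only
    until some paragraph has established a direction.
    """
    result = []
    previous = fallback
    for paragraph in re.split(r"\n{2,}", text.strip("\n")):
        if not paragraph:
            continue
        direction = strong_direction(paragraph) or previous
        result.append((paragraph, direction))
        previous = direction
    return result
```

With this sketch, a line of dashes or a numeric heading that follows a Hebrew
paragraph stays RTL, while a file that contains only numbers comes out LTR.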

All in all, I obviously agreed with your conclusion of the "Right Thing
To do" :)

But note that all these heuristics were designed for plain text that does not
contain special paragraph separators, explicit direction indicators, etc. -
because if all these are present (e.g., if Unicode editors start putting them
in the text, liberally, to make sure the output looks like the person
intended) there probably isn't a need for such heuristics at all...


> many cases.  The final fallback (1) doesn't really matter here, one
> LRM per document is easy enough...

:)
I wanted bidiv to do something deterministic on a file that contains only,
say, a bunch of numbers. Before I remembered this case, bidiv actually
defaulted to Hebrew in this case, which made the numbers right-aligned,
which was pretty weird.

Bidiv is supposed to work on plain ISO-8859-8 text files, and even ASCII,
so demanding LRM (which only exists in Unicode) was out of the question.

> There can also be classification of algorithms for discerning
> paragraph boundaries.  Beside the besic question of how to treat \n
> line breaks, there are numerous subtleties, mostly dealing with
> indentation.  I need to sleep so I won't devise a classification of
> them now, do ``info fmt`` and ``M-x describe-variable
> paragraph-start`` :-)...

Yes, I guess bidiv can be made a little smarter on this, rather than
treating \n\n as the only paragraph-break marker.

But the problem is, the smarter you make your heuristics, the smaller the
chance that they will do what the author of the text file expected, assuming
that the author does not use the same program as you to view the file
he created.


One thing that does annoy me (but I still haven't done anything about) is
emails containing quotes like:

        On January 1, 2003, John Doe wrote:
        > SOME HEBREW STUFF
        > MORE HEBREW STUFF

Here the attribution string is in one language, but the actual quoted text
is in a different language. One can argue (and I do!) that it is wrong of a
person to add an attribution string in a different language from the rest
of the email, but many people still do. A possible solution for a more
email-tuned bidiv could be to treat a change in the number of ">" marks
at the beginning of a line as a paragraph break. Obviously,
this extra heuristic will look very arbitrary and non-general...
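
Just to illustrate the idea, a minimal Python sketch of such a splitter might
look like the following. Nothing like this exists in bidiv; the function names
and the exact way ">" marks are counted (blanks between marks are ignored) are
assumptions made for the example.

```python
import re


def quote_depth(line):
    """Count the '>' marks at the start of a line, ignoring blanks between them."""
    match = re.match(r"[ \t]*((?:>[ \t]*)*)", line)
    return match.group(1).count(">")


def split_paragraphs(lines):
    """Group lines into paragraphs.

    A paragraph ends at an empty line, or (the extra heuristic) wherever
    the number of leading '>' marks differs from the previous line.
    """
    paragraphs = []
    current = []
    previous_depth = None
    for line in lines:
        if line.strip() == "":
            # Ordinary paragraph break on an empty line.
            if current:
                paragraphs.append(current)
            current = []
            previous_depth = None
            continue
        depth = quote_depth(line)
        if current and depth != previous_depth:
            # Quote depth changed: start a new paragraph here.
            paragraphs.append(current)
            current = []
        current.append(line)
        previous_depth = depth
    if current:
        paragraphs.append(current)
    return paragraphs
```

With this, the attribution line and the quoted Hebrew lines in the example
above end up in separate paragraphs, so each could get its own direction.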

-- 
Nadav Har'El                        |     Monday, Mar 24 2003, 21 Adar II 5763
nyh at math.technion.ac.il             |-----------------------------------------
Phone: +972-53-245868, ICQ 13349191 |"I don't use drugs, my dreams are
http://nadav.harel.org.il           |frightening enough." -- M. C. Escher



