[Fribidi-discuss] Bug in wrapping in command-line fribidi?

Beni Cherniavsky cben at techunix.technion.ac.il
Mon Mar 24 13:41:14 EST 2003


Nadav Har'El wrote on 2003-03-24:

> On Mon, Mar 24, 2003, Beni Cherniavsky wrote about "Re: [Fribidi-discuss] Bug in wrapping in command-line fribidi?":
> > windows: 1, 5
> >...
> > bidiv: 3.B.1
> >...
> > ability to switch to 3.5.  However 3.B is really the most useful.  In
> >...
> > Would anybody object to declaring 3.B the Right Thing to do?  So that
> >...
>
> [snip]
>
> All in all, I obviously agreed with your conclusion of the "Right Thing
> To do" :)
>
> But note all these heuristics were designed for plain text that does not
> contain special paragraph separators, explicit direction indicators, etc. -
> because if all these are present (e.g., if unicode editors start putting them
> in the text, liberally, to make sure the output looks like the person
> intended) there probably isn't a need for such heuristics at all...
>
>
> > many cases.  The final fallback (1) doesn't really matter here, one
> > LRM per document is easy enough...
>
> :)
> I wanted bidiv to do something deterministic on a file that contains only,
> say, a bunch of numbers. Before I remembered this case, bidiv actually
> defaulted to Hebrew in this case, which made the numbers right-aligned,
> which was pretty weird.
>
That would include ascii-only files, so they should naturally default
LTR.  Especially if you plan to deploy (oh I hate this word ;)
some application with bidi support enabled by default for all the
world, it's very important not to surprise people who never saw RTL in
their life...
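
The LTR fallback discussed above can be sketched roughly as follows.  This
is only an illustration, not bidiv's actual code: the function name
``guess_base_direction`` is made up here, and it assumes the input has
already been decoded to Unicode (so ISO-8859-8 or ASCII bytes would be
decoded first).  It looks for the first strong character and falls back to
LTR when there is none, as in a file holding only numbers.

```python
import unicodedata


def guess_base_direction(text, default="LTR"):
    """Return "LTR" or "RTL" based on the first strong character.

    Hypothetical helper for illustration; not taken from bidiv.
    Digits, spaces and punctuation are not strong, so text made up
    only of those falls through to the default.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "LTR"
        if bidi_class in ("R", "AL"):
            return "RTL"
    # No strong character at all (e.g. a bunch of numbers): use the default.
    return default
```

With this sketch, a numbers-only or ASCII-only file comes out LTR, while a
line that starts with digits and is followed by Hebrew letters still comes
out RTL.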

> Bidiv is supposed to work on plain ISO-8859-8 text files, and even ASCII,
> so demanding LRM (which only exists in unicode) was out of the question.
>
8859 die die die, UTF-8 rruulleess :-)  [Random insight: to
                                         right-align a smiley, use the
                                         Hebrew Makaf instead of the
                                         minus.]

> > There can also be classification of algorithms for discerning
> > paragraph boundaries.  Besides the basic question of how to treat \n
> > line breaks, there are numerous subtleties, mostly dealing with
> > indentation.  I need to sleep so I won't devise a classification of
> > them now, do ``info fmt`` and ``M-x describe-variable
> > paragraph-start`` :-)...
>
> Yes, I guess bidiv can be made a little smarter on this, rather than
> treating \n\n as the only paragraph-break marker.
>
> But the problem is, the smarter you make your heuristics, the less chance
> there is that they will do what the author of the text file expected, assuming
> that that author does not use the same program as you to view the file
> he created.
>
Indeed.  The point is that there really *is* such a thing as "unicode
plain-text in logical order"; at least among unix-minded users it's a
very natural idea to just emit/write mixed text in logical order
without caring for the bidi at all - just assume the reader has a
direct nerve transporting logical-order text into his brain.  True,
now you need an AI to render it perfectly.  So what?  You always
needed an AI (Emacs) to edit and highlight it perfectly anyway.

My idea was to keep this AI outside bidiv and most utilities.  At most
they would have a toggle between \n and \n\n, if at all.  For anything
more complicated it would expect the input to contain unambiguous
unicode PS and LS marks.  So you would do ``cat file | para-ai --email
--quote-regexp ... --some-more-tuning | bidiv``, if these marks are
not there yet.
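
The simplest stage of such a pre-filter might look like the sketch below.
``para-ai`` does not exist; the function name ``mark_paragraphs`` and the
``blank_line_mode`` toggle are invented here to illustrate the \n versus
\n\n choice.  It assumes the text is already Unicode and emits U+2029 (PS)
at paragraph breaks and U+2028 (LS) at the line breaks left inside a
paragraph.

```python
import re

PS = "\u2029"  # Unicode PARAGRAPH SEPARATOR
LS = "\u2028"  # Unicode LINE SEPARATOR


def mark_paragraphs(text, blank_line_mode=True):
    """Replace newline-based breaks with explicit PS / LS marks.

    Hypothetical helper for illustration only.
    blank_line_mode=True:  a run of two or more newlines ends a
                           paragraph; a single newline becomes LS.
    blank_line_mode=False: every newline ends a paragraph.
    """
    body = text.strip("\n")
    if blank_line_mode:
        paragraphs = re.split(r"\n{2,}", body)
        return PS.join(p.replace("\n", LS) for p in paragraphs)
    return PS.join(body.split("\n"))
```

Anything cleverer (indentation, email quoting) would live in the same
filter, so that the renderer downstream only ever has to honour PS and LS.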

> One thing that does annoy me (but I still haven't done anything
> about) is emails containing quotes like:
>
> 	In January 1, 2003, John Doe wrote:
>         > SOME HEBREW STUFF
>         > MORE HEBREW STUFF
>
> Where the attribution string is in one language, but the actual
> quoted text is in a different language. One can argue (and I do!)
> that it is wrong of a person to add an attribution string with a
> different language from the rest of the email, but many people still
> do.

Otherwise, they would need to add different attributions for parts of
an email by the same person.  That makes little sense (?).  And after
all, a person doesn't add attributions, his MUA does.

> A possible solution for a more email-tuned bidiv could be to
> declare a point of change in the number of ">" marks in the
> beginning of a line as a paragraph break. Obviously, this extra
> heuristic will look very arbitrary and non-general...
>
That's obviously the right thing for email.  Perhaps one day LS / PS
will be commonplace in email...  For now, the minimal work approach
seems to be a single (yeah, right ;) utility for guessing the
paragraph boundaries.
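
A rough sketch of the quote-depth rule, to make it concrete.  The names
``quote_depth`` and ``split_paragraphs`` are invented for this
illustration, and the rule for recognising a quote prefix (optional
leading whitespace, then ">" marks each optionally followed by one space)
is an assumption; real mail has messier prefixes.

```python
import re

# Optional indentation, then zero or more ">" marks, each optionally
# followed by a single whitespace character.
QUOTE_PREFIX = re.compile(r"^\s*((?:>\s?)*)")


def quote_depth(line):
    """Count the ">" marks at the start of a line."""
    return QUOTE_PREFIX.match(line).group(1).count(">")


def split_paragraphs(lines):
    """Group lines into paragraphs.

    A paragraph ends at a blank line (blank once the quote prefix is
    removed) or wherever the quote depth changes.
    """
    paragraphs, current, prev_depth = [], [], None
    for line in lines:
        depth = quote_depth(line)
        is_blank = QUOTE_PREFIX.sub("", line, count=1).strip() == ""
        if current and (is_blank or depth != prev_depth):
            paragraphs.append(current)
            current = []
        if not is_blank:
            current.append(line)
            prev_depth = depth
    if current:
        paragraphs.append(current)
    return paragraphs
```

On the attribution example quoted earlier, the attribution line (depth 0)
and the two quoted lines (depth 1) land in separate paragraphs, so each
can get its own base direction.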

-- 
Beni Cherniavsky <cben at tx.technion.ac.il>,
whose 12x CD burner works at 24x with cdrecord in linux - sheer magic!



