[Bidi] Is "The Problem Of Not Having Arabic RLM" a real problem?

Beni Cherniavsky cben at users.sf.net
Sat Aug 7 14:21:39 PDT 2004

One thing that the Unicode Bidi Algorithm (let's call it UBA, OK?) got
provably illogical is the handling of numbers as European vs. Arabic at
the start of a paragraph.  The handling of number separators and
terminators differs for the two kinds of numbers.  Assuming this
handling is correct for each kind, it is desirable to have ASCII numbers
correctly categorised as European/Arabic depending on context (resorting
to Arabic Unicode digits is not convenient enough I guess).  The UBA
does this by rule W2__:

     W2. Search backwards from each instance of a European number until
         the first strong type (R, L, AL, or `sor`) is found.  If an AL
         is found, change the type of the European number to Arabic
         number.

     __ http://www.unicode.org/reports/tr9/#W2
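
To make the rule concrete, here is a minimal Python sketch of W2 applied
to a single level run.  It is an illustration only, not a full UBA
implementation: the names `bidi_classes` and `apply_w2` are made up for
this example, and it assumes the run is given as a flat list of bidi
class names with `sor` passed in separately.

```python
import unicodedata

STRONG = ("L", "R", "AL")

def bidi_classes(text):
    """Return the Unicode bidi class name of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]

def apply_w2(types, sor="L"):
    """Apply rule W2 to one level run (a list of bidi class names).

    sor is the start-of-run type; in the current UBA it is only L or R.
    """
    resolved = list(types)
    last_strong = sor              # nearest strong type seen so far
    for i, t in enumerate(types):
        if t in STRONG:
            last_strong = t
        elif t == "EN" and last_strong == "AL":
            resolved[i] = "AN"     # European number becomes Arabic number
    return resolved

# A number after an Arabic letter is reclassified:
print(apply_w2(bidi_classes("\u0627 1")))           # ['AL', 'WS', 'AN']
# A number before any strong character only sees sor and stays European:
print(apply_w2(bidi_classes("1 \u0627"), sor="R"))  # ['EN', 'WS', 'AL']
```

The second call is the case at issue: the backward search from the
number reaches `sor` before any letter, so nothing later in the run can
affect the result.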

The problem is that numbers at the start of a paragraph are always
treated as European and *there is no way to override it*.  `sor` can
never be AL and there is no "Arabic RLM" (an invisible char with AL
category).  The sub-optimal guessing is pardonable (though regrettable)
but lacking a way to override it is clearly wrong!  (Actually you can
override with LRMs around the number itself.  But that's a round-about
way to do it: it doesn't allow higher-level knowledge that the text is
Arabic to be expressed by stuffing ARLMs at the starts of paragraphs
and RTL embeddings.)

I learnt about this problem from http://www.yudit.org/bidi/surprise.html
which contains an example.

Now my question to Arabic-speaking people (for me this problem is purely 
theoretical):  Is this a problem in real life?  Would you like it fixed?

The way to fix the UBA is quite obvious I think:

1. Add an Arabic RLM ("ARLM"?) char to Unicode.  This is enough to make
    it overridable.

2. Allow `sor`/`eor` to take an AL value in addition to the current
    R and L.  This value could arise in two ways:

    A. The guessing of paragraph direction by first strong character
       would assign AL to `sor` if the first such character is AL.  This
       is important because it would fix 99% of the problem implicitly.

    B. Add an ARLE embedding code that is like RLE but sets `sor` to AL.
       (As I argued a couple of weeks ago, it's better for backward
       compatibility if it also comes with a new terminating code instead
       of reusing PDF).  This is probably overkill, as an RLE followed
       by ARLM achieves the same effect.  (I think the same of the
       RLE/LRE distinction but that's another story ;-).
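
The effect of proposals 1 and 2.A can be sketched in Python as follows.
This is a rough illustration under simplifying assumptions, not part of
any standard: the paragraph is treated as a single level run given as a
list of bidi class names, `sor` is taken directly from the first strong
class, and the proposed ARLM is modelled simply as an extra element of
class AL.  The names `first_strong` and `w2_with_al_sor` are invented
for this example.

```python
STRONG = {"L", "R", "AL"}

def first_strong(types):
    """Return the first strong class in the paragraph, or None."""
    return next((t for t in types if t in STRONG), None)

def w2_with_al_sor(types):
    """W2 with proposal 2.A: sor may be AL, guessed from first strong."""
    last_strong = first_strong(types) or "L"   # hypothetical AL-capable sor
    resolved = []
    for t in types:
        if t in STRONG:
            last_strong = t
        resolved.append("AN" if t == "EN" and last_strong == "AL" else t)
    return resolved

# 2.A: a leading number in an Arabic paragraph is fixed implicitly.
print(w2_with_al_sor(["EN", "WS", "AL"]))        # ['AN', 'WS', 'AL']
# Hebrew (R) and Latin (L) paragraphs keep European numbers.
print(w2_with_al_sor(["EN", "WS", "R"]))         # ['EN', 'WS', 'R']
# 1: an ARLM (modelled as a leading "AL") overrides the guess explicitly.
print(w2_with_al_sor(["AL", "EN", "WS", "L"]))   # ['AL', 'AN', 'WS', 'L']
```

Note that the third case needs no change to `sor` at all: with an ARLM
in front of the number, plain W2 already finds an AL when it searches
backwards, which is why proposal 1 alone is enough to make the
behaviour overridable.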

If you think this is worth the effort, I leave it to you to lobby for
fixing the UBA.  I'm asking because if it is a good idea, I will fix it 
from the beginning in the "Hierarchically Implicit Bidi" scheme I'm
designing (I can use approach 2.A to fix it implicitly at all levels).

Beni Cherniavsky <cben at users.sf.net>
Note: I can only read email on week-ends...
