[Bidi] Is "The Problem Of Not Having Arabic RLM" a real problem?
Beni Cherniavsky
cben at users.sf.net
Sat Aug 7 14:21:39 PDT 2004
One thing that the Unicode Bidi Algorithm (let's call it UBA, OK?) got
provably illogical is the handling of numbers as European vs. Arabic at
the start of a line. The handling of number separators and terminators
differs for the two kinds of numbers. Assuming this handling is correct for each
kind, it is desirable to have ASCII numbers correctly categorised as
European/Arabic depending on context (resorting to Arabic Unicode digits
is not convenient enough I guess). The UBA does this by rule W2__:
W2. Search backwards from each instance of a European number until
the first strong type (R, L, AL, or `sor`) is found. If an AL
is found, change the type of the European number to Arabic
number.
__ http://www.unicode.org/reports/tr9/#W2
The problem with that is that numbers at the start of a paragraph are
always treated as European and *there is no way to override it*. `sor`
can never be AL and there is no "Arabic RLM" (an invisible char with AL
category). The sub-optimal guessing is pardonable (though regrettable)
but lacking a way to override it is clearly wrong! (Actually you can
override with LRMs around the number itself. But that's a round-about
way to do it. It doesn't allow a higher-level knowledge that the text
is Arabic to be expressed by stuffing ARLMs at starts of paragraphs and
RTL embeddings.)
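
To make the failure concrete, here is a minimal Python sketch of rule W2
alone. It is only an illustration, not a UBA implementation: the input is a
list of bidi class names rather than real text, the other W rules are
ignored, and the name `apply_w2` is made up for this example.

```python
# Sketch of rule W2 only. Input is a list of bidi class names, not text.
STRONG = {"L", "R", "AL"}

def apply_w2(types, sor="L"):
    """Return a copy of `types` with EN changed to AN where W2 says so."""
    out = list(types)
    last_strong = sor  # in the current UBA, sor can only be "L" or "R"
    for i, t in enumerate(out):
        if t in STRONG:
            last_strong = t
        elif t == "EN" and last_strong == "AL":
            # The nearest preceding strong type is AL: treat as Arabic number.
            out[i] = "AN"
    return out

# A number after Arabic text becomes an Arabic number...
after_arabic = apply_w2(["AL", "WS", "EN"], sor="R")
# ...but a number at the very start only sees sor, which is never AL.
at_start = apply_w2(["EN", "WS", "AL"], sor="R")
```

With this sketch, `after_arabic` comes out as `["AL", "WS", "AN"]` while
`at_start` stays `["EN", "WS", "AL"]`, even though the paragraph is Arabic.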
I learnt about this problem from http://www.yudit.org/bidi/surprise.html
which contains an example.
Now my question to Arabic-speaking people (for me this problem is purely
theoretical): Is this a problem in real life? Would you like it fixed?
The way to fix the UBA is quite obvious I think:
1. Add an Arabic RLM ("ARLM"?) char to Unicode. This is enough to make
it overridable.
2. Allow `sor`/`eor` to take an AL value in addition to the current
R and L. This value could appear in two ways:
A. The guessing of paragraph direction by first strong character
would assign AL to `sor` if the first such character is AL. This
is important because it would fix 99% of the problem implicitly.
B. Add an ARLE embedding code that is like RLE but sets `sor` to AL.
(As I argued a couple of weeks ago, it's better for backward
compatibility if it also comes with a new terminating code instead
of reusing PDF). This is probably overkill, as an RLE followed
by ARLM achieves the same effect. (I think the same of the
RLE/LRE distinction but that's another story ;-).
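
A rough Python sketch of how points 1 and 2.A might behave, under the same
simplification as the rule itself (bidi class names instead of real text,
W2 only). Everything here is hypothetical: `proposed_sor` and
`apply_w2_with_sor` are names invented for the illustration, and ARLM is
modelled simply as one more entry of class AL, since no such character
exists.

```python
# Sketch of the proposed fix. Input is a list of bidi class names, not text.
STRONG = {"L", "R", "AL"}

def proposed_sor(types):
    """Proposal 2.A: take sor from the first strong type, keeping AL distinct."""
    for t in types:
        if t in STRONG:
            return t
    return "L"  # assumption: fall back to L when there is no strong type

def apply_w2_with_sor(types, sor):
    """W2 as written in UAX #9, except that sor may now also be "AL"."""
    out = list(types)
    last_strong = sor
    for i, t in enumerate(out):
        if t in STRONG:
            last_strong = t
        elif t == "EN" and last_strong == "AL":
            out[i] = "AN"
    return out

# 2.A: an Arabic paragraph starting with a number is fixed implicitly.
arabic_para = ["EN", "WS", "AL"]
fixed = apply_w2_with_sor(arabic_para, proposed_sor(arabic_para))

# A Hebrew paragraph is unaffected: its first strong type is R, not AL.
hebrew_para = ["EN", "WS", "R"]
unchanged = apply_w2_with_sor(hebrew_para, proposed_sor(hebrew_para))

# Point 1: a hypothetical ARLM (class AL) before the number overrides
# explicitly, even with the current sor of "R".
with_arlm = apply_w2_with_sor(["AL", "EN", "WS", "R"], "R")
```

In this sketch `fixed` is `["AN", "WS", "AL"]`, `unchanged` stays
`["EN", "WS", "R"]`, and `with_arlm` is `["AL", "AN", "WS", "R"]`.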
If you think this is worth the effort, I leave it to you to lobby for
fixing the UBA. I'm asking because if it is a good idea, I will fix it
from the beginning in the "Hierarchically Implicit Bidi" scheme I'm
designing (I can use approach 2.A to fix it implicitly at all levels).
--
Beni Cherniavsky <cben at users.sf.net>
Note: I can only read email on week-ends...