[Fribidi-discuss] fribidi and arabic joining

Mon Mar 25 18:58:02 EST 2002

[I've put bidi at unicode.org in CC; please remove it when
replying if it's a fribidi-specific reply.]

Hi all,

Sorry for being silent, and thanks for raising the discussion.
Both Roozbeh and I are on vacation, so neither of us replied
earlier.

A. First, I should explain what I mean by Arabic joining, and why
we need it.  You can skip this part if you are not interested in
Arabic-specific matters.  As you know, fribidi is a lightweight
portable library implementing the Unicode BiDi Algorithm.  The
command line tool is much more useful to the Hebrew community
than to the Arab (and Iranian) ones; the reason is what the
Unicode Standard calls the Arabic Joining Algorithm.  The Arabic
Joining Algorithm determines which of the glyphs of each Arabic
letter should be used, depending on the surrounding letters.  The
behaviour of this algorithm is well-defined in the Unicode
Standard, but its interaction with the BiDi Algorithm is not.
The following argument shows that the Arabic Joining Algorithm
cannot run after the BiDi Algorithm, at least when using
non-fixed-width fonts:

	* The BiDi algorithm has two parts:

		1. Determining the embedding levels.
		2. Reordering.

	* UAX#9 says that the first part should be applied to
each paragraph, but the second part to each line of text, and
asks the higher-level protocol to break the lines between these
two parts (I do not have the spec here, but look for it just
before rule L1).

	* Line breaking should then be applied to the logical
text (not the visual one), that is, before reordering; but to
break the lines you need to know the final glyphs, as the various
glyphs of an Arabic letter differ a lot in width.

	* To determine the final glyph for an arabic letter, you 
need the Arabic Joining Algorithm.

	* Therefore the Arabic Joining Algorithm must be applied
before the Reordering part of the BiDi Algorithm.

	* The other reason that the Arabic Joining Algorithm
cannot be applied after the BiDi Algorithm is that the behaviour
of the BiDi Algorithm is not well-defined on BN characters like
Zero Width Joiner U+200D and Zero Width Non-Joiner U+200C, which
have an important role in the Arabic Joining Algorithm, so two
different implementations may lead to different final glyphs.
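
To make the dependency above concrete, here is a minimal Python
sketch (not fribidi code).  It picks a joining form for each
letter of a purely right-to-left string and then sums per-form
widths.  The joining-type table covers only a few characters,
transparent characters (combining marks) are ignored, the widths
are made-up numbers rather than measurements from any font, and
the names joining_forms and text_width are hypothetical.

```python
# Joining types for a handful of characters (a small subset of the
# Unicode data): D = dual-joining, R = right-joining,
# C = join-causing, U = non-joining.
JOINING_TYPE = {
    "\u0627": "R",  # ALEF
    "\u0628": "D",  # BEH
    "\u062F": "R",  # DAL
    "\u0644": "D",  # LAM
    "\u0645": "D",  # MEEM
    "\u200D": "C",  # ZERO WIDTH JOINER
    "\u200C": "U",  # ZERO WIDTH NON-JOINER
}

# Hypothetical advance widths per form, for illustration only.
HYPOTHETICAL_WIDTH = {"isolated": 10, "initial": 5, "medial": 5, "final": 9}


def joining_forms(text):
    """Return one form name per character (None if it takes no form).

    Assumes plain right-to-left text with no overrides, so the previous
    logical character sits on the right and the next one on the left.
    """
    forms = []
    for i, ch in enumerate(text):
        jt = JOINING_TYPE.get(ch, "U")
        if jt not in ("D", "R"):
            forms.append(None)
            continue
        prev_t = JOINING_TYPE.get(text[i - 1], "U") if i > 0 else "U"
        next_t = JOINING_TYPE.get(text[i + 1], "U") if i + 1 < len(text) else "U"
        # A letter joins to its right if the previous letter can join leftwards.
        joins_right = prev_t in ("D", "C")
        # Only dual-joining letters can join to the letter on their left.
        joins_left = jt == "D" and next_t in ("D", "R", "C")
        if joins_right and joins_left:
            forms.append("medial")
        elif joins_right:
            forms.append("final")
        elif joins_left:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms


def text_width(text):
    """Sum the hypothetical widths of the shaped letters."""
    return sum(HYPOTHETICAL_WIDTH[f] for f in joining_forms(text) if f)
```

For BEH BEH this yields an initial and a final form, while putting
a ZWNJ between the two letters yields two isolated forms with a
different total width, so a line breaker that measures unshaped
text would measure the wrong thing.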

You may think that the Arabic Joining Algorithm can be applied
before the BiDi Algorithm.  But things are not so easy: the
Arabic Joining Algorithm itself needs the "Left" and "Right"
neighbours of a character, and Left and Right are defined in the
visual text, not the logical one.  Because of the override marks
(LRO and RLO), the left and right characters cannot be found
easily from the next and previous characters in logical order.
So to run the Arabic Joining Algorithm you need the visual
ordering, which is determined by the BiDi Algorithm!  Roozbeh and
I are preparing a proposal for the UTC discussing the interaction
of the two algorithms (in fact the idea of adding joining to
fribidi first came from here, as a way to test my own ideas).
But for now, two methods can be used to break this circular
dependency:

	* Arabic Joining and Line Breaking independence:  If we
can prove that the Arabic Joining Algorithm is independent of
the Line Breaking Algorithm, then do this:

		1. Reorder the text without line breaking,
which gives us the visual order.

		2. Run the Arabic Joining Algorithm to find the
final glyph of each character.

		3. With the final glyphs in hand, break the
lines and reorder the text correctly.

	* The other idea is to extract the meaning of the Left
and Right character somehow before reordering the text, using
just the embedding levels found in the first part of the BiDi
Algorithm, or some other data that can be extracted from the
bidi marks.  After playing with counter-examples for various
cases, I found this algorithm (read: no counter-examples found
yet).  It uses the fact that all the letters that are subject to
change under the Arabic Joining Algorithm are right-to-left
letters under the BiDi Algorithm:

		1.  Reverse the text between each LRO and its 
corresponding PDF.  Then reverse the text in each explicit 
embedding or override in this text again.  Call this ordering of 
text the RtLFriendly order.

		Example:
			<LRO> a b C D <RLE> f g H <PDF> x Y z <PDF>
		=>	<LRO> z Y x <RLE> f g H <PDF> D C b a <PDF>
		also	<LRO> a b <RLO> f g <PDF> h <LRE> x y <PDF> Z <PDF>
		=>	<LRO> Z <LRE> x y <PDF> h <RLO> f g <PDF> b a <PDF>

		2.  Now apply the Arabic Joining Alg. on the 
RtLFriendly order with next as Left and previous as Right 
character, and find the final glyphs.

		3. With final glyphs, find the embedding levels, 
break the lines and reorder the text.
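
Step 1 of the second method can be sketched in Python as follows.
This is only one reading of the description, checked against
nothing more than the two examples above.  The assumptions are:
tokens are separated by whitespace; inside an LRO run the direct
children are reversed, while each nested embedding or override run
is moved as a unit with its inner order unchanged; text outside
any LRO is left alone; an LRO nested inside another LRO is also
left alone, which the description does not settle.  The function
names are hypothetical.

```python
OPENERS = {"<LRE>", "<RLE>", "<LRO>", "<RLO>"}
PDF = "<PDF>"


def parse(tokens):
    """Build a tree: each explicit run becomes an (opener, children) tuple."""
    root = []
    stack = [root]
    for tok in tokens:
        if tok in OPENERS:
            node = (tok, [])
            stack[-1].append(node)
            stack.append(node[1])
        elif tok == PDF:
            if len(stack) > 1:  # a PDF with no matching opener is dropped
                stack.pop()
        else:
            stack[-1].append(tok)
    return root


def transform(items):
    """Reverse the direct children of each outermost LRO run."""
    out = []
    for item in items:
        if isinstance(item, tuple):
            opener, children = item
            if opener == "<LRO>":
                # Nested runs stay intact because they are single tree nodes.
                item = (opener, children[::-1])
            else:
                item = (opener, transform(children))
        out.append(item)
    return out


def flatten(items):
    """Turn the tree back into a flat token list, closing every run."""
    out = []
    for item in items:
        if isinstance(item, tuple):
            opener, children = item
            out.append(opener)
            out.extend(flatten(children))
            out.append(PDF)
        else:
            out.append(item)
    return out


def rtl_friendly(text):
    """Return the RtLFriendly order of a whitespace-separated token string."""
    return " ".join(flatten(transform(parse(text.split()))))
```

Feeding the two example inputs above to rtl_friendly reproduces
the two outputs shown there.  Step 2 would then walk the result
with next as Left and previous as Right.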

The first idea needs some work to prove the independence (which
may not hold).  The second one is a bit more complex, but it
seems to produce the desired result.  I will provide test cases
for the various situations in another mail.

[End of BiDi vs. Arabic Joining interaction material; the rest is
fribidi-related.]

B. Our implementation of the Arabic Joining Algorithm is quite
small and light, so it will not harm the objectives of fribidi at
all, but it makes both the command line tool (which can be used
to cat right-to-left files) and the library much more useful.
Many applications that use fribidi do not support Arabic joining,
either because no lightweight implementation of it is available
or because the author just wanted it to work for Hebrew.  With
Arabic joining in fribidi, the developer can simply turn it on
to make the application work well for Arabic too.

C. Pango is not a real solution for fribidi's audience:  fribidi
has been ported to some mobile devices, and it has also been used
on the Linux console and in xterm, where using Pango for Arabic
joining is not a good idea.  fribidi is mostly used for the
Hebrew and Arabic scripts, whose rendering is complete once the
Arabic Joining Algorithm is applied, so we need not worry about
other shaping matters; when shaping of all Unicode characters is
needed, the fribidi feature can be turned off.

D.  Using the Unicode Arabic Presentation Forms is also essential
on the Linux console, as the kernel maps Unicode code points to
glyphs.  For other scripts like Syriac, which do not have
presentation forms in Unicode, the presentation forms should be
registered in the private use area of Unicode (H. Peter Anvin is
responsible for registering them in Linux) so that they can be
shown on the Linux console.
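
As an illustration of what mapping to presentation forms means,
the following Python sketch finds the presentation-form character
for a letter and a joining form by building its Unicode character
name.  It relies only on the standard unicodedata module; the
function name presentation_form is made up for this example and
is not part of fribidi or the kernel.

```python
import unicodedata


def presentation_form(ch, form):
    """Return the presentation-form character for `ch`, or `ch` itself.

    `form` is one of "isolated", "final", "initial", "medial".  The
    lookup builds a name such as "ARABIC LETTER BEH INITIAL FORM";
    characters with no such named form are returned unchanged.
    """
    name = unicodedata.name(ch, "")
    try:
        return unicodedata.lookup(f"{name} {form.upper()} FORM")
    except KeyError:
        return ch
```

With this, BEH (U+0628) in initial form maps to U+FE91, while a
Syriac letter such as U+0712 comes back unchanged because Unicode
defines no presentation forms for Syriac.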

E.  About the overhead this adds to fribidi:  I believe the
Hebrew community will not be too happy about it, but:

	1.  It can be fully turned off with a configure-time
option.
	2.  When compiled with Arabic Joining, it is off by
default; the developer should turn it on if needed.
	3.  I will try to put it in a separate binary to save
resources.

I hope the discussion above gives all of you enough reasons to
put it in fribidi.

Yours,
-- Behdad Esfahbod			 6 Farvardin 1381, 2002 Mar 26
<behdad at bamdad dot org>		[Finger for Geek Code]