[Fribidi-discuss] Re: BiDi WINE status and fribidi

Sun Aug 25 11:25:02 EST 2002

Behdad Esfahbod wrote:

>On Sun, 25 Aug 2002, Shachar Shemesh wrote:
>
>>Not looking at the source yet, I may be talking utter bullshit here.
>>
>>What I think will be the proper thing to do is to change two locations:
>>1. When classifying the characters, do the surrogate unification, lookup 
>>the combined code point, and then mark both parts of the surrogate as 
>>the same type.
>>2. When reordering, if the level is odd (right to left), and the char is 
>>a surrogate, don't change the order of the pair.
>>
>>As far as I can see, these are the only changes required in order to 
>>support UTF-16. They can be relatively trivially extended to support UTF-8.
>>
>>If anyone who has actually had a look at the code has any 
>>corrections, please share them.
>>
>>            Shachar
>
>You are simplifying things too much.  The two easy steps are the 
>ones you described.  But there are some harder ones too:
>
>	* Rule W5:  States that if there is just *one* char of 
>type ....   Then you should be aware that a surrogate pair is 
>just one character, not two.
>
>	* Rule L3:  NSMs in RTL levels should be reordered to 
>come after their base.  The problem is that both the NSM and the 
>base can be surrogate pairs.
>Headache...
>  
>
So that's why we pay you - to know those things (what do you mean you 
haven't gotten the cheque? I mailed it myself yesterday!)

>So please please don't talk about UTF-8, that's already enough.
>
The voice of reason. Ok, you are, of course, right.

>
>Yours,
>  
>

Ok, let's see.
Since we have accepted my proposal of marking both chars of the 
surrogate with the codepoint's type, only rules that apply to a single 
letter need any special processing at all.
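
To make the first step concrete, here is a minimal sketch of what I mean by marking both halves of a surrogate pair with the combined code point's type. It is not taken from the fribidi source. The names (classify_utf16, combine and so on) are made up for illustration, the text is modelled as a list of 16-bit code units, and Python's unicodedata module stands in for the real character type table (it returns an empty string for unassigned code points, which a real implementation would have to handle).

```python
import unicodedata

# Hypothetical helpers, for illustration only (not fribidi functions).
def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF

def combine(high, low):
    """Unify a surrogate pair into the code point it encodes."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def classify_utf16(units):
    """Return one bidi type per UTF-16 code unit.

    Both halves of a well-formed surrogate pair receive the type of the
    combined code point; every other unit is looked up on its own.
    """
    types = []
    i = 0
    while i < len(units):
        unit = units[i]
        if (is_high_surrogate(unit) and i + 1 < len(units)
                and is_low_surrogate(units[i + 1])):
            bidi_type = unicodedata.bidirectional(chr(combine(unit, units[i + 1])))
            types.extend([bidi_type, bidi_type])  # same type for both halves
            i += 2
        else:
            types.append(unicodedata.bidirectional(chr(unit)))
            i += 1
    return types
```

For example, classify_utf16([0x05D0, 0xD801, 0xDC00]) looks up Hebrew alef on its own and then U+10400 once for the pair, so the result has three entries, with the last two identical.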

Let's review them, then:
Rule W4 - European separator between European numbers. Only the 
separator is affected.
(Rule W5 discusses a sequence of characters of the same type. Are you 
sure it's relevant?).
I have seen no other rules that seem to apply (rule L3 doesn't seem 
related to the rule Behdad quoted, and the rule Behdad quoted seems 
to be covered by the second assumption I originally made. I 
suspect I am misunderstanding something here).
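
The second assumption (do not change the order of a pair when reversing a right-to-left run) could be sketched roughly as follows. Again this is only an illustration under my own assumptions, not fribidi code: reverse_run_utf16 is a made-up name, the run is a list of 16-bit code units, and it only handles the plain reversal. It does nothing about the NSM reordering of rule L3.

```python
# Hypothetical helpers, for illustration only (not fribidi functions).
def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF

def reverse_run_utf16(units):
    """Reverse a run of UTF-16 code units, keeping each surrogate pair
    in its original high-then-low order."""
    out = []
    i = len(units) - 1
    while i >= 0:
        if (is_low_surrogate(units[i]) and i > 0
                and is_high_surrogate(units[i - 1])):
            # Emit the pair as a single unit, high surrogate first.
            out.extend([units[i - 1], units[i]])
            i -= 2
        else:
            out.append(units[i])
            i -= 1
    return out
```

Walking the run backwards and emitting each pair as one unit means the characters change order but the two halves of a pair never do, so the output is still valid UTF-16.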

If we add to that the fact that ALL surrogated characters (i.e. - all 
characters whose code point is higher than 0xFFFF) are L (table at 
http://www.unicode.org/unicode/reports/tr9/#Bidirectional_Character_Types), 
I don't think my original suggestion of a change needs amendment 
(barring the warning at the bottom: "Unassigned characters are given 
strong types in the algorithm. This is an explicit exception to the 
general Unicode conformance requirements with respect to unassigned 
characters. As characters become assigned in the future, these 
bidirectional types may change.").

Behdad, I think I'm missing something here. I was using version 10 of 
the 3.2 standard 
(http://www.unicode.org/unicode/reports/tr9/tr9-10.html). The rule 
numbers seem a bit wrong, and the quotes you give do not appear at all.

                    Shachar