[HarfBuzz] The canonical ordering of hamza marks

Behdad Esfahbod behdad at behdad.org
Fri Oct 18 08:15:13 PDT 2013


+roozbeh

On 13-10-18 04:52 PM, Khaled Hosny wrote:
> On Thu, Oct 17, 2013 at 10:05:20PM +0200, Behdad Esfahbod wrote:
>> Khaled,
>>
>> Here's what Roozbeh prepared:
>>
>> ===========================
>> Behdad,
>>
>> I did a very thorough search of both the Koran and the Unicode proposals for
>> the new Arabic characters for the last fifteen years or so.
>>
>> I could actually come up with a very simple algorithm:
>>
>> First, convert the input sequence to NFD.
>>
>> The order of the characters will be a bit messed up after this due to bad old
>> decisions in Unicode, and our goal is to make it clean. After this step, we
>> will have the traditional marks (ccc not in [220, 230]) at the very beginning,
>> with the newly encoded ones (ccc in [220, 230]) after them.
>>
>> Definition: MCM, defined here, is the modifier combining marks, which actually
>> modify a base letter (and also have ccc=220 or 230). That means that
>> traditional harakat come after them in logical order, but before them in NFD.
>> Here is the MCM set:
>>
>> 0654 ARABIC HAMZA ABOVE
>> 0655 ARABIC HAMZA BELOW
>> 0658 ARABIC MARK NOON GHUNNA
>> 06DC ARABIC SMALL HIGH SEEN
>> 06E3 ARABIC SMALL LOW SEEN
>> 06E7 ARABIC SMALL HIGH YEH
>> 06E8 ARABIC SMALL HIGH NOON
>> 08F3 ARABIC SMALL HIGH WAW
> 
> U+0653 ARABIC MADDAH ABOVE should be added to this list, see below.
> 
>> The following is the order in which the combining marks after each base
>> letter should be read for them to be in logical order (it could be used for
>> both determining rendering order and backspacing):
>>
>> 1. The longest "consecutive" sequence of characters "at the beginning" of the
>> ccc=220 part of the list that are in MCM;
>> 2. The longest "consecutive" sequence of characters "at the beginning" of the
>> ccc=230 part of the list that are in MCM;
>> 3. All the characters in the ccc=33 (shadda) part of the list;
>> 4. All the rest of the characters (in NFD order).
>>
>> Very obscure test data, just to demonstrate the algorithm:
>>
>> src: 0618 0619 064E 064F 0654 0658 0653 0654 0651 0656 0651 065C 0655 0650
>> ccc:   30   31   30   31  230  230  230  230   33  220   33  220  220   32
>> MCM:                      Yes  Yes       Yes                      Yes
>>
>> out: 0654 0658 0651 0651 0618 064E 0619 064F 0650 0656 065C 0655 0653 0654
>> ccc:  230  230   33   33   30   30   31   31   32  220  220  220  230  230
>> MCM:  Yes  Yes                                               Yes       Yes
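A minimal Python sketch of the four steps above, checked against the test data. This is not HarfBuzz code. It assumes the input is the run of combining marks that follows a single base letter, uses the standard unicodedata module for NFD and combining classes, and takes the MCM set exactly as listed above (without U+0653). The names MCM, leading_mcm, reorder_marks and hexes are made up for illustration.

```python
import unicodedata

# MCM set as listed in the message above. U+0653 is left out here;
# whether it belongs in the set is an open question in this thread.
MCM = frozenset("\u0654\u0655\u0658\u06DC\u06E3\u06E7\u06E8\u08F3")


def leading_mcm(nfd, ccc):
    """Indices of the longest run of MCM marks at the start of one ccc block."""
    run = []
    for i, ch in enumerate(nfd):
        if unicodedata.combining(ch) != ccc:
            continue
        if ch not in MCM:
            break  # the first non-MCM mark in the block ends the run
        run.append(i)
    return run


def reorder_marks(marks):
    """Reorder the combining marks of one base letter per steps 1 to 4."""
    nfd = unicodedata.normalize("NFD", marks)
    # Steps 1 and 2: leading MCM marks of the ccc=220 block, then ccc=230.
    first = leading_mcm(nfd, 220) + leading_mcm(nfd, 230)
    # Step 3: every shadda-class mark (ccc=33).
    shadda = [i for i, ch in enumerate(nfd) if unicodedata.combining(ch) == 33]
    # Step 4: everything else, kept in NFD order.
    taken = set(first) | set(shadda)
    rest = [i for i in range(len(nfd)) if i not in taken]
    return "".join(nfd[i] for i in first + shadda + rest)


def hexes(s):
    """Format a string as space-separated four-digit code points."""
    return " ".join(f"{ord(c):04X}" for c in s)


src = "0618 0619 064E 064F 0654 0658 0653 0654 0651 0656 0651 065C 0655 0650"
marks = "".join(chr(int(h, 16)) for h in src.split())
print(hexes(reorder_marks(marks)))
# 0654 0658 0651 0651 0618 064E 0619 064F 0650 0656 065C 0655 0653 0654
```

Because the function normalizes to NFD first, any canonically equivalent ordering of the same marks gives the same result.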
> 
> I think the order of Hamza below is not right; I'd expect it to come at
> least before other below marks, regardless of whether there are other
> MCM marks in the sequence or not.
> 
>> Note that the algorithm guarantees canonical equivalence of the output and
>> input, and also guarantees the same result for all canonically equivalent strings.
>>
>> Also note that you cannot replace NFD with NFC in the algorithm, because of
>> Alef Madda Above: 0622=0627 0653. The result of the algorithm for <Alef Madda,
>> Superscript Alef> should be <Alef, Superscript Alef, Madda Above> (Superscript
>> Alef should always come before Madda above, the sequence <Fatha, Alef Maksura,
>> Madda> is quite common in the Koran). If not for the exception of Alef Madda
>> above, an NFC version of the algorithm would work fine and in the same way.
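The NFC point can be checked directly with Python's standard unicodedata module. This is only an illustration of the normalization behaviour described above; the constant and helper names are made up here.

```python
import unicodedata

ALEF = "\u0627"
ALEF_MADDA = "\u0622"        # precomposed ALEF WITH MADDA ABOVE
SUPERSCRIPT_ALEF = "\u0670"  # ccc=35
MADDAH_ABOVE = "\u0653"      # ccc=230


def hexes(s):
    """Format a string as space-separated four-digit code points."""
    return " ".join(f"{ord(c):04X}" for c in s)


# NFD splits Alef Madda and sorts Superscript Alef (35) before Madda (230).
nfd = unicodedata.normalize("NFD", ALEF_MADDA + SUPERSCRIPT_ALEF)
print(hexes(nfd))  # 0627 0670 0653

# NFC recomposes Alef + Madda across the Superscript Alef, so the
# desired order <Alef, Superscript Alef, Madda Above> does not survive.
nfc = unicodedata.normalize("NFC", nfd)
print(hexes(nfc))  # 0622 0670
```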
> 
> I disagree here; 0653 is actually a special form of Hamza and should be
> treated like the other MCM marks. The madda used in the Quran serves quite a
> different purpose and has its own code point: U+06E4 ARABIC SMALL HIGH
> MADDA.
> 
>> Roozbeh
>> ===========================
>>
>> We think it's reasonable and will eventually implement something based on it.
>>  Please discuss.
>>
>> behdad
>>
>>
>> On 12-12-18 10:59 AM, Khaled Hosny wrote:
>>> On Tue, Dec 18, 2012 at 12:15:45AM -0500, Behdad Esfahbod wrote:
>>>> On 12-12-18 12:13 AM, Khaled Hosny wrote:
>>>>> As for madda, Jonathan is right; it should indeed follow other marks, I
>>>>> don’t know what I was thinking.
>>>>>
>>>>> Some testing with people working on texts with heavy use of marks,
>>>>> showed that U+065C and U+06EC should precede vowel marks (but still
>>>>> follow the hamza).
>>>>
>>>> Thanks Khaled,
>>>>
>>>> Do you mind compiling a total order for the Arabic marks so I can (blindly) go
>>>> ahead and implement?
>>>
>>> List below separated in groups ordered to the best of my knowledge,
>>> marks in each group should be ordered before following groups. The order
>>> inside each group is not important IMO but I kept them ordered by the
>>> existing combining classes.
>>>
>>> Regards,
>>> Khaled
>>>
>>> (the first field is the existing combining class)
>>>
>>> 220	U+0655	◌ٕ	ARABIC HAMZA BELOW
>>> 220	U+065F	◌ٟ	ARABIC WAVY HAMZA BELOW
>>> 230	U+0654	◌ٔ	ARABIC HAMZA ABOVE
>>>
>>> 220	U+065C	◌ٜ	ARABIC VOWEL SIGN DOT BELOW
>>> 230	U+06EC	◌۬	ARABIC ROUNDED HIGH STOP WITH FILLED CENTRE
>>>
>>> 033	U+0651	◌ّ	ARABIC SHADDA
>>> 230	U+06DF	◌۟	ARABIC SMALL HIGH ROUNDED ZERO
>>> 230	U+06E0	◌۠	ARABIC SMALL HIGH UPRIGHT RECTANGULAR ZERO
>>>
>>> 027	U+064B	◌ً	ARABIC FATHATAN
>>> 027	U+08F0	◌ࣰ	ARABIC OPEN FATHATAN
>>> 028	U+064C	◌ٌ	ARABIC DAMMATAN
>>> 028	U+08F1	◌ࣱ	ARABIC OPEN DAMMATAN
>>> 029	U+064D	◌ٍ	ARABIC KASRATAN
>>> 029	U+08F2	◌ࣲ	ARABIC OPEN KASRATAN
>>> 030	U+0618	◌ؘ	ARABIC SMALL FATHA
>>> 030	U+064E	◌َ	ARABIC FATHA
>>> 031	U+0619	◌ؙ	ARABIC SMALL DAMMA
>>> 031	U+064F	◌ُ	ARABIC DAMMA
>>> 032	U+061A	◌ؚ	ARABIC SMALL KASRA
>>> 032	U+0650	◌ِ	ARABIC KASRA
>>> 034	U+0652	◌ْ	ARABIC SUKUN
>>> 230	U+06E1	◌ۡ	ARABIC SMALL HIGH DOTLESS HEAD OF KHAH
>>> 230	U+0657	◌ٗ	ARABIC INVERTED DAMMA
>>> 230	U+0658	◌٘	ARABIC MARK NOON GHUNNA
>>> 230	U+0659	◌ٙ	ARABIC ZWARAKAY
>>> 230	U+065A	◌ٚ	ARABIC VOWEL SIGN SMALL V ABOVE
>>> 230	U+065B	◌ٛ	ARABIC VOWEL SIGN INVERTED SMALL V ABOVE
>>> 230	U+065D	◌ٝ	ARABIC REVERSED DAMMA
>>> 230	U+065E	◌ٞ	ARABIC FATHA WITH TWO DOTS
>>>
>>> 035	U+0670	◌ٰ	ARABIC LETTER SUPERSCRIPT ALEF
>>> 220	U+0656	◌ٖ	ARABIC SUBSCRIPT ALEF
>>> 220	U+06ED	◌ۭ	ARABIC SMALL LOW MEEM
>>> 230	U+06E2	◌ۢ	ARABIC SMALL HIGH MEEM ISOLATED FORM
>>>
>>> 220	U+06EA	◌۪	ARABIC EMPTY CENTRE LOW STOP
>>> 230	U+06EB	◌۫	ARABIC EMPTY CENTRE HIGH STOP
>>>
>>> 220	U+06E3	◌ۣ	ARABIC SMALL LOW SEEN
>>> 230	U+06E7	◌ۧ	ARABIC SMALL HIGH YEH
>>> 230	U+06E8	◌ۨ	ARABIC SMALL HIGH NOON
>>>
>>> 230	U+0653	◌ٓ	ARABIC MADDAH ABOVE
>>> 230	U+06E4	◌ۤ	ARABIC SMALL HIGH MADDA
>>>
>>> 230	U+0610	◌ؐ	ARABIC SIGN SALLALLAHOU ALAYHE WASSALLAM
>>> 230	U+0611	◌ؑ	ARABIC SIGN ALAYHE ASSALLAM
>>> 230	U+0612	◌ؒ	ARABIC SIGN RAHMATULLAH ALAYHE
>>> 230	U+0613	◌ؓ	ARABIC SIGN RADI ALLAHOU ANHU
>>> 230	U+0614	◌ؔ	ARABIC SIGN TAKHALLUS
>>>
>>> 230	U+0615	◌ؕ	ARABIC SMALL HIGH TAH
>>> 230	U+0616	◌ؖ	ARABIC SMALL HIGH LIGATURE ALEF WITH LAM WITH YEH
>>> 230	U+0617	◌ؗ	ARABIC SMALL HIGH ZAIN
>>> 230	U+06D6	◌ۖ	ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA
>>> 230	U+06D7	◌ۗ	ARABIC SMALL HIGH LIGATURE QAF WITH LAM WITH ALEF MAKSURA
>>> 230	U+06D8	◌ۘ	ARABIC SMALL HIGH MEEM INITIAL FORM
>>> 230	U+06D9	◌ۙ	ARABIC SMALL HIGH LAM ALEF
>>> 230	U+06DA	◌ۚ	ARABIC SMALL HIGH JEEM
>>> 230	U+06DB	◌ۛ	ARABIC SMALL HIGH THREE DOTS
>>> 230	U+06DC	◌ۜ	ARABIC SMALL HIGH SEEN
>>>
>>
>> -- 
>> behdad
>> http://behdad.org/
> 

-- 
behdad
http://behdad.org/


