[HarfBuzz] The canonical ordering of hamza marks

Behdad Esfahbod behdad at behdad.org
Thu Oct 17 13:05:20 PDT 2013


Khaled,

Here's what Roozbeh prepared:

===========================
Behdad,

I did a very thorough search of both the Koran and the Unicode proposals for
the new Arabic characters for the last fifteen years or so.

I could actually come up with a very simple algorithm:

First, convert the input sequence to NFD.

The order of the characters will be a bit messed up after this due to bad old
decisions in Unicode, and our goal is to make it clean. After this step, we
will have the traditional marks (ccc not in [220, 230]) at the very beginning,
with the newly encoded ones (ccc in [220, 230]) after them.
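
As a small illustration of that reordering, here is a sketch using Python's standard unicodedata module. The base letter (Beh, U+0628) and the choice of marks are assumptions made only for the example; they are not taken from the message.

```python
import unicodedata

# Illustrative input (an assumption, not from the message): Beh followed by
# Hamza Above (ccc=230) and then Fatha (ccc=30), i.e. the logical order.
text = "\u0628\u0654\u064E"

# NFD applies canonical ordering, which sorts each run of marks by ccc.
nfd = unicodedata.normalize("NFD", text)

# Show each code point with its canonical combining class.
pairs = [(f"{ord(c):04X}", unicodedata.combining(c)) for c in nfd]
print(pairs)
```

After NFD the Fatha (ccc=30) sorts before the Hamza Above (ccc=230), which is the opposite of the logical order.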

Definition: MCM, the modifier combining marks, are the marks that actually
modify a base letter (and that also have ccc=220 or 230). That means that
traditional harakat come after them in logical order, but before them in NFD.
Here is the MCM set:

0654 ARABIC HAMZA ABOVE
0655 ARABIC HAMZA BELOW
0658 ARABIC MARK NOON GHUNNA
06DC ARABIC SMALL HIGH SEEN
06E3 ARABIC SMALL LOW SEEN
06E7 ARABIC SMALL HIGH YEH
06E8 ARABIC SMALL HIGH NOON
08F3 ARABIC SMALL HIGH WAW

The following is the order in which the combining marks after each base letter
should be read for them to be in logical order (it could be used both for
determining rendering order and for backspacing):

1. The longest "consecutive" sequence of characters "at the beginning" of the
ccc=220 part of the list that are in MCM;
2. The longest "consecutive" sequence of characters "at the beginning" of the
ccc=230 part of the list that are in MCM;
3. All the characters in the ccc=33 (shadda) part of the list;
4. All the rest of the characters (in NFD order).

Very obscure test data, just to demonstrate the algorithm:

src: 0618 0619 064E 064F 0654 0658 0653 0654 0651 0656 0651 065C 0655 0650
ccc:   30   31   30   31  230  230  230  230   33  220   33  220  220   32
MCM:                      Yes  Yes       Yes                      Yes

out: 0654 0658 0651 0651 0618 064E 0619 064F 0650 0656 065C 0655 0653 0654
ccc:  230  230   33   33   30   30   31   31   32  220  220  220  230  230
MCM:  Yes  Yes                                               Yes       Yes
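
The four steps and the test data above can be sketched in Python as follows. This is only a minimal sketch, not the HarfBuzz implementation: the function and helper names (reorder_marks, from_hex, to_hex) are hypothetical, and the input is assumed to be just the run of combining marks that follows a single base letter.

```python
import unicodedata

# The MCM set exactly as listed in the message.
MCM = {0x0654, 0x0655, 0x0658, 0x06DC, 0x06E3, 0x06E7, 0x06E8, 0x08F3}


def reorder_marks(marks: str) -> str:
    """Reorder the marks following one base letter (hypothetical helper)."""
    nfd = unicodedata.normalize("NFD", marks)
    picked = []  # indices into nfd, in output order

    # Steps 1 and 2: the longest run of MCM marks at the start of the
    # ccc=220 part, then at the start of the ccc=230 part.
    for ccc in (220, 230):
        for i, ch in enumerate(nfd):
            if unicodedata.combining(ch) != ccc:
                continue
            if ord(ch) not in MCM:
                break  # the run ends at the first non-MCM mark of this class
            picked.append(i)

    # Step 3: everything in the ccc=33 (shadda) part.
    picked += [i for i, ch in enumerate(nfd) if unicodedata.combining(ch) == 33]

    # Step 4: all remaining marks, in NFD order.
    chosen = set(picked)
    picked += [i for i in range(len(nfd)) if i not in chosen]
    return "".join(nfd[i] for i in picked)


def from_hex(s: str) -> str:
    return "".join(chr(int(h, 16)) for h in s.split())


def to_hex(s: str) -> str:
    return " ".join(f"{ord(c):04X}" for c in s)


SRC = "0618 0619 064E 064F 0654 0658 0653 0654 0651 0656 0651 065C 0655 0650"
OUT = "0654 0658 0651 0651 0618 064E 0619 064F 0650 0656 065C 0655 0653 0654"

# Should print the same sequence as the "out:" row of the test data.
print(to_hex(reorder_marks(from_hex(SRC))))
```

Steps 1 and 2 only ever take a prefix of a same-ccc run, and step 4 keeps NFD order, so marks with equal ccc never change their relative order in this sketch.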

Note that the algorithm guarantees canonical equivalence of the output and
input, and also guarantees the same result for all canonically equivalent strings.

Also note that you cannot replace NFD with NFC in the algorithm, because of
Alef Madda Above: 0622=0627 0653. The result of the algorithm for <Alef Madda,
Superscript Alef> should be <Alef, Superscript Alef, Madda Above> (Superscript
Alef should always come before Madda Above; the sequence <Fatha, Alef Maksura,
Madda> is quite common in the Koran). If not for the exception of Alef Madda
Above, an NFC version of the algorithm would work fine and in the same way.
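
The Alef Madda exception can be checked with a short Python sketch using the standard unicodedata module. The variable names are illustrative only.

```python
import unicodedata

ALEF, ALEF_MADDA = "\u0627", "\u0622"
SUPERSCRIPT_ALEF, MADDA_ABOVE = "\u0670", "\u0653"

# The order the algorithm should produce: Superscript Alef before Madda Above.
desired = ALEF + SUPERSCRIPT_ALEF + MADDA_ABOVE

# NFD decomposes 0622 into 0627 0653 and then sorts the marks by ccc
# (Superscript Alef is ccc=35, Madda Above is ccc=230).
nfd = unicodedata.normalize("NFD", ALEF_MADDA + SUPERSCRIPT_ALEF)

# NFC recomposes Alef + Madda Above into 0622, because the intervening
# Superscript Alef has a lower ccc and so does not block the composition.
nfc = unicodedata.normalize("NFC", desired)

print(nfd == desired)                        # NFD keeps the marks separate
print(nfc == ALEF_MADDA + SUPERSCRIPT_ALEF)  # NFC folds the madda into the base
```

With NFC the madda ends up inside the precomposed base letter, so there is no separate mark left that could be ordered after the Superscript Alef.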

Roozbeh
===========================

We think it's reasonable and will eventually implement something based on it.
Please discuss.

behdad


On 12-12-18 10:59 AM, Khaled Hosny wrote:
> On Tue, Dec 18, 2012 at 12:15:45AM -0500, Behdad Esfahbod wrote:
>> On 12-12-18 12:13 AM, Khaled Hosny wrote:
>>> As for madda, Jonathan is right; it should indeed follow other marks, I
>>> don’t know what I was thinking.
>>>
>>> Some testing with people working on texts with heavy use of marks
>>> showed that U+065C and U+06EC should precede vowel marks (but still
>>> follow the hamza).
>>
>> Thanks Khaled,
>>
>> Do you mind compiling a total order for the Arabic marks so I can (blindly) go
>> ahead and implement?
> 
> The list below is separated into groups, ordered to the best of my
> knowledge; marks in each group should be ordered before those in the
> following groups. The order inside each group is not important IMO, but I
> kept them ordered by the existing combining classes.
> 
> Regards,
> Khaled
> 
> (the first field is the existing combining class)
> 
> 220	U+0655	◌ٕ	ARABIC HAMZA BELOW
> 220	U+065F	◌ٟ	ARABIC WAVY HAMZA BELOW
> 230	U+0654	◌ٔ	ARABIC HAMZA ABOVE
> 
> 220	U+065C	◌ٜ	ARABIC VOWEL SIGN DOT BELOW
> 230	U+06EC	◌۬	ARABIC ROUNDED HIGH STOP WITH FILLED CENTRE
> 
> 033	U+0651	◌ّ	ARABIC SHADDA
> 230	U+06DF	◌۟	ARABIC SMALL HIGH ROUNDED ZERO
> 230	U+06E0	◌۠	ARABIC SMALL HIGH UPRIGHT RECTANGULAR ZERO
> 
> 027	U+064B	◌ً	ARABIC FATHATAN
> 027	U+08F0	◌ࣰ	ARABIC OPEN FATHATAN
> 028	U+064C	◌ٌ	ARABIC DAMMATAN
> 028	U+08F1	◌ࣱ	ARABIC OPEN DAMMATAN
> 029	U+064D	◌ٍ	ARABIC KASRATAN
> 029	U+08F2	◌ࣲ	ARABIC OPEN KASRATAN
> 030	U+0618	◌ؘ	ARABIC SMALL FATHA
> 030	U+064E	◌َ	ARABIC FATHA
> 031	U+0619	◌ؙ	ARABIC SMALL DAMMA
> 031	U+064F	◌ُ	ARABIC DAMMA
> 032	U+061A	◌ؚ	ARABIC SMALL KASRA
> 032	U+0650	◌ِ	ARABIC KASRA
> 034	U+0652	◌ْ	ARABIC SUKUN
> 230	U+06E1	◌ۡ	ARABIC SMALL HIGH DOTLESS HEAD OF KHAH
> 230	U+0657	◌ٗ	ARABIC INVERTED DAMMA
> 230	U+0658	◌٘	ARABIC MARK NOON GHUNNA
> 230	U+0659	◌ٙ	ARABIC ZWARAKAY
> 230	U+065A	◌ٚ	ARABIC VOWEL SIGN SMALL V ABOVE
> 230	U+065B	◌ٛ	ARABIC VOWEL SIGN INVERTED SMALL V ABOVE
> 230	U+065D	◌ٝ	ARABIC REVERSED DAMMA
> 230	U+065E	◌ٞ	ARABIC FATHA WITH TWO DOTS
> 
> 035	U+0670	◌ٰ	ARABIC LETTER SUPERSCRIPT ALEF
> 220	U+0656	◌ٖ	ARABIC SUBSCRIPT ALEF
> 220	U+06ED	◌ۭ	ARABIC SMALL LOW MEEM
> 230	U+06E2	◌ۢ	ARABIC SMALL HIGH MEEM ISOLATED FORM
> 
> 220	U+06EA	◌۪	ARABIC EMPTY CENTRE LOW STOP
> 230	U+06EB	◌۫	ARABIC EMPTY CENTRE HIGH STOP
> 
> 220	U+06E3	◌ۣ	ARABIC SMALL LOW SEEN
> 230	U+06E7	◌ۧ	ARABIC SMALL HIGH YEH
> 230	U+06E8	◌ۨ	ARABIC SMALL HIGH NOON
> 
> 230	U+0653	◌ٓ	ARABIC MADDAH ABOVE
> 230	U+06E4	◌ۤ	ARABIC SMALL HIGH MADDA
> 
> 230	U+0610	◌ؐ	ARABIC SIGN SALLALLAHOU ALAYHE WASSALLAM
> 230	U+0611	◌ؑ	ARABIC SIGN ALAYHE ASSALLAM
> 230	U+0612	◌ؒ	ARABIC SIGN RAHMATULLAH ALAYHE
> 230	U+0613	◌ؓ	ARABIC SIGN RADI ALLAHOU ANHU
> 230	U+0614	◌ؔ	ARABIC SIGN TAKHALLUS
> 
> 230	U+0615	◌ؕ	ARABIC SMALL HIGH TAH
> 230	U+0616	◌ؖ	ARABIC SMALL HIGH LIGATURE ALEF WITH LAM WITH YEH
> 230	U+0617	◌ؗ	ARABIC SMALL HIGH ZAIN
> 230	U+06D6	◌ۖ	ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA
> 230	U+06D7	◌ۗ	ARABIC SMALL HIGH LIGATURE QAF WITH LAM WITH ALEF MAKSURA
> 230	U+06D8	◌ۘ	ARABIC SMALL HIGH MEEM INITIAL FORM
> 230	U+06D9	◌ۙ	ARABIC SMALL HIGH LAM ALEF
> 230	U+06DA	◌ۚ	ARABIC SMALL HIGH JEEM
> 230	U+06DB	◌ۛ	ARABIC SMALL HIGH THREE DOTS
> 230	U+06DC	◌ۜ	ARABIC SMALL HIGH SEEN
> 

-- 
behdad
http://behdad.org/


