[HarfBuzz] The canonical ordering of hamza marks

Khaled Hosny khaledhosny at eglug.org
Fri Oct 18 07:52:02 PDT 2013


On Thu, Oct 17, 2013 at 10:05:20PM +0200, Behdad Esfahbod wrote:
> Khaled,
> 
> Here's what Roozbeh prepared:
> 
> ===========================
> Behdad,
> 
> I did a very thorough search of both the Koran and the Unicode proposals for
> the new Arabic characters for the last fifteen years or so.
> 
> I could actually come up with a very simple algorithm:
> 
> First, convert the input sequence to NFD.
> 
> The order of the characters will be a bit messed up after this due to bad old
> decisions in Unicode, and our goal is to make it clean. After this step, we
> will have the traditional marks (ccc not in [220, 230]) at the very beginning,
> with the newly encoded ones (ccc in [220, 230]) after them.
> 
> Definition: MCM, the modifier combining marks, are the marks that actually
> modify a base letter (and also have ccc=220 or 230). That means that
> traditional harakat come after them in logical order, but before them in NFD.
> Here is the MCM set:
> 
> 0654 ARABIC HAMZA ABOVE
> 0655 ARABIC HAMZA BELOW
> 0658 ARABIC MARK NOON GHUNNA
> 06DC ARABIC SMALL HIGH SEEN
> 06E3 ARABIC SMALL LOW SEEN
> 06E7 ARABIC SMALL HIGH YEH
> 06E8 ARABIC SMALL HIGH NOON
> 08F3 ARABIC SMALL HIGH WAW

U+0653 ARABIC MADDAH ABOVE should be added to this list; see below.

> Following is the order in which the combining marks after each base letter
> should be read, for them to be in logical order (it could be used for both
> determining rendering order, and backspacing):
> 
> 1. The longest "consecutive" sequence of characters "at the beginning" of the
> ccc=220 part of the list that are in MCM;
> 2. The longest "consecutive" sequence of characters "at the beginning" of the
> ccc=230 part of the list that are in MCM;
> 3. All the characters in the ccc=33 (shadda) part of the list;
> 4. All the rest of the characters (in NFD order).
> 
> Very obscure test data, just to demonstrate the algorithm:
> 
> src: 0618 0619 064E 064F 0654 0658 0653 0654 0651 0656 0651 065C 0655 0650
> ccc:   30   31   30   31  230  230  230  230   33  220   33  220  220   32
> MCM:                      Yes  Yes       Yes                      Yes
> 
> out: 0654 0658 0651 0651 0618 064E 0619 064F 0650 0656 065C 0655 0653 0654
> ccc:  230  230   33   33   30   30   31   31   32  220  220  220  230  230
> MCM:  Yes  Yes                                               Yes       Yes

I think the order of Hamza Below is not right; I'd expect it to come at
least before the other below marks, regardless of whether there are other
MCM marks in the sequence or not.
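
To make the quoted four steps easier to experiment with, here is a minimal
Python sketch of them. It is only an illustration, not what HarfBuzz does or
will do. The names MCM, reorder_marks and logical_order are made up for this
example, the MCM set is the quoted one (without U+0653), and I am assuming
that Python's unicodedata module reports the same combining classes as the
test data above.

```python
import unicodedata

# MCM set as quoted above (U+0653 is NOT included here).
MCM = {0x0654, 0x0655, 0x0658, 0x06DC, 0x06E3, 0x06E7, 0x06E8, 0x08F3}


def reorder_marks(marks):
    """Reorder one NFD-ordered run of combining marks following steps 1-4."""
    taken = set()  # indices already emitted by steps 1-3
    out = []
    # Steps 1 and 2: leading MCM run of the ccc=220 block, then of ccc=230.
    for ccc in (220, 230):
        for i, ch in enumerate(marks):
            if unicodedata.combining(ch) != ccc:
                continue
            if ord(ch) not in MCM:
                break  # the run must start at the beginning of the block
            out.append(ch)
            taken.add(i)
    # Step 3: every ccc=33 (shadda) mark.
    for i, ch in enumerate(marks):
        if unicodedata.combining(ch) == 33:
            out.append(ch)
            taken.add(i)
    # Step 4: everything else, kept in NFD order.
    out += [ch for i, ch in enumerate(marks) if i not in taken]
    return out


def logical_order(text):
    """NFD-normalize text, then reorder each run of combining marks."""
    result, run = [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            run.append(ch)
        else:
            result += reorder_marks(run)
            run = []
            result.append(ch)
    result += reorder_marks(run)
    return "".join(result)


src = "0618 0619 064E 064F 0654 0658 0653 0654 0651 0656 0651 065C 0655 0650"
text = "".join(chr(int(cp, 16)) for cp in src.split())
print(" ".join(f"{ord(ch):04X}" for ch in logical_order(text)))
```

Run on the src line of the test data, this should reproduce the out line,
including Hamza Below staying behind the other below marks, which is the
behaviour I am questioning above.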

> Note that the algorithm guarantees canonical equivalence of the output and
> input, and also guarantees the same result for all canonically equivalent strings.
> 
> Also note that you cannot replace NFD with NFC in the algorithm, because of
> Alef Madda Above: 0622=0627 0653. The result of the algorithm for <Alef Madda,
> Superscript Alef> should be <Alef, Superscript Alef, Madda Above> (Superscript
> Alef should always come before Madda above, the sequence <Fatha, Alef Maksura,
> Madda> is quite common in the Koran). If not for the exception of Alef Madda
> above, an NFC version of the algorithm would work fine and in the same way.

I disagree here: U+0653 is actually a special form of Hamza and should be
treated like the other MCM marks. The madda used in the Quran serves quite a
different purpose and has its own code point, U+06E4 ARABIC SMALL HIGH
MADDA.
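
Independently of how U+0653 ends up being classified, the NFC problem
described in the quoted text is easy to see with Python's unicodedata module.
This is just a small illustration (the cps helper is made up for printing
code points):

```python
import unicodedata


def cps(s):
    """Format a string as space-separated hex code points."""
    return " ".join(f"{ord(c):04X}" for c in s)


# <Alef Madda, Superscript Alef>
alef_madda_superscript_alef = "\u0622\u0670"

nfd = unicodedata.normalize("NFD", alef_madda_superscript_alef)
nfc = unicodedata.normalize("NFC", alef_madda_superscript_alef)

print(cps(nfd))  # 0627 0670 0653
print(cps(nfc))  # 0622 0670
```

In NFD the Madda Above is a separate mark that sorts after Superscript Alef
(ccc 230 vs. 35), while NFC folds it back into U+0622, so there is no mark
left to order relative to the Superscript Alef.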

> Roozbeh
> ===========================
> 
> We think it's reasonable and will eventually implement something based on it.
> Please discuss.
> 
> behdad
> 
> 
> On 12-12-18 10:59 AM, Khaled Hosny wrote:
> > On Tue, Dec 18, 2012 at 12:15:45AM -0500, Behdad Esfahbod wrote:
> >> On 12-12-18 12:13 AM, Khaled Hosny wrote:
> >>> As for madda, Jonathan is right; it should indeed follow other marks, I
> >>> don’t know what I was thinking.
> >>>
> >>> Some testing with people working on texts with heavy use of marks,
> >>> showed that U+065C and U+06EC should precede vowel marks (but still
> >>> follow the hamza).
> >>
> >> Thanks Khaled,
> >>
> >> Do you mind compiling a total order for the Arabic marks so I can (blindly) go
> >> ahead and implement?
> > 
> > The list below is separated into groups, ordered to the best of my
> > knowledge; marks in each group should be ordered before those in the
> > following groups. The order inside each group is not important IMO, but I
> > kept them ordered by the existing combining classes.
> > 
> > Regards,
> > Khaled
> > 
> > (the first field is the existing combining class)
> > 
> > 220	U+0655	◌ٕ	ARABIC HAMZA BELOW
> > 220	U+065F	◌ٟ	ARABIC WAVY HAMZA BELOW
> > 230	U+0654	◌ٔ	ARABIC HAMZA ABOVE
> > 
> > 220	U+065C	◌ٜ	ARABIC VOWEL SIGN DOT BELOW
> > 230	U+06EC	◌۬	ARABIC ROUNDED HIGH STOP WITH FILLED CENTRE
> > 
> > 033	U+0651	◌ّ	ARABIC SHADDA
> > 230	U+06DF	◌۟	ARABIC SMALL HIGH ROUNDED ZERO
> > 230	U+06E0	◌۠	ARABIC SMALL HIGH UPRIGHT RECTANGULAR ZERO
> > 
> > 027	U+064B	◌ً	ARABIC FATHATAN
> > 027	U+08F0	◌ࣰ	ARABIC OPEN FATHATAN
> > 028	U+064C	◌ٌ	ARABIC DAMMATAN
> > 028	U+08F1	◌ࣱ	ARABIC OPEN DAMMATAN
> > 029	U+064D	◌ٍ	ARABIC KASRATAN
> > 029	U+08F2	◌ࣲ	ARABIC OPEN KASRATAN
> > 030	U+0618	◌ؘ	ARABIC SMALL FATHA
> > 030	U+064E	◌َ	ARABIC FATHA
> > 031	U+0619	◌ؙ	ARABIC SMALL DAMMA
> > 031	U+064F	◌ُ	ARABIC DAMMA
> > 032	U+061A	◌ؚ	ARABIC SMALL KASRA
> > 032	U+0650	◌ِ	ARABIC KASRA
> > 034	U+0652	◌ْ	ARABIC SUKUN
> > 230	U+06E1	◌ۡ	ARABIC SMALL HIGH DOTLESS HEAD OF KHAH
> > 230	U+0657	◌ٗ	ARABIC INVERTED DAMMA
> > 230	U+0658	◌٘	ARABIC MARK NOON GHUNNA
> > 230	U+0659	◌ٙ	ARABIC ZWARAKAY
> > 230	U+065A	◌ٚ	ARABIC VOWEL SIGN SMALL V ABOVE
> > 230	U+065B	◌ٛ	ARABIC VOWEL SIGN INVERTED SMALL V ABOVE
> > 230	U+065D	◌ٝ	ARABIC REVERSED DAMMA
> > 230	U+065E	◌ٞ	ARABIC FATHA WITH TWO DOTS
> > 
> > 035	U+0670	◌ٰ	ARABIC LETTER SUPERSCRIPT ALEF
> > 220	U+0656	◌ٖ	ARABIC SUBSCRIPT ALEF
> > 220	U+06ED	◌ۭ	ARABIC SMALL LOW MEEM
> > 230	U+06E2	◌ۢ	ARABIC SMALL HIGH MEEM ISOLATED FORM
> > 
> > 220	U+06EA	◌۪	ARABIC EMPTY CENTRE LOW STOP
> > 230	U+06EB	◌۫	ARABIC EMPTY CENTRE HIGH STOP
> > 
> > 220	U+06E3	◌ۣ	ARABIC SMALL LOW SEEN
> > 230	U+06E7	◌ۧ	ARABIC SMALL HIGH YEH
> > 230	U+06E8	◌ۨ	ARABIC SMALL HIGH NOON
> > 
> > 230	U+0653	◌ٓ	ARABIC MADDAH ABOVE
> > 230	U+06E4	◌ۤ	ARABIC SMALL HIGH MADDA
> > 
> > 230	U+0610	◌ؐ	ARABIC SIGN SALLALLAHOU ALAYHE WASSALLAM
> > 230	U+0611	◌ؑ	ARABIC SIGN ALAYHE ASSALLAM
> > 230	U+0612	◌ؒ	ARABIC SIGN RAHMATULLAH ALAYHE
> > 230	U+0613	◌ؓ	ARABIC SIGN RADI ALLAHOU ANHU
> > 230	U+0614	◌ؔ	ARABIC SIGN TAKHALLUS
> > 
> > 230	U+0615	◌ؕ	ARABIC SMALL HIGH TAH
> > 230	U+0616	◌ؖ	ARABIC SMALL HIGH LIGATURE ALEF WITH LAM WITH YEH
> > 230	U+0617	◌ؗ	ARABIC SMALL HIGH ZAIN
> > 230	U+06D6	◌ۖ	ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA
> > 230	U+06D7	◌ۗ	ARABIC SMALL HIGH LIGATURE QAF WITH LAM WITH ALEF MAKSURA
> > 230	U+06D8	◌ۘ	ARABIC SMALL HIGH MEEM INITIAL FORM
> > 230	U+06D9	◌ۙ	ARABIC SMALL HIGH LAM ALEF
> > 230	U+06DA	◌ۚ	ARABIC SMALL HIGH JEEM
> > 230	U+06DB	◌ۛ	ARABIC SMALL HIGH THREE DOTS
> > 230	U+06DC	◌ۜ	ARABIC SMALL HIGH SEEN
> > 
> 
> -- 
> behdad
> http://behdad.org/
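
For comparison, the grouped list from the older message quoted above can be
applied with a plain stable sort. Again this is only a sketch: GROUPS, RANK
and sort_marks are names invented for this example, the tables just
transcribe the code points of that list, and putting unlisted marks last is
my assumption, since the list says nothing about them.

```python
# Groups from the quoted 2012 list, earliest group first (code points only).
GROUPS = [
    [0x0655, 0x065F, 0x0654],
    [0x065C, 0x06EC],
    [0x0651, 0x06DF, 0x06E0],
    [0x064B, 0x08F0, 0x064C, 0x08F1, 0x064D, 0x08F2, 0x0618, 0x064E,
     0x0619, 0x064F, 0x061A, 0x0650, 0x0652, 0x06E1, 0x0657, 0x0658,
     0x0659, 0x065A, 0x065B, 0x065D, 0x065E],
    [0x0670, 0x0656, 0x06ED, 0x06E2],
    [0x06EA, 0x06EB],
    [0x06E3, 0x06E7, 0x06E8],
    [0x0653, 0x06E4],
    [0x0610, 0x0611, 0x0612, 0x0613, 0x0614],
    [0x0615, 0x0616, 0x0617, 0x06D6, 0x06D7, 0x06D8, 0x06D9, 0x06DA,
     0x06DB, 0x06DC],
]

# Map each code point to the index of its group.
RANK = {cp: i for i, group in enumerate(GROUPS) for cp in group}


def sort_marks(marks):
    """Stable sort of the marks after one base letter by group index.

    Marks keep their input order inside a group; marks missing from the
    list are placed last (an assumption, not part of the quoted list).
    """
    return sorted(marks, key=lambda ch: RANK.get(ord(ch), len(GROUPS)))


# Fatha, Shadda, Hamza Above -> Hamza Above, Shadda, Fatha
marks = "\u064E\u0651\u0654"
print(" ".join(f"{ord(ch):04X}" for ch in sort_marks(marks)))
```

Because sorted() is stable, marks within the same group stay in whatever
order they arrived in, which matches the remark that the order inside each
group is not important.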


