[HarfBuzz] The canonical ordering of hamza marks
Behdad Esfahbod
behdad at behdad.org
Fri Oct 18 08:15:13 PDT 2013
+roozbeh
On 13-10-18 04:52 PM, Khaled Hosny wrote:
> On Thu, Oct 17, 2013 at 10:05:20PM +0200, Behdad Esfahbod wrote:
>> Khaled,
>>
>> Here's what Roozbeh prepared:
>>
>> ===========================
>> Behdad,
>>
>> I did a very thorough search of both the Koran and the Unicode proposals for
>> the new Arabic characters for the last fifteen years or so.
>>
>> I could actually come up with a very simple algorithm:
>>
>> First, convert the input sequence to NFD.
>>
>> The order of the characters will be a bit messed up after this due to bad old
>> decisions in Unicode, and our goal is to make it clean. After this step, we
>> will have the traditional marks (ccc not in [220, 230]) at the very beginning,
>> with the newly encoded ones (ccc in [220, 230]) after them.
>>
>> Definition: MCM, defined here, is the modifier combining marks, which actually
>> modify a base letter (and also have ccc=220 or 230). That means that
>> traditional harakat come after them in logical order, but before them in NFD.
>> Here is the MCM set:
>>
>> 0654 ARABIC HAMZA ABOVE
>> 0655 ARABIC HAMZA BELOW
>> 0658 ARABIC MARK NOON GHUNNA
>> 06DC ARABIC SMALL HIGH SEEN
>> 06E3 ARABIC SMALL LOW SEEN
>> 06E7 ARABIC SMALL HIGH YEH
>> 06E8 ARABIC SMALL HIGH NOON
>> 08F3 ARABIC SMALL HIGH WAW
>
> U+0653 ARABIC MADDAH ABOVE should be added to this list, see below.
>
>> Following, is the order in which the combining marks after each base letter
>> should be read, for them to be in logical order (it could be used for both
>> determining rendering order, and backspacing):
>>
>> 1. The longest "consecutive" sequence of characters "at the beginning" of the
>> ccc=220 part of the list that are in MCM;
>> 2. The longest "consecutive" sequence of characters "at the beginning" of the
>> ccc=230 part of the list that are in MCM;
>> 3. All the characters in the ccc=33 (shadda) part of the list;
>> 4. All the rest of the characters (in NFD order).
>>
>> Very obscure test data, just to demonstrate the algorithm:
>>
>> src: 0618 0619 064E 064F 0654 0658 0653 0654 0651 0656 0651 065C 0655 0650
>> ccc:   30   31   30   31  230  230  230  230   33  220   33  220  220   32
>> MCM:    -    -    -    -  Yes  Yes    -  Yes    -    -    -    -  Yes    -
>>
>> out: 0654 0658 0651 0651 0618 064E 0619 064F 0650 0656 065C 0655 0653 0654
>> ccc:  230  230   33   33   30   30   31   31   32  220  220  220  230  230
>> MCM:  Yes  Yes    -    -    -    -    -    -    -    -    -  Yes    -  Yes
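
The four steps quoted above can be sketched in Python roughly as follows.
This is only an illustration, not HarfBuzz code: the names reorder_marks,
hexes and SRC are made up here, the function assumes it is handed just the
run of combining marks that follows a single base letter, and it relies on
the combining classes in Python's unicodedata module. The MCM set is the
one from Roozbeh's list; it is a parameter so that variants (such as adding
U+0653, as Khaled suggests) can be tried.

```python
import unicodedata

# MCM set exactly as listed in Roozbeh's message.
MCM = frozenset({0x0654, 0x0655, 0x0658, 0x06DC, 0x06E3, 0x06E7, 0x06E8, 0x08F3})


def reorder_marks(marks, mcm=MCM):
    """Reorder the combining marks that follow one base letter.

    Assumption: `marks` contains only combining marks (no base letters).
    """
    nfd = unicodedata.normalize("NFD", marks)
    ccc = [unicodedata.combining(ch) for ch in nfd]

    picked = []
    # Steps 1 and 2: the leading run of MCM characters in the ccc=220 part,
    # then the leading run of MCM characters in the ccc=230 part.
    for cls in (220, 230):
        for i, ch in enumerate(nfd):
            if ccc[i] != cls:
                continue
            if ord(ch) not in mcm:
                break  # the run ends at the first non-MCM mark of this class
            picked.append(i)
    # Step 3: every ccc=33 (shadda) character.
    picked += [i for i, k in enumerate(ccc) if k == 33]
    # Step 4: everything else, still in NFD order.
    taken = set(picked)
    rest = [i for i in range(len(nfd)) if i not in taken]
    return "".join(nfd[i] for i in picked + rest)


def hexes(s):
    """Format a string as space-separated code points, e.g. '0654 0658'."""
    return " ".join(f"{ord(ch):04X}" for ch in s)


# The "src" line of the test data quoted above.
SRC = ("\u0618\u0619\u064E\u064F\u0654\u0658\u0653\u0654"
       "\u0651\u0656\u0651\u065C\u0655\u0650")
print(hexes(reorder_marks(SRC)))
```

Run on the "src" sequence, this should print the same code points as the
"out" line of the test data. Passing mcm=MCM | {0x0653} gives a quick way
to see what treating MADDAH ABOVE as an MCM mark would change.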
>
> I think the order of Hamza below is not right, I'd expect it to come at
> least before other below marks, regardless of whether there are other
> MCM marks in the sequence or not.
>
>> Note that the algorithm guarantees canonical equivalence of the output and
>> input, and also guarantees the same result for all canonically equivalent strings.
>>
>> Also note that you cannot replace NFD with NFC in the algorithm, because of
>> Alef Madda Above: 0622=0627 0653. The result of the algorithm for <Alef Madda,
>> Superscript Alef> should be <Alef, Superscript Alef, Madda Above> (Superscript
>> Alef should always come before Madda above, the sequence <Fatha, Alef Maksura,
>> Madda> is quite common in the Koran). If not for the exception of Alef Madda
>> above, an NFC version of the algorithm would work fine and in the same way.
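
The NFD/NFC point can be checked with a few lines of Python. This is a
small sketch using only the standard unicodedata module; the variable names
are just for illustration. It shows that NFD already yields <Alef,
Superscript Alef, Madda Above> for the input <Alef Madda, Superscript Alef>,
while NFC composes Alef and Madda Above back into U+0622.

```python
import unicodedata

ALEF_MADDA = "\u0622"        # ARABIC LETTER ALEF WITH MADDA ABOVE
ALEF = "\u0627"              # ARABIC LETTER ALEF
SUPERSCRIPT_ALEF = "\u0670"  # ccc=35
MADDAH_ABOVE = "\u0653"      # ccc=230

text = ALEF_MADDA + SUPERSCRIPT_ALEF

# NFD decomposes 0622 into 0627 0653 and then sorts the marks by combining
# class, so Superscript Alef (35) ends up before Madda Above (230).
nfd = unicodedata.normalize("NFD", text)

# NFC recomposes 0627 + 0653 into 0622, because the intervening Superscript
# Alef has a lower combining class and does not block the composition.
nfc = unicodedata.normalize("NFC", nfd)

print([f"{ord(ch):04X}" for ch in nfd])
print([f"{ord(ch):04X}" for ch in nfc])
```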
>
> I disagree here: 0653 is actually a special form of Hamza and should be
> treated like the other MCM marks. The madda used in the Quran serves a quite
> different purpose and has its own code point, U+06E4 ARABIC SMALL HIGH
> MADDA.
>
>> Roozbeh
>> ===========================
>>
>> We think it's reasonable and will eventually implement something based on it.
>> Please discuss.
>>
>> behdad
>>
>>
>> On 12-12-18 10:59 AM, Khaled Hosny wrote:
>>> On Tue, Dec 18, 2012 at 12:15:45AM -0500, Behdad Esfahbod wrote:
>>>> On 12-12-18 12:13 AM, Khaled Hosny wrote:
>>>>> As for madda, Jonathan is right; it should indeed follow other marks, I
>>>>> don’t know what I was thinking.
>>>>>
>>>>> Some testing with people working on texts with heavy use of marks,
>>>>> showed that U+065C and U+06EC should precede vowel marks (but still
>>>>> follow the hamza).
>>>>
>>>> Thanks Khaled,
>>>>
>>>> Do you mind compiling a total order for the Arabic marks so I can (blindly) go
>>>> ahead and implement?
>>>
>>> The list below is separated into groups, ordered to the best of my
>>> knowledge; marks in each group should be ordered before those in the
>>> following groups. The order inside each group is not important IMO, but
>>> I kept them ordered by the existing combining classes.
>>>
>>> Regards,
>>> Khaled
>>>
>>> (the first field is the existing combining class)
>>>
>>> 220 U+0655 ◌ٕ ARABIC HAMZA BELOW
>>> 220 U+065F ◌ٟ ARABIC WAVY HAMZA BELOW
>>> 230 U+0654 ◌ٔ ARABIC HAMZA ABOVE
>>>
>>> 220 U+065C ◌ٜ ARABIC VOWEL SIGN DOT BELOW
>>> 230 U+06EC ◌۬ ARABIC ROUNDED HIGH STOP WITH FILLED CENTRE
>>>
>>> 033 U+0651 ◌ّ ARABIC SHADDA
>>> 230 U+06DF ◌۟ ARABIC SMALL HIGH ROUNDED ZERO
>>> 230 U+06E0 ◌۠ ARABIC SMALL HIGH UPRIGHT RECTANGULAR ZERO
>>>
>>> 027 U+064B ◌ً ARABIC FATHATAN
>>> 027 U+08F0 ◌ࣰ ARABIC OPEN FATHATAN
>>> 028 U+064C ◌ٌ ARABIC DAMMATAN
>>> 028 U+08F1 ◌ࣱ ARABIC OPEN DAMMATAN
>>> 029 U+064D ◌ٍ ARABIC KASRATAN
>>> 029 U+08F2 ◌ࣲ ARABIC OPEN KASRATAN
>>> 030 U+0618 ◌ؘ ARABIC SMALL FATHA
>>> 030 U+064E ◌َ ARABIC FATHA
>>> 031 U+0619 ◌ؙ ARABIC SMALL DAMMA
>>> 031 U+064F ◌ُ ARABIC DAMMA
>>> 032 U+061A ◌ؚ ARABIC SMALL KASRA
>>> 032 U+0650 ◌ِ ARABIC KASRA
>>> 034 U+0652 ◌ْ ARABIC SUKUN
>>> 230 U+06E1 ◌ۡ ARABIC SMALL HIGH DOTLESS HEAD OF KHAH
>>> 230 U+0657 ◌ٗ ARABIC INVERTED DAMMA
>>> 230 U+0658 ◌٘ ARABIC MARK NOON GHUNNA
>>> 230 U+0659 ◌ٙ ARABIC ZWARAKAY
>>> 230 U+065A ◌ٚ ARABIC VOWEL SIGN SMALL V ABOVE
>>> 230 U+065B ◌ٛ ARABIC VOWEL SIGN INVERTED SMALL V ABOVE
>>> 230 U+065D ◌ٝ ARABIC REVERSED DAMMA
>>> 230 U+065E ◌ٞ ARABIC FATHA WITH TWO DOTS
>>>
>>> 035 U+0670 ◌ٰ ARABIC LETTER SUPERSCRIPT ALEF
>>> 220 U+0656 ◌ٖ ARABIC SUBSCRIPT ALEF
>>> 220 U+06ED ◌ۭ ARABIC SMALL LOW MEEM
>>> 230 U+06E2 ◌ۢ ARABIC SMALL HIGH MEEM ISOLATED FORM
>>>
>>> 220 U+06EA ◌۪ ARABIC EMPTY CENTRE LOW STOP
>>> 230 U+06EB ◌۫ ARABIC EMPTY CENTRE HIGH STOP
>>>
>>> 220 U+06E3 ◌ۣ ARABIC SMALL LOW SEEN
>>> 230 U+06E7 ◌ۧ ARABIC SMALL HIGH YEH
>>> 230 U+06E8 ◌ۨ ARABIC SMALL HIGH NOON
>>>
>>> 230 U+0653 ◌ٓ ARABIC MADDAH ABOVE
>>> 230 U+06E4 ◌ۤ ARABIC SMALL HIGH MADDA
>>>
>>> 230 U+0610 ◌ؐ ARABIC SIGN SALLALLAHOU ALAYHE WASSALLAM
>>> 230 U+0611 ◌ؑ ARABIC SIGN ALAYHE ASSALLAM
>>> 230 U+0612 ◌ؒ ARABIC SIGN RAHMATULLAH ALAYHE
>>> 230 U+0613 ◌ؓ ARABIC SIGN RADI ALLAHOU ANHU
>>> 230 U+0614 ◌ؔ ARABIC SIGN TAKHALLUS
>>>
>>> 230 U+0615 ◌ؕ ARABIC SMALL HIGH TAH
>>> 230 U+0616 ◌ؖ ARABIC SMALL HIGH LIGATURE ALEF WITH LAM WITH YEH
>>> 230 U+0617 ◌ؗ ARABIC SMALL HIGH ZAIN
>>> 230 U+06D6 ◌ۖ ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA
>>> 230 U+06D7 ◌ۗ ARABIC SMALL HIGH LIGATURE QAF WITH LAM WITH ALEF MAKSURA
>>> 230 U+06D8 ◌ۘ ARABIC SMALL HIGH MEEM INITIAL FORM
>>> 230 U+06D9 ◌ۙ ARABIC SMALL HIGH LAM ALEF
>>> 230 U+06DA ◌ۚ ARABIC SMALL HIGH JEEM
>>> 230 U+06DB ◌ۛ ARABIC SMALL HIGH THREE DOTS
>>> 230 U+06DC ◌ۜ ARABIC SMALL HIGH SEEN
>>>
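
Khaled's grouped list above can be turned into a sort key fairly directly.
The following is a rough Python sketch, not an implementation anyone has
agreed on: GROUPS simply transcribes the groups from the list, while the
names GROUPS, RANK and sort_marks, and the choice to place unlisted marks
after all listed groups, are assumptions made here. Since the order inside
a group is said not to matter, the stable sort just keeps the input order
within each group.

```python
# Groups transcribed from the list above, first group first.
GROUPS = [
    [0x0655, 0x065F, 0x0654],
    [0x065C, 0x06EC],
    [0x0651, 0x06DF, 0x06E0],
    [0x064B, 0x08F0, 0x064C, 0x08F1, 0x064D, 0x08F2, 0x0618, 0x064E,
     0x0619, 0x064F, 0x061A, 0x0650, 0x0652, 0x06E1, 0x0657, 0x0658,
     0x0659, 0x065A, 0x065B, 0x065D, 0x065E],
    [0x0670, 0x0656, 0x06ED, 0x06E2],
    [0x06EA, 0x06EB],
    [0x06E3, 0x06E7, 0x06E8],
    [0x0653, 0x06E4],
    [0x0610, 0x0611, 0x0612, 0x0613, 0x0614],
    [0x0615, 0x0616, 0x0617, 0x06D6, 0x06D7, 0x06D8, 0x06D9, 0x06DA,
     0x06DB, 0x06DC],
]

# Map each code point to the index of its group.
RANK = {cp: i for i, group in enumerate(GROUPS) for cp in group}


def sort_marks(marks):
    """Stable-sort a run of marks by group index.

    Assumption: marks not in the list sort after every listed group.
    """
    return "".join(sorted(marks, key=lambda ch: RANK.get(ord(ch), len(GROUPS))))


# Shadda, fatha, hamza above: the hamza moves first, then shadda, then fatha.
example = sort_marks("\u0651\u064E\u0654")
print(" ".join(f"{ord(ch):04X}" for ch in example))
```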
>>
>> --
>> behdad
>> http://behdad.org/
>
--
behdad
http://behdad.org/