[HarfBuzz] The canonical ordering of hamza marks

Fri Oct 18 17:00:37 PDT 2013

This is all well understood, and quite unfortunate, but that is how
Unicode currently is. Still…

1) The original issue that started this (sub)thread: if I have
   <U+0622,U+0670> or <U+0627,U+0653,U+0670>, I expect the small Alef to
   be above the Madda. Placing the Madda above the small Alef gives the
   sequence a totally new meaning, which is unacceptable IMHO.

2) The fact that U+0622 is canonically equivalent to <U+0627,U+0653>
   pretty much rules out using U+0653 as a vowel mark; no other vowel
   mark in the Arabic block exhibits such normalization behaviour. This
   shouldn’t prevent a language that uses the Arabic script from using
   it as a modifier, as I would expect it then to still behave as an
   MCM mark.
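
The normalization behaviour behind both points can be checked with a
small sketch, assuming Python's standard unicodedata module (the
variable and helper names are only illustrative). It looks at the
canonical decomposition of U+0622, the canonical combining classes of
U+0653 and U+0670, and what NFD/NFC do to the two sequences from (1):

```python
import unicodedata

ALEF = "\u0627"        # ARABIC LETTER ALEF
ALEF_MADDA = "\u0622"  # ARABIC LETTER ALEF WITH MADDA ABOVE
MADDA = "\u0653"       # ARABIC MADDAH ABOVE
SMALL_ALEF = "\u0670"  # ARABIC LETTER SUPERSCRIPT ALEF


def codepoints(text):
    """Illustrative helper: format a string as U+XXXX labels."""
    return ["U+%04X" % ord(ch) for ch in text]


# Point 2: U+0622 has a canonical decomposition to <U+0627,U+0653>.
decomposition = unicodedata.decomposition(ALEF_MADDA)

# Canonical combining classes drive the reordering of adjacent marks:
# the mark with the lower class is sorted first.
ccc_madda = unicodedata.combining(MADDA)
ccc_small_alef = unicodedata.combining(SMALL_ALEF)

# Point 1: both spellings end up as the same normalized sequences.
nfd_precomposed = unicodedata.normalize("NFD", ALEF_MADDA + SMALL_ALEF)
nfd_decomposed = unicodedata.normalize("NFD", ALEF + MADDA + SMALL_ALEF)
nfc_decomposed = unicodedata.normalize("NFC", ALEF + MADDA + SMALL_ALEF)

print(decomposition)                # 0627 0653
print(ccc_madda, ccc_small_alef)    # 230 35
print(codepoints(nfd_precomposed))  # ['U+0627', 'U+0670', 'U+0653']
print(codepoints(nfd_decomposed))   # ['U+0627', 'U+0670', 'U+0653']
print(codepoints(nfc_decomposed))   # ['U+0622', 'U+0670']
```

In NFD the small Alef (class 35) is sorted before the Madda (class
230), which is the code-point order the complaint in (1) is about.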

Regards,
Khaled

On Fri, Oct 18, 2013 at 04:33:01PM -0700, Roozbeh Pournader wrote:
> Let me try to approach the problem from another angle.
> 
> Unicode, although originally planned to be more semantic, has become more
> and more a graphical encoding. This can be evidenced by the new characters
> encoded or not encoded. The UTC continuously refers people to use existing
> code points for things that are graphically similar to already-encoded
> characters but are semantically very different, but encodes new characters
> that are semantically the same as existing characters, but their exact
> visual representation is important and is based on rules that are very hard
> to derive.
> 
> This is inevitable to some degree, since text rendering technology and
> fonts should not be expected to be very complex. So plain text
> representation becomes more visual in order to make life easier for the
> rendering engines.
> 
> This can be evidenced by a lot of the newer characters in the Arabic
> blocks. The open tanweens or arrowheads in the Arabic Extended-A block were
> encoded because they were graphically different, while the committee did
> not encode a "waw with madda above" and recommended "waw+madda above" to be
> used for it instead. The diacritical hamza was the most controversial, and
> the controversy is the main reason for the hole at U+08A1 (it is reserved
> for a Beh With Hamza Above, which will be in Unicode 7.0).
> 
> All in all, this means that UTC considers anything that very much looks
> like U+0653 a madda above, and anything that may need to be visually
> distinguished from it and be smaller in size a small high madda. The glyphs
> used in the chart show a significant size difference, and have been showing
> that difference since the small high madda got encoded in Unicode 2.0.
> Unicode actually doesn't prescribe exact usage of a lot of the Koranic
> marks, because the marks may be used very differently across the various
> Koranic traditions from Indonesia to Morocco.
> 
> I don't think it's a good idea to consider madda to be a certain kind of
> hamza. Yes, in the modern Arabic language Alef+madda above is semantically
> equivalent to hamza+alef or alef+alef, but there is no hint of a hamza
> semantic when some minority languages using the Arabic script take a madda
> and puts it over a waw to get a new vowel.
> 
> I understand that means that there may be no "real" semantic difference
> between a normal madda and a small high madda, but there's really no
> semantic difference between a yeh and a farsi yeh either, and they are
> separately encoded. Unicode is quite graphical in its encoding.
> 
> Regarding U+06C7 and U+06C8, the UTC has agreed to not encode such
> characters anymore, except for the use of hamza above for diacritic usages
> of non-hamza semantics. So there may as well be future siblings for U+0681,
> U+076C, U+08A1, and U+08A8, but no future siblings to U+06C7 and U+06C8.
> 
> Please tell me if there's anything I've failed to address.
> 
> 
> On Fri, Oct 18, 2013 at 3:18 PM, Khaled Hosny <khaledhosny at eglug.org> wrote:
> 
> > On Fri, Oct 18, 2013 at 02:57:43PM -0700, Roozbeh Pournader wrote:
> > > Khaled, you are referring to a specific style of writing the Koran. There
> > > are several others, which Unicode should be able to represent.
> >
> > I’m not sure I follow here. If you think there should be a way to
> > differentiate between two forms of prolongation mark (aka Quranic
> > Madda), something I have never seen but I’m open to learning something
> > new, then a new code point should be encoded, instead of abusing a Hamza
> > (aka the other Madda) that has an incompatible normalization behaviour
> > in Unicode.
> >
> > And you ignored my other point.
> >
> > Regards,
> > Khaled
> >
> > > On Fri, Oct 18, 2013 at 2:47 PM, Khaled Hosny <khaledhosny at eglug.org> wrote:
> > >
> > > > On Fri, Oct 18, 2013 at 02:26:15PM -0700, Roozbeh Pournader wrote:
> > > > > On Fri, Oct 18, 2013 at 2:23 PM, Khaled Hosny <khaledhosny at eglug.org> wrote:
> > > > >
> > > > > > Furthermore, <alef,quranic madda> ≠ <alef with madda above>
> > > > > >
> > > > >
> > > > > Why?
> > > >
> > > > Because every Mushaf printed in Egypt (and most of the Arabic world)
> > > > since 1919[1] has a note at the end of the Madda description stating that
> > > > “… and this mark should not be used to indicate an omitted Alef after[sic]
> > > > a written Alef, as in آمنوا, that were mistakenly put in many
> > > > Mushafs …”, which to me is a very frank indication that the two marks
> > > > are not the same thing.
> > > >
> > > > Also a vowel mark (which the Quranic Madda is) should not “blend” with
> > > > its base letter, the same way that U+06C7 is not canonically equivalent
> > > > to <U+0648,U+064F> etc.
> > > >
> > > > Regards,
> > > > Khaled
> > > >
> > > > 1. The date of the first Mushaf printed by Al-Azhar, in which most of
> > > > the Quranic annotation marks were formalized and standardized.
> > > >
> >