[utf-8] Aspell and Unicode Normalization
kevin at atkinson.dhs.org
Tue Mar 23 09:21:24 PST 2004
On Tue, 23 Mar 2004, Elias Martenson wrote:
> tis 2004-03-23 klockan 17.30 skrev Kevin Atkinson:
> > if the precomposed character is in the target character set then (1), if
> > both the base and combining character are present then (2), otherwise (3).
> This is brilliant. It should actually be possible to handle Korean with
> By the way, do you handle triple compositions in the same way? I assume
> you do, just want confirmation.
By that do you mean Korean jamo -> Syllable blocks?
> Does anyone know if Ethiopian is decomposable?
Not officially but:
The Ethiopic Syllabary
Even though the Ethiopic script has more than 220 distinct characters,
with a little work Aspell can still handle it. The idea is to split
each character into two parts based on the matrix representation. The
first 3 bits will be the first part and could be mapped to `10000???'.
The next 6 bits will be the second part and could be mapped to
`11??????'. The combined character will then be mapped with the upper
bits coming first. Thus each Ethiopic syllable will have the form
`11?????? 10000???'. By mapping the first and second parts to separate
8-bit characters it is easy to tell which part represents the consonant
and which part represents the vowel of the syllable. This encoding of
the syllabary is far more useful to Aspell than if it were stored in
UTF-8 or UTF-16. In fact, the existing suggestion strategy of Aspell
will work well with this encoding without any additional
modifications. However, additional improvements may be possible by
taking advantage of the consonant-vowel structure of this encoding.
The split consonant-vowel representation may even prove to be so
useful that it may be beneficial to encode other syllabaries in this
fashion, even if they have fewer than 220 characters.
The code to break up a syllable into its consonant and vowel parts does
not exist as of Aspell 0.60. However, it will be fairly easy to add
it as part of the Unicode normalization process once that is written.
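The split described above can be sketched in Python. This is only an
illustration, not Aspell code. It assumes that the "matrix position" of
a character is its offset from U+1200 (the start of the Unicode Ethiopic
block), that the low 3 bits select the vowel column, and that the next
6 bits select the consonant row. The names split_syllable,
join_syllable and encode are hypothetical.

```python
# Assumption: the matrix position is the offset from the start of the
# Unicode Ethiopic block (U+1200..U+137F). The block also holds
# punctuation and digits, which this sketch does not treat specially.
ETHIOPIC_BASE = 0x1200
ETHIOPIC_LAST = 0x137F


def split_syllable(ch):
    """Return (consonant_byte, vowel_byte) for one Ethiopic character."""
    offset = ord(ch) - ETHIOPIC_BASE
    if not 0 <= offset <= ETHIOPIC_LAST - ETHIOPIC_BASE:
        raise ValueError("not in the Ethiopic block: %r" % ch)
    vowel = offset & 0b111                 # first 3 bits -> 10000???
    consonant = (offset >> 3) & 0b111111   # next 6 bits  -> 11??????
    return 0b11000000 | consonant, 0b10000000 | vowel


def join_syllable(consonant_byte, vowel_byte):
    """Inverse of split_syllable: rebuild the character from two bytes."""
    offset = ((consonant_byte & 0b111111) << 3) | (vowel_byte & 0b111)
    return chr(ETHIOPIC_BASE + offset)


def encode(text):
    """Encode a string of Ethiopic characters, upper bits coming first."""
    out = bytearray()
    for ch in text:
        out.extend(split_syllable(ch))
    return bytes(out)
```

Under these assumptions, two syllables that share a consonant differ
only in the second byte, and two that share a vowel differ only in the
first, which is the property the suggestion strategy could exploit.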