[utf-8] Languages Which Aspell can Support

Kevin Atkinson kevin at atkinson.dhs.org
Tue Mar 16 22:28:09 PST 2004


Since everyone gave me such a hard time about my decision to keep Aspell
8-bit internally, here is the text of an Appendix I just added to the
Aspell manual.

Appendix B Languages Which Aspell can Support
*********************************************

Even though Aspell will remain 8-bit internally it should still be
able to support any written language not based on a logographic
script.  The only logographic writing systems in current use are those
based on hanzi, which include Chinese, Japanese, and sometimes Korean.

B.1 Languages with 220 or Fewer Unique Symbols
==============================================

Aspell 0.51 should be able to support the following languages as, to
the best of my knowledge, they all contain 220 or fewer symbols and can
thus fit within an 8-bit character set.  If an existing character set
does not exist then a new one can be invented.  This is true even if
the script is not yet supported by Unicode as the private use area can
be used.
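
   As a rough illustration of inventing such a character set, the
sketch below collects the distinct symbols in a word list, checks them
against the 220-symbol limit, and assigns each one a byte value.  This
is not Aspell code: the names `build_charset' and `encode' and the
choice of starting byte are assumptions made only for this example.

```python
MAX_SYMBOLS = 220  # limit quoted in the text above


def build_charset(words, first_byte=256 - MAX_SYMBOLS):
    """Map each distinct symbol in `words` to one byte value.

    `first_byte` is an arbitrary choice for this sketch; it only
    guarantees that 220 symbols still fit below 256.
    """
    symbols = sorted(set("".join(words)))
    if len(symbols) > MAX_SYMBOLS:
        raise ValueError(f"{len(symbols)} symbols do not fit in 8 bits")
    return {sym: first_byte + i for i, sym in enumerate(symbols)}


def encode(word, charset):
    """Encode `word` as one byte per symbol using `charset`."""
    return bytes(charset[sym] for sym in word)
```

   A word list with more than 220 distinct symbols makes
`build_charset' raise an error, which is the case the later sections
deal with.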

Code   Language Name             Script               Dictionary   Gettext
                                                      Available    Translation

ab     Abkhazian                 Cyrillic             -            -
ae     Avestan                   Avestan              -            -
af     Afrikaans                 Latin                Yes          -
an     Aragonese                 Latin                -            -
ar     Arabic                    Arabic               -            -
as     Assamese                  Bengali              -            -
ay     Aymara                    Latin                -            -
az     Azerbaijani               Arabic               -            -
az                               Cyrillic             -            -
az                               Latin                -            -

ba     Bashkir                   Cyrillic             -            -
be     Belarusian                Cyrillic             -            Yes
bg     Bulgarian                 Cyrillic             Yes          -
bh     Bihari                    Devanagari           -            -
bn     Bengali                   Bengali              -            -
bo     Tibetan                   Tibetan              -            -
br     Breton                    Latin                Yes          -
bs     Bosnian                   Latin                -            -

ca     Catalan/Valencian         Latin                Yes          -
ce     Chechen                   Cyrillic             -            -
ch     Chamorro                  Latin                -            -
co     Corsican                  Latin                -            -
cr     Cree                      Canadian Syllabics   -            -
cr                               Latin                -            -
cs     Czech                     Latin                Yes          -
cv     Chuvash                   Cyrillic             -            -
cy     Welsh                     Latin                Yes          -

da     Danish                    Latin                Yes          -
de     German                    Latin                Yes          -
dv     Divehi                    Dhives Akuru         -            -
dz     Dzongkha                  Tibetan              -            -

el     Greek                     Greek                Yes          -
en     English                   Latin                Yes          -
eo     Esperanto                 Latin                Yes          -
es     Spanish                   Latin                Yes          Incomplete
et     Estonian                  Latin                -            -
eu     Basque                    Latin                -            -

fa     Persian                   Arabic               -            -
fi     Finnish                   Latin                -            -
fj     Fijian                    Latin                -            -
fo     Faroese                   Latin                Yes          -
fr     French                    Latin                Yes          Yes
fy     Frisian                   Latin                -            -

ga     Irish                     Latin                Yes          Yes
gd     Scottish Gaelic           Latin                -            -
gl     Gallegan                  Latin                Yes          -
gn     Guarani                   Latin                -            -
gu     Gujarati                  Gujarati             -            -
gv     Manx                      Latin                -            -

ha     Hausa                     Latin                -            -
he     Hebrew                    Hebrew               -            -
hi     Hindi                     Devanagari           -            -
hr     Croatian                  Latin                Yes          -
hu     Hungarian                 Latin                -            -
hy     Armenian                  Armenian             -            -

id     Indonesian                Arabic               -            -
id                               Latin                Yes          -
io     Ido                       Latin                -            -
is     Icelandic                 Latin                Yes          -
it     Italian                   Latin                Yes          -
iu     Inuktitut                 Canadian Syllabics   -            -
iu                               Latin                -            -

ja     Japanese                  Latin                -            -
jv     Javanese                  Javanese             -            -
jv                               Latin                -            -

ka     Georgian                  Georgian             -            -
kk     Kazakh                    Cyrillic             -            -
kl     Kalaallisut/Greenlandic   Latin                -            -
km     Khmer                     Khmer                -            -
kn     Kannada                   Kannada              -            -
ko     Korean                    Hangeul              -            -
kr     Kanuri                    Latin                -            -
ks     Kashmiri                  Arabic               -            -
ks                               Devanagari           -            -
ku     Kurdish                   Arabic               -            -
ku                               Cyrillic             -            -
ku                               Latin                -            -
kv     Komi                      Cyrillic             -            -
kw     Cornish                   Latin                -            -
ky     Kirghiz                   Arabic               -            -
ky                               Cyrillic             -            -
ky                               Latin                -            -

la     Latin                     Latin                -            -
lo     Lao                       Lao                  -            -
lt     Lithuanian                Latin                -            -
lv     Latvian                   Latin                -            -

mi     Maori                     Latin                Yes          -
mk     Makasar                   Lontara/Makasar      -            -
ml     Malayalam                 Latin                -            -
ml                               Malayalam            -            -
mn     Mongolian                 Cyrillic             -            -
mn                               Mongolian            -            -
mo     Moldavian                 Cyrillic             -            -
mr     Marathi                   Devanagari           -            -
ms     Malay                     Arabic               -            -
ms                               Latin                Yes          -
mt     Maltese                   Latin                -            -
my     Burmese                   Myanmar              -            -

ne     Nepali                    Devanagari           -            -
nl     Dutch                     Latin                Yes          Yes
no     Norwegian                 Latin                Yes          -
nv     Navajo                    Latin                -            -

oc     Occitan/Provencal         Latin                -            -
oj     Ojibwa                    Ojibwe               -            -
or     Oriya                     Oriya                -            -
os     Ossetic                   Cyrillic             -            -

pa     Punjabi                   Gurmukhi             -            -
pi     Pali                      Devanagari           -            -
pi                               Sinhala              -            -
pl     Polish                    Latin                Yes          -
pt     Portuguese                Latin                Yes          -

qu     Quechua                   Latin                -            -

rm     Raeto-Romance             Latin                -            -
ro     Romanian                  Latin                Yes          -
ru     Russian                   Cyrillic             Yes          -

sa     Sanskrit                  Devanagari           -            -
sa                               Sinhala              -            -
sd     Sindhi                    Arabic               -            -
sk     Slovak                    Latin                Yes          -
sl     Slovenian                 Latin                Yes          -
sn     Shona                     Latin                -            -
so     Somali                    Latin                -            -
sq     Albanian                  Latin                -            -
sr     Serbian                   Cyrillic             -            Yes
su     Sundanese                 Latin                -            -
sv     Swedish                   Latin                Yes          -
sw     Swahili                   Latin                -            -

ta     Tamil                     Tamil                -            -
te     Telugu                    Telugu               -            -
tg     Tajik                     Arabic               -            -
tg                               Cyrillic             -            -
tg                               Latin                -            -
tk     Turkmen                   Arabic               -            -
tk                               Cyrillic             -            -
tk                               Latin                -            -
tl     Tagalog                   Latin                -            -
tl                               Tagalog              -            -
tr     Turkish                   Arabic               -            -
tr                               Latin                -            -
tt     Tatar                     Cyrillic             -            -
ty     Tahitian                  Latin                -            -

ug     Uighur                    Arabic               -            -
ug                               Cyrillic             -            -
ug                               Latin                -            -
ug                               Uyghur               -            -
uk     Ukrainian                 Cyrillic             Yes          -
ur     Urdu                      Arabic               -            -
uz     Uzbek                     Cyrillic             -            -
uz                               Latin                -            -

vi     Vietnamese                Latin                -            -
vo     Volapuk                   Latin                -            -

wa     Walloon                   Latin                -            Incomplete

yi     Yiddish                   Hebrew               -            -
yo     Yoruba                    Latin                -            -

zu     Zulu                      Latin                -            -

B.2 Languages in Which the Exact Script Used is Unknown
=======================================================

Aspell can most likely support any of the following languages; however,
I am unsure what script they are written in.  Most of them are probably
written in Latin but I am not sure.  If you have any information about
these languages please email me at <kevina at gnu.org>.

Code Language Name

aa   Afar
ak   Akan
av   Avaric

bi   Bislama
bm   Bambara

ee   Ewe

ff   Fulah

ho   Hiri Motu
ht   Haitian Creole
hz   Herero

ie   Interlingue
ig   Igbo
ii   Sichuan Yi
ik   Inupiaq

kg   Kongo
ki   Kikuyu/Gikuyu
kj   Kwanyama

lb   Luxembourgish
lg   Ganda
li   Limburgan
ln   Lingala
lu   Luba-Katanga

mg   Malagasy
mh   Marshallese

na   Nauru
nb   Norwegian Bokmal
nd   North Ndebele
ng   Ndonga
nn   Nynorsk, Norwegian
nr   South Ndebele
ny   Nyanja

ps   Pushto

rn   Rundi
rw   Kinyarwanda

sc   Sardinian
se   Northern Sami
sg   Sango
si   Sinhalese
sm   Samoan
ss   Swati
st   Southern Sotho

tn   Tswana
to   Tonga
ts   Tsonga
tw   Twi

ve   Venda

wo   Wolof

xh   Xhosa

za   Zhuang

B.3 The Ethiopic Script
=======================

Even though the Ethiopic script has more than 220 distinct characters,
with a little work Aspell can still handle it.  The idea is to split
each character into two parts based on the matrix representation.  The
first 3 bits will be the first part and could be mapped to `10000???'.
The next 6 bits will be the second part and could be mapped to
`11??????'.  The combined character will then be mapped with the upper
bits coming first.  Thus each Ethiopic syllable will have the form
`11?????? 10000???'.  By mapping the first and second parts to separate
8-bit characters it is easy to tell which part represents the consonant
and which part represents the vowel of the syllable.  This encoding of
the syllables is far more useful to Aspell than if they were stored in
UTF-8 or UTF-16.  In fact, the existing suggestion strategy of Aspell
will work well with this encoding without any additional modifications.
However, additional improvements may be possible by taking advantage of
the consonant-vowel structure of this encoding.

   In fact, the split consonant-vowel encoding may prove to be so useful
that it may be beneficial to encode other syllabaries in this fashion,
even if they have fewer than 220 characters.

   The code to break up a syllable into the consonant-vowel parts does
not exist as of Aspell 0.51.  However, it will be fairly easy to add
it as part of the Unicode normalization process once that is written.
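
   The split described in this section might look roughly like the
sketch below.  It is not Aspell code.  It assumes that the "matrix
representation" is the Unicode Ethiopic block U+1200 to U+137F, where
the low 3 bits of the offset select the vowel column and the remaining
6 bits select the consonant row.  The function names are hypothetical,
and the punctuation and digits that share the block are not treated
specially.

```python
ETHIOPIC_BASE = 0x1200  # assumed start of the Unicode Ethiopic block
ETHIOPIC_LAST = 0x137F  # assumed end of the block (9 bits of offset)


def split_syllable(ch):
    """Return the two-byte form `11?????? 10000???` for one character."""
    offset = ord(ch) - ETHIOPIC_BASE
    if not 0 <= offset <= ETHIOPIC_LAST - ETHIOPIC_BASE:
        raise ValueError(f"{ch!r} is outside the assumed Ethiopic block")
    vowel = 0x80 | (offset & 0x07)    # low 3 bits  -> 10000???
    consonant = 0xC0 | (offset >> 3)  # next 6 bits -> 11??????
    return bytes([consonant, vowel])  # upper bits come first


def join_syllable(pair):
    """Inverse of split_syllable: rebuild the character from two bytes."""
    consonant, vowel = pair
    if consonant & 0xC0 != 0xC0 or vowel & 0xF8 != 0x80:
        raise ValueError("not a consonant/vowel byte pair")
    offset = ((consonant & 0x3F) << 3) | (vowel & 0x07)
    return chr(ETHIOPIC_BASE + offset)
```

   Under these assumptions all characters in one row of the matrix
share the same first byte, so two syllables with the same consonant
differ in only one 8-bit character.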

B.4 The Thai Script
===================

The Thai script presents a different problem for Aspell.  The problem
is not that there are more than 220 unique symbols, but that there are
no spaces between words.  This means that there is no easy way to split
a sentence into individual words.  However, it is still possible to
spell check Thai; it is just a lot more difficult.  I will be happy to
work with someone who is interested in adding Thai support to Aspell,
but it is not likely something I will do in the foreseeable future.
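
   To show why the missing spaces make this hard, here is a sketch of
one common approach, greedy longest-match segmentation against a word
list.  This is a generic technique, not something Aspell implements
and not a proposal from the text above.  The function name and the toy
lexicon are made up for the example, and Latin letters stand in for
Thai text.

```python
def segment(text, lexicon):
    """Split `text` into words by always taking the longest known word.

    Characters that start no known word are emitted one at a time.
    """
    max_len = max((len(word) for word in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                words.append(text[i:j])
                i = j
                break
        else:
            words.append(text[i])  # unknown symbol, skip past it
            i += 1
    return words


toy_lexicon = {"the", "them", "me", "meat", "eat", "at"}
print(segment("themeat", toy_lexicon))
```

   With this toy lexicon the greedy choice yields "them" and "eat"
even though "the" and "meat" is an equally valid reading, and a
misspelled word would disturb the split of everything after it.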

B.5 Languages which use Hànzi Characters
========================================

Hànzi characters are used to write Chinese, Japanese, Korean, and were
once used to write Vietnamese.  Each hànzi character represents a
syllable of a spoken word and also has a meaning.  Since there are
around 3,000 of them in common usage it is unlikely that Aspell will
ever be able to support spell checking languages written using hànzi.
However, I am not even sure if these languages need spell checking
since hànzi characters are generally not entered in directly.
Furthermore, even if Aspell could spell check hànzi the existing
suggestion strategy will not work well at all, and thus a completely
new strategy will need to be developed.

B.6 Japanese
============

Modern Japanese is written in a mixture of hiragana, katakana, kanji,
and sometimes romaji.  Hiragana and katakana are both syllabaries unique
to Japan, kanji is a modified form of hànzi, and romaji uses the Latin
alphabet.  With some work, Aspell should be able to check the non-kanji
part of Japanese text.  However, based on my limited understanding of
Japanese, hiragana is often used at the end of kanji.  Thus if Aspell
were to simply separate out the hiragana from kanji it would end up with
a lot of word endings which are not proper words and will thus be
flagged as misspellings.
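
   The problem can be seen with a small sketch that splits text into
runs of the same script.  It is only an illustration, not Aspell code.
The Unicode ranges used are the basic Hiragana, Katakana, and CJK
Unified Ideographs blocks, and the function names are hypothetical.

```python
from itertools import groupby


def script_of(ch):
    """Classify one character by the Unicode block it falls in."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    return "other"


def script_runs(text):
    """Group consecutive characters of the same script together."""
    return [(script, "".join(chars))
            for script, chars in groupby(text, key=script_of)]


# The verb "to eat": one kanji followed by a hiragana ending.
print(script_runs("\u98df\u3079\u308b"))
```

   The single word comes back as a kanji run and a separate hiragana
run.  Checking the hiragana run on its own is exactly the situation
described above, since it is only the ending of the word.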

B.7 Languages Written in Multiple Scripts
=========================================

Aspell should be able to check text written in the same language but in
multiple scripts with some work.  If the number of unique symbols in
both scripts is less than 220 then a special character set can be used
to allow both scripts to be encoded in the same dictionary.  However,
this may not be the most efficient solution.  An alternate solution is
to store each script in its own dictionary and allow Aspell to choose
the correct dictionary based on which script the given word is written
in.  Aspell currently does not support this mode of spell checking;
however, it is something that I hope to eventually support.
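
   The second approach, one dictionary per script, could be sketched
as follows.  Aspell does not do this today, so everything here is
hypothetical: the function names, the use of plain sets as
dictionaries, and the heuristic of taking the first word of each
character's Unicode name as its script.

```python
import unicodedata


def word_script(word):
    """Return the single script a word is written in, or None if mixed."""
    scripts = {unicodedata.name(ch, "UNKNOWN").split()[0]
               for ch in word if ch.isalpha()}
    return scripts.pop() if len(scripts) == 1 else None


def choose_dictionary(word, dictionaries):
    """Pick the dictionary keyed by the word's script, if there is one."""
    return dictionaries.get(word_script(word))


# Toy dictionaries for a language written in both Latin and Cyrillic.
dictionaries = {"LATIN": {"grad"}, "CYRILLIC": {"град"}}
for word in ("grad", "град"):
    chosen = choose_dictionary(word, dictionaries)
    print(word, chosen is not None and word in chosen)
```

   A word that mixes scripts matches no dictionary in this sketch and
would need its own handling.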
-- 
http://kevin.atkinson.dhs.org



