[utf-8] Languages Which Aspell can Support
Kevin Atkinson
kevin at atkinson.dhs.org
Tue Mar 16 22:28:09 PST 2004
Since everyone gave me such a hard time about my decision to keep Aspell
8-bit internally, here is the text of an Appendix I just added to the
Aspell manual.
Appendix B Languages Which Aspell can Support
*********************************************
Even though Aspell will remain 8-bit internally, it should still be
able to support any written language not based on a logographic
script. The only logographic writing systems in current use are those
based on hanzi, which include Chinese, Japanese, and sometimes Korean.
B.1 Languages with 220 or Fewer Unique Symbols
==============================================
Aspell 0.51 should be able to support the following languages as, to
the best of my knowledge, they all contain 220 or fewer symbols and can
thus fit within an 8-bit character set. If an existing character set
does not exist then a new one can be invented. This is true even if
the script is not yet supported by Unicode, as the private use area can
be used.
Code  Language Name            Script              Dictionary  Gettext
                                                   Available   Translation
ab    Abkhazian                Cyrillic            -           -
ae    Avestan                  Avestan             -           -
af    Afrikaans                Latin               Yes         -
an    Aragonese                Latin               -           -
ar    Arabic                   Arabic              -           -
as    Assamese                 Bengali             -           -
ay    Aymara                   Latin               -           -
az    Azerbaijani              Arabic              -           -
az                             Cyrillic            -           -
az                             Latin               -           -
ba    Bashkir                  Cyrillic            -           -
be    Belarusian               Cyrillic            -           Yes
bg    Bulgarian                Cyrillic            Yes         -
bh    Bihari                   Devanagari          -           -
bn    Bengali                  Bengali             -           -
bo    Tibetan                  Tibetan             -           -
br    Breton                   Latin               Yes         -
bs    Bosnian                  Latin               -           -
ca    Catalan/Valencian        Latin               Yes         -
ce    Chechen                  Cyrillic            -           -
ch    Chamorro                 Latin               -           -
co    Corsican                 Latin               -           -
cr    Cree                     Canadian Syllabics  -           -
cr                             Latin               -           -
cs    Czech                    Latin               Yes         -
cv    Chuvash                  Cyrillic            -           -
cy    Welsh                    Latin               Yes         -
da    Danish                   Latin               Yes         -
de    German                   Latin               Yes         -
dv    Divehi                   Dhives Akuru        -           -
dz    Dzongkha                 Tibetan             -           -
el    Greek                    Greek               Yes         -
en    English                  Latin               Yes         -
eo    Esperanto                Latin               Yes         -
es    Spanish                  Latin               Yes         Incomplete
et    Estonian                 Latin               -           -
eu    Basque                   Latin               -           -
fa    Persian                  Arabic              -           -
fi    Finnish                  Latin               -           -
fj    Fijian                   Latin               -           -
fo    Faroese                  Latin               Yes         -
fr    French                   Latin               Yes         Yes
fy    Frisian                  Latin               -           -
ga    Irish                    Latin               Yes         Yes
gd    Scottish Gaelic          Latin               -           -
gl    Gallegan                 Latin               Yes         -
gn    Guarani                  Latin               -           -
gu    Gujarati                 Gujarati            -           -
gv    Manx                     Latin               -           -
ha    Hausa                    Latin               -           -
he    Hebrew                   Hebrew              -           -
hi    Hindi                    Devanagari          -           -
hr    Croatian                 Latin               Yes         -
hu    Hungarian                Latin               -           -
hy    Armenian                 Armenian            -           -
id    Indonesian               Arabic              -           -
id                             Latin               Yes         -
io    Ido                      Latin               -           -
is    Icelandic                Latin               Yes         -
it    Italian                  Latin               Yes         -
iu    Inuktitut                Canadian Syllabics  -           -
iu                             Latin               -           -
ja    Japanese                 Latin               -           -
jv    Javanese                 Javanese            -           -
jv                             Latin               -           -
ka    Georgian                 Georgian            -           -
kk    Kazakh                   Cyrillic            -           -
kl    Kalaallisut/Greenlandic  Latin               -           -
km    Khmer                    Khmer               -           -
kn    Kannada                  Kannada             -           -
ko    Korean                   Hangeul             -           -
kr    Kanuri                   Latin               -           -
ks    Kashmiri                 Arabic              -           -
ks                             Devanagari          -           -
ku    Kurdish                  Arabic              -           -
ku                             Cyrillic            -           -
ku                             Latin               -           -
kv    Komi                     Cyrillic            -           -
kw    Cornish                  Latin               -           -
ky    Kirghiz                  Arabic              -           -
ky                             Cyrillic            -           -
ky                             Latin               -           -
la    Latin                    Latin               -           -
lo    Lao                      Lao                 -           -
lt    Lithuanian               Latin               -           -
lv    Latvian                  Latin               -           -
mi    Maori                    Latin               Yes         -
mk    Makasar                  Lontara/Makasar     -           -
ml    Malayalam                Latin               -           -
ml                             Malayalam           -           -
mn    Mongolian                Cyrillic            -           -
mn                             Mongolian           -           -
mo    Moldavian                Cyrillic            -           -
mr    Marathi                  Devanagari          -           -
ms    Malay                    Arabic              -           -
ms                             Latin               Yes         -
mt    Maltese                  Latin               -           -
my    Burmese                  Myanmar             -           -
ne    Nepali                   Devanagari          -           -
nl    Dutch                    Latin               Yes         Yes
no    Norwegian                Latin               Yes         -
nv    Navajo                   Latin               -           -
oc    Occitan/Provencal        Latin               -           -
oj    Ojibwa                   Ojibwe              -           -
or    Oriya                    Oriya               -           -
os    Ossetic                  Cyrillic            -           -
pa    Punjabi                  Gurmukhi            -           -
pi    Pali                     Devanagari          -           -
pi                             Sinhala             -           -
pl    Polish                   Latin               Yes         -
pt    Portuguese               Latin               Yes         -
qu    Quechua                  Latin               -           -
rm    Raeto-Romance            Latin               -           -
ro    Romanian                 Latin               Yes         -
ru    Russian                  Cyrillic            Yes         -
sa    Sanskrit                 Devanagari          -           -
sa                             Sinhala             -           -
sd    Sindhi                   Arabic              -           -
sk    Slovak                   Latin               Yes         -
sl    Slovenian                Latin               Yes         -
sn    Shona                    Latin               -           -
so    Somali                   Latin               -           -
sq    Albanian                 Latin               -           -
sr    Serbian                  Cyrillic            -           Yes
su    Sundanese                Latin               -           -
sv    Swedish                  Latin               Yes         -
sw    Swahili                  Latin               -           -
ta    Tamil                    Tamil               -           -
te    Telugu                   Telugu              -           -
tg    Tajik                    Arabic              -           -
tg                             Cyrillic            -           -
tg                             Latin               -           -
tk    Turkmen                  Arabic              -           -
tk                             Cyrillic            -           -
tk                             Latin               -           -
tl    Tagalog                  Latin               -           -
tl                             Tagalog             -           -
tr    Turkish                  Arabic              -           -
tr                             Latin               -           -
tt    Tatar                    Cyrillic            -           -
ty    Tahitian                 Latin               -           -
ug    Uighur                   Arabic              -           -
ug                             Cyrillic            -           -
ug                             Latin               -           -
ug                             Uyghur              -           -
uk    Ukrainian                Cyrillic            Yes         -
ur    Urdu                     Arabic              -           -
uz    Uzbek                    Cyrillic            -           -
uz                             Latin               -           -
vi    Vietnamese               Latin               -           -
vo    Volapuk                  Latin               -           -
wa    Walloon                  Latin               -           Incomplete
yi    Yiddish                  Hebrew              -           -
yo    Yoruba                   Latin               -           -
zu    Zulu                     Latin               -           -
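As a rough illustration of the 220-symbol limit described at the start
of this section, the following Python sketch collects the distinct
symbols in a word list and assigns each one a byte value. It is not
Aspell code: the names `build_charset' and `encode', and the choice to
start numbering at byte 36 (256 - 220), are assumptions made only for
this example.

```python
MAX_SYMBOLS = 220  # limit quoted in the text above


def build_charset(words, first_code=256 - MAX_SYMBOLS):
    """Map every distinct symbol in `words` to one byte value.

    `first_code` is an assumption for this sketch; the text does not say
    which byte values a real character set would reserve.
    """
    symbols = sorted({ch for word in words for ch in word})
    if len(symbols) > MAX_SYMBOLS:
        raise ValueError(
            f"{len(symbols)} symbols do not fit in an 8-bit character set"
        )
    return {ch: first_code + i for i, ch in enumerate(symbols)}


def encode(word, table):
    """Encode a word as bytes using a table from build_charset()."""
    return bytes(table[ch] for ch in word)


if __name__ == "__main__":
    table = build_charset(["\u0430\u0431\u0432", "\u0432\u0430\u0431"])
    print(len(table), encode("\u0430\u0431", table))
```

A script with more distinct symbols than the limit makes
`build_charset' raise an error instead of returning a table.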
B.2 Languages in Which the Exact Script Used is Unknown
=======================================================
Aspell can most likely support any of the following languages; however,
I am unsure what script they are written in. Most of them are probably
written in the Latin script. If you have any information about these
languages please email me at <kevina at gnu.org>.
Code Language Name
aa Afar
ak Akan
av Avaric
bi Bislama
bm Bambara
ee Ewe
ff Fulah
ho Hiri Motu
ht Haitian Creole
hz Herero
ie Interlingue
ig Igbo
ii Sichuan Yi
ik Inupiaq
kg Kongo
ki Kikuyu/Gikuyu
kj Kwanyama
lb Luxembourgish
lg Ganda
li Limburgan
ln Lingala
lu Luba-Katanga
mg Malagasy
mh Marshallese
na Nauru
nb Norwegian Bokmal
nd North Ndebele
ng Ndonga
nn Nynorsk, Norwegian
nr South Ndebele
ny Nyanja
ps Pushto
rn Rundi
rw Kinyarwanda
sc Sardinian
se Northern Sami
sg Sango
si Sinhalese
sm Samoan
ss Swati
st Southern Sotho
tn Tswana
to Tonga
ts Tsonga
tw Twi
ve Venda
wo Wolof
xh Xhosa
za Zhuang
B.3 The Ethiopic Script
=======================
Even though the Ethiopic script has more than 220 distinct characters,
with a little work Aspell can still handle it. The idea is to split
each character into two parts based on the matrix representation. The
lowest 3 bits will be the first part and could be mapped to `10000???'.
The next 6 bits will be the second part and could be mapped to
`11??????'. The combined character will then be mapped with the upper
bits coming first. Thus each Ethiopic syllable will have the form
`11?????? 10000???'. By mapping the first and second parts to separate
8-bit characters it is easy to tell which part represents the consonant
and which part represents the vowel of the syllable. This encoding of
the syllabary is far more useful to Aspell than if it were stored in
UTF-8 or UTF-16. In fact, the existing suggestion strategy of Aspell
will work well with this encoding without any additional modifications.
However, additional improvements may be possible by taking advantage of
the consonant-vowel structure of this encoding.
In fact, the split consonant-vowel form may prove to be so useful that
it may be beneficial to encode other syllabaries in this fashion, even
if they have fewer than 220 symbols.
The code to break up a syllable into its consonant and vowel parts does
not exist as of Aspell 0.51. However, it will be fairly easy to add
it as part of the Unicode normalization process once that is written.
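The two-byte split described above could look roughly like the
following Python sketch. It is only an illustration, not Aspell code.
It assumes that the 9 bits in question are the offset of the character
from U+1200, the start of the Unicode Ethiopic block, and the function
names are made up for the example. The block also holds punctuation
and digits, which this sketch does not treat specially.

```python
ETHIOPIC_BASE = 0x1200   # start of the Unicode Ethiopic block (assumed origin)
ETHIOPIC_SIZE = 0x180    # U+1200..U+137F, so every offset fits in 9 bits


def split_syllable(ch):
    """Return the two bytes `11?????? 10000???` for one Ethiopic character."""
    offset = ord(ch) - ETHIOPIC_BASE
    if not 0 <= offset < ETHIOPIC_SIZE:
        raise ValueError(f"{ch!r} is not in the Ethiopic block")
    vowel = offset & 0b111       # lowest 3 bits
    consonant = offset >> 3      # next 6 bits
    return bytes([0b11000000 | consonant, 0b10000000 | vowel])


def join_syllable(pair):
    """Inverse of split_syllable()."""
    consonant = pair[0] & 0b00111111
    vowel = pair[1] & 0b00000111
    return chr(ETHIOPIC_BASE + ((consonant << 3) | vowel))


if __name__ == "__main__":
    encoded = split_syllable("\u1209")
    print(encoded.hex(), join_syllable(encoded) == "\u1209")
```

With this layout, two syllables that share a consonant differ only in
the second byte, which is the property the text expects the suggestion
strategy to benefit from.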
B.4 The Thai Script
===================
The Thai script presents a different problem for Aspell. The problem
is not that there are more than 220 unique symbols, but that there are
no spaces between words. This means that there is no easy way to split
a sentence into individual words. However, it is still possible to
spell check Thai; it is just a lot more difficult. I will be happy to
work with anyone who is interested in adding Thai support to Aspell,
but it is not likely something I will do in the foreseeable future.
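To make the word-splitting problem concrete, here is a small Python
sketch of greedy longest-match segmentation against a word list. This
is one common textbook approach, not something Aspell implements, and
the function name and the toy word list (Latin letters written without
spaces, standing in for Thai) are assumptions for the example.

```python
def segment(text, lexicon):
    """Split unspaced text into words by greedy longest match.

    Characters that start no known word are emitted one at a time, so a
    spell checker could flag them.
    """
    max_len = max(map(len, lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + size] in lexicon:
                break
        else:
            size = 1  # no known word starts here
        words.append(text[i:i + size])
        i += size
    return words


if __name__ == "__main__":
    print(segment("thecatsat", {"the", "cat", "sat"}))
```

Greedy matching can pick the wrong split when one word is a prefix of
another, which is part of why the task is a lot more difficult than
splitting on spaces.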
B.5 Languages which use Hànzi Characters
========================================
Hànzi characters are used to write Chinese, Japanese, and Korean, and
were once used to write Vietnamese. Each hànzi character represents a
syllable of a spoken word and also has a meaning. Since there are
around 3,000 of them in common usage it is unlikely that Aspell will
ever be able to support spell checking languages written using hànzi.
However, I am not even sure if these languages need spell checking
since hànzi characters are generally not entered directly.
Furthermore, even if Aspell could spell check hànzi, the existing
suggestion strategy would not work well at all, and thus a completely
new strategy would need to be developed.
B.6 Japanese
============
Modern Japanese is written in a mixture of hiragana, katakana, kanji,
and sometimes romaji. Hiragana and katakana are both syllabaries unique
to Japan, kanji is a modified form of hànzi, and romaji uses the Latin
alphabet. With some work, Aspell should be able to check the non-kanji
part of Japanese text. However, based on my limited understanding of
Japanese, hiragana is often used at the end of kanji. Thus if Aspell
were to simply separate out the hiragana from the kanji it would end up
with a lot of word endings which are not proper words and would thus be
flagged as misspellings.
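The problem with separating hiragana from kanji can be seen in a short
Python sketch that splits text into runs by script. The Unicode ranges
used are the main Hiragana, Katakana, and CJK Unified Ideographs
blocks; the function names are made up for the example, and none of
this is Aspell code.

```python
from itertools import groupby


def char_script(ch):
    """Classify one character by Unicode block (main blocks only)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


def split_runs(text):
    """Group consecutive characters that belong to the same script."""
    return [
        (script, "".join(chars))
        for script, chars in groupby(text, key=char_script)
    ]


if __name__ == "__main__":
    # One verb written as a kanji stem followed by a hiragana ending.
    print(split_runs("\u98df\u3079\u308b"))
```

For a verb written as one kanji followed by a hiragana ending, the
hiragana run comes out as a separate piece that is only a word ending,
which is exactly what would be flagged as a misspelling.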
B.7 Languages Written in Multiple Scripts
=========================================
Aspell should be able to check text written in the same language but in
multiple scripts with some work. If the number of unique symbols in
both scripts is less than 220 then a special character set can be used
to allow both scripts to be encoded in the same dictionary. However,
this may not be the most efficient solution. An alternate solution is
to store each script in its own dictionary and allow Aspell to choose
the correct dictionary based on which script the given word is written
in. Aspell currently does not support this mode of spell checking;
however, it is something that I hope to eventually support.
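The per-script dictionary idea could be sketched in Python as follows.
This is only an illustration under stated assumptions: it guesses the
script from the first word of each character's Unicode name, which is
a heuristic, and `script_name' and `choose_dictionary' are hypothetical
names, not part of Aspell.

```python
import unicodedata


def script_name(word):
    """Return the script of a word, or None if it is mixed or unknown.

    Heuristic: the first token of the Unicode character name, such as
    LATIN or CYRILLIC.
    """
    scripts = {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in word
        if ch.isalpha()
    }
    if len(scripts) != 1:
        return None
    return scripts.pop()


def choose_dictionary(word, dictionaries):
    """Pick the dictionary registered for the word's script, if any."""
    return dictionaries.get(script_name(word))


if __name__ == "__main__":
    dictionaries = {"LATIN": {"srpski"}, "CYRILLIC": set()}
    chosen = choose_dictionary("srpski", dictionaries)
    print(chosen is dictionaries["LATIN"])
```

Words that mix scripts get no dictionary here; a real implementation
would need to decide how to report them.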
--
http://kevin.atkinson.dhs.org