[utf-8] Language Info Needed for GNU Aspell
Kevin Atkinson
kevina at gnu.org
Tue Mar 23 12:43:30 PST 2004
[Please distribute this document as widely as possible.]
GNU Aspell 0.60 should be able to support most of the World's Languages.
This includes languages written in Arabic and other scripts
not well supported by an existing 8-bit character set. Eventually
Aspell should be able to support any current language not based on the
Chinese writing system.
GNU Aspell is a spell checker designed to eventually replace Ispell.
Its main feature is that it does a much better job of coming up with
possible suggestions than just about any other spell checker out there
for the English language, including Ispell and Microsoft Word.
However, starting with Aspell 0.60 it should also be the only Free (as
in Freedom) spell checker that can support most languages not written
in the Latin or Cyrillic scripts.
However I, the author of Aspell, know very little about foreign
languages (i.e., non-English) and what it takes to correctly spell check
them. Thus, I need other people to educate me.
If you speak a foreign language I would appreciate it if you would take
the time to look over the following material and email me with any
additional information you may have.
The first part gives a thorough analysis of the languages which Aspell
can and cannot support. If you find any of this information is
incorrect please inform me at kevina at gnu.org.
When Aspell 0.60 is released I would like to have dictionaries
available for as many languages as possible.
Therefore, if you know of a Free word list available for a language that
is not currently listed as having a dictionary available I would
appreciate hearing from you. I am especially interested in working
with someone to add support for languages written in the Arabic
script. The encoding of Arabic is quite complicated and I want to
be sure that Aspell can correctly handle it.
I would also appreciate some help converting Ispell dictionaries to
Aspell. So, if you would like to help convert some of the dictionaries
listed as being available for Ispell please contact me.
The second part lists language-related issues involved in correctly
spell checking a document. If you can offer any additional insight on
any of the issues discussed, or know of any additional complications
when spell checking a given language, I would appreciate hearing from
you.
The last part discusses why Aspell uses 8-bit characters internally
for your reading pleasure.
All of this material is also included in the Aspell 0.60 manual which
you can find at http://aspell.net/devel-doc/man.
Languages Which Aspell can Support
**********************************
Even though Aspell will remain 8-bit internally it should still be
able to support any written language not based on a logographic
script. The only logographic writing systems in current use are those
based on hànzi, which include Chinese, Japanese, and sometimes Korean.
Supported
=========
Aspell 0.60 should be able to support the following languages as, to the
best of my knowledge, they all contain 220 or fewer symbols and can
thus fit within an 8-bit character set. If an existing character set
does not exist then a new one can be invented. This is true even if the
script is not yet supported by Unicode, as the private use area can be
used.
Code Language Name Script Dictionary-Available Gettext-Translation
aa Afar Latin - -
ab Abkhazian Cyrillic - -
ae Avestan Avestan - -
af Afrikaans Latin Yes -
ak Akan Latin - -
an Aragonese Latin - -
ar Arabic Arabic - -
as Assamese Bengali - -
av Avar Cyrillic - -
ay Aymara Latin - -
az Azerbaijani Cyrillic - -
az Latin - -
ba Bashkir Cyrillic - -
be Belarusian Cyrillic Planned Yes
bg Bulgarian Cyrillic Yes -
bh Bihari Devanagari - -
bi Bislama Latin - -
bm Bambara Latin - -
bn Bengali Bengali Planned -
bo Tibetan Tibetan - -
br Breton Latin Yes -
bs Bosnian Latin - -
ca Catalan/Valencian Latin Yes -
ce Chechen Cyrillic - -
ch Chamorro Latin - -
co Corsican Latin - -
cr Cree Latin - -
cs Czech Latin Yes Yes
cu Old Slavonic Cyrillic - -
cv Chuvash Cyrillic - -
cy Welsh Latin Yes -
da Danish Latin Yes -
de German Latin Yes Yes
dv Divehi Thaana - -
dz Dzongkha Tibetan - -
ee Ewe Latin - -
el Greek Greek Yes -
en English Latin Yes Yes
eo Esperanto Latin Yes -
es Spanish Latin Yes Incomplete
et Estonian Latin Planned -
eu Basque Latin - -
fa Persian Arabic - -
ff Fulah Latin - -
fi Finnish Latin Planned -
fj Fijian Latin - -
fo Faroese Latin Yes -
fr French Latin Yes Yes
fy Frisian Latin - -
ga Irish Latin Yes Yes
gd Scottish Gaelic Latin Planned -
gl Gallegan Latin Yes -
gn Guarani Latin - -
gu Gujarati Gujarati - -
gv Manx Latin Planned -
ha Hausa Latin - -
he Hebrew Hebrew Planned -
hi Hindi Devanagari - -
ho Hiri Motu Latin - -
hr Croatian Latin Yes -
ht Haitian Creole Latin - -
hu Hungarian Latin Planned -
hy Armenian Armenian - -
hz Herero Latin - -
ia Interlingua (IALA) Latin Yes -
id Indonesian Latin Yes -
ie Interlingue Latin - -
ig Igbo Latin - -
ik Inupiaq Latin - -
io Ido Latin - -
is Icelandic Latin Yes -
it Italian Latin Yes -
iu Inuktitut Latin - -
jv Javanese Javanese - -
jv Latin - -
ka Georgian Georgian - -
kg Kongo Latin - -
ki Kikuyu/Gikuyu Latin - -
kj Kwanyama Latin - -
kk Kazakh Cyrillic - -
kl Kalaallisut/Greenlandic Latin - -
kn Kannada Kannada - -
ko Korean Hangeul - -
kr Kanuri Latin - -
ks Kashmiri Arabic - -
ks Devanagari - -
ku Kurdish Arabic - -
ku Cyrillic - -
ku Latin - -
kv Komi Cyrillic - -
kw Cornish Latin - -
ky Kirghiz Cyrillic - -
ky Latin - -
la Latin Latin - -
lb Luxembourgish Latin Planned -
lg Ganda Latin - -
li Limburgan Latin - -
ln Lingala Latin - -
lo Lao Lao - -
lt Lithuanian Latin Planned -
lu Luba-Katanga Latin - -
lv Latvian Latin - -
mg Malagasy Latin - -
mh Marshallese Latin - -
mi Maori Latin Yes -
mk Makasar Lontara/Makasar - -
ml Malayalam Latin - -
ml Malayalam - -
mn Mongolian Cyrillic - -
mn Mongolian - -
mo Moldavian Cyrillic - -
mr Marathi Devanagari - -
ms Malay Latin Yes -
mt Maltese Latin Planned -
my Burmese Myanmar - -
na Nauruan Latin - -
nb Norwegian Bokmal Latin Yes -
nd North Ndebele Latin - -
ne Nepali Devanagari - -
ng Ndonga Latin - -
nl Dutch Latin Yes Yes
nn Norwegian Nynorsk Latin Yes -
nr South Ndebele Latin - -
nv Navajo Latin - -
ny Nyanja Latin - -
oc Occitan/Provencal Latin - -
or Oriya Oriya - -
os Ossetic Cyrillic - -
pa Punjabi Gurmukhi - -
pi Pali Devanagari - -
pi Sinhala - -
pl Polish Latin Yes -
ps Pushto Arabic - -
pt Portuguese Latin Yes Yes
qu Quechua Latin - -
rm Raeto-Romance Latin - -
rn Rundi Latin - -
ro Romanian Latin Yes Yes
ru Russian Cyrillic Yes Yes
rw Kinyarwanda Latin - -
sa Sanskrit Devanagari - -
sc Sardinian Latin - -
sd Sindhi Arabic - -
se Northern Sami Latin - -
sg Sango Latin - -
si Sinhalese Sinhala - -
sk Slovak Latin Yes -
sl Slovenian Latin Yes -
sm Samoan Latin - -
sn Shona Latin - -
so Somali Latin - -
sq Albanian Latin Planned -
sr Serbian Cyrillic - Yes
sr Latin - -
ss Swati Latin - -
st Southern Sotho Latin - -
su Sundanese Latin - -
sv Swedish Latin Yes -
sw Swahili Latin Planned -
ta Tamil Tamil Planned -
te Telugu Telugu - -
tg Tajik Latin - -
tk Turkmen Latin - -
tl Tagalog Latin - -
tl Tagalog - -
tn Tswana Latin - -
to Tonga Latin - -
tr Turkish Latin - -
ts Tsonga Latin - -
tt Tatar Cyrillic - -
tw Twi Latin - -
ty Tahitian Latin - -
ug Uighur Arabic - -
ug Cyrillic - -
ug Latin - -
uk Ukrainian Cyrillic Yes -
ur Urdu Arabic - -
uz Uzbek Cyrillic - -
uz Latin - -
ve Venda Latin - -
vi Vietnamese Latin - -
vo Volapuk Latin - -
wa Walloon Latin Planned Incomplete
wo Wolof Latin - -
xh Xhosa Latin - -
yi Yiddish Hebrew - -
yo Yoruba Latin - -
za Zhuang Latin - -
zu Zulu Latin Planned -
Notes on Latin Languages
------------------------
Any word that can be written using one of the Latin ISO-8859 character
sets (ISO-8859-1,2,3,4,9,10,13,14,15,16) can be written, in decomposed
form, using the ASCII characters, the 23 additional letters:
U+00C6 LATIN CAPITAL LETTER AE
U+00D0 LATIN CAPITAL LETTER ETH
U+00D8 LATIN CAPITAL LETTER O WITH STROKE
U+00DE LATIN CAPITAL LETTER THORN
U+00FE LATIN SMALL LETTER THORN
U+00DF LATIN SMALL LETTER SHARP S
U+00E6 LATIN SMALL LETTER AE
U+00F0 LATIN SMALL LETTER ETH
U+00F8 LATIN SMALL LETTER O WITH STROKE
U+0110 LATIN CAPITAL LETTER D WITH STROKE
U+0111 LATIN SMALL LETTER D WITH STROKE
U+0126 LATIN CAPITAL LETTER H WITH STROKE
U+0127 LATIN SMALL LETTER H WITH STROKE
U+0131 LATIN SMALL LETTER DOTLESS I
U+0138 LATIN SMALL LETTER KRA
U+0141 LATIN CAPITAL LETTER L WITH STROKE
U+0142 LATIN SMALL LETTER L WITH STROKE
U+014A LATIN CAPITAL LETTER ENG
U+014B LATIN SMALL LETTER ENG
U+0152 LATIN CAPITAL LIGATURE OE
U+0153 LATIN SMALL LIGATURE OE
U+0166 LATIN CAPITAL LETTER T WITH STROKE
U+0167 LATIN SMALL LETTER T WITH STROKE
and the 14 modifiers:
U+0300 COMBINING GRAVE ACCENT
U+0301 COMBINING ACUTE ACCENT
U+0302 COMBINING CIRCUMFLEX ACCENT
U+0303 COMBINING TILDE
U+0304 COMBINING MACRON
U+0306 COMBINING BREVE
U+0307 COMBINING DOT ABOVE
U+0308 COMBINING DIAERESIS
U+030A COMBINING RING ABOVE
U+030B COMBINING DOUBLE ACUTE ACCENT
U+030C COMBINING CARON
U+0326 COMBINING COMMA BELOW
U+0327 COMBINING CEDILLA
U+0328 COMBINING OGONEK
This is a total of 37 additional Unicode code points.
All ISO-8859 character sets leave the characters 0x00 - 0x1F and 0x80 -
0x9F unmapped as they are generally used as control characters. Of
those, 0x02 - 0x1F and 0x80 - 0x9F may be mapped to anything in Aspell.
This is a total of 62 characters which can be remapped in any ISO-8859
character set. Thus, by remapping 37 of the 62 characters to the
previously specified Unicode code points, any modified ISO-8859 character
set can be used for any Latin language covered by ISO-8859. Of course
decomposing every single accented character wastes a lot of space, so
only characters that cannot be represented in the precomposed form
should be broken up. By using this trick it is possible to store
foreign words in the correctly accented form in the dictionary even if
the precomposed character is not in the current character set.
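As a rough illustration of this trick, here is a sketch in Python (not
Aspell's actual code). It assumes ISO-8859-1 as the base character set
and assigns the 14 modifiers to the slots 0x80 - 0x8D; that slot
assignment and the name `encode_word' are made up for the example. The
23 additional letters are left out to keep it short, so a letter such as
L WITH STROKE would raise an error here.

```python
import unicodedata

# The 14 combining marks listed above. Each one gets a remappable slot
# starting at 0x80 (an arbitrary assignment chosen for this sketch).
MODIFIERS = [0x0300, 0x0301, 0x0302, 0x0303, 0x0304, 0x0306, 0x0307,
             0x0308, 0x030A, 0x030B, 0x030C, 0x0326, 0x0327, 0x0328]
SLOT = {chr(cp): 0x80 + i for i, cp in enumerate(MODIFIERS)}


def encode_word(word, base_charset="iso-8859-1"):
    """Encode a word in the modified 8-bit set, decomposing only when needed."""
    out = bytearray()
    for ch in unicodedata.normalize("NFC", word):
        try:
            # The precomposed form exists in the base set: keep it as one byte.
            out += ch.encode(base_charset)
        except UnicodeEncodeError:
            # Otherwise break it up into base letter + remapped modifiers.
            for part in unicodedata.normalize("NFD", ch):
                if part in SLOT:
                    out.append(SLOT[part])
                else:
                    out += part.encode(base_charset)
    return bytes(out)
```

With these assumptions `encode_word("caf\u00e9")' stays four bytes long
because e-acute is in ISO-8859-1, while the o with double acute in
"Erd\u0151s" is stored as `o' followed by the remapped modifier byte.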
Any letter in the Unicode ranges U+0000..U+024F and U+1E00..U+1EFF
(Basic Latin, Latin-1 Supplement, Latin Extended-A, Latin Extended-B,
and Latin Extended Additional) can be represented using around 175
basic letters and 25 modifiers, which is less than 220 and can thus fit
in an Aspell 8-bit character set. Since this Unicode range covers any
possible Latin language this special character set can be used to
represent any word written using the Latin script if so desired.
Hangeul
-------
Korean is generally written in hangeul or a mixture of hanja and hangeul.
Aspell should be able to spell check the hangeul part of the writing.
In hangeul, individual letters, known as jamo, are grouped
together in syllable blocks. Unicode provides code points for both jamo
and the combined syllable blocks. The syllable blocks will need to be
decomposed into jamo in order for Aspell to spell check them.
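The decomposition itself is purely arithmetic. The following Python
sketch follows the standard Unicode Hangul syllable decomposition; the
function name `to_jamo' is only for illustration, and this is not code
from Aspell.

```python
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
S_COUNT = 19 * V_COUNT * T_COUNT  # 11172 precomposed syllable blocks


def to_jamo(text):
    """Replace each precomposed syllable block with its conjoining jamo."""
    out = []
    for ch in text:
        index = ord(ch) - S_BASE
        if 0 <= index < S_COUNT:
            lead, rest = divmod(index, V_COUNT * T_COUNT)
            vowel, tail = divmod(rest, T_COUNT)
            out.append(chr(L_BASE + lead))
            out.append(chr(V_BASE + vowel))
            if tail:  # tail 0 means "no final consonant"
                out.append(chr(T_BASE + tail))
        else:
            out.append(ch)  # not a syllable block: leave untouched
    return "".join(out)
```

Python's `unicodedata.normalize("NFD", text)' produces the same jamo
sequence for syllable blocks, so in practice the library call could be
used instead.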
Syllabic
========
Syllabic languages use a separate symbol for each syllable of the
language. Since most of them have more than 220 distinct characters
Aspell cannot support them as is. However, all hope is not lost as
Aspell will most likely be able to support them in the future.
Code Language Name Script
am Amharic Ethiopic
cr Cree Canadian Syllabics
ii Sichuan Yi Yi
iu Inuktitut Canadian Syllabics
oj Ojibwa Ojibwe
om Oromo Ethiopic
ti Tigrinya Ethiopic
The Ethiopic Syllabary
----------------------
Even though the Ethiopic script has more than 220 distinct characters,
with a little work Aspell can still handle it. The idea is to split
each character into two parts based on the matrix representation. The
low 3 bits will be the first part and could be mapped to `10000???'.
The next 6 bits will be the second part and could be mapped to
`11??????'. The combined character will then be mapped with the upper
bits coming first. Thus each Ethiopic syllable will have the form
`11?????? 10000???'. By mapping the first and second parts to separate
8-bit characters it is easy to tell which part represents the consonant
and which part represents the vowel of the syllable. This encoding of
the syllabary is far more useful to Aspell than if it were stored in
UTF-8 or UTF-16. In fact, the existing suggestion strategy of Aspell
will work well with this encoding without any additional
modifications. However, additional improvements may be possible by
taking advantage of the consonant-vowel structure of this encoding.
In fact, the split consonant-vowel representation may prove to be so
useful that it may be beneficial to encode other syllabaries in this
fashion, even if they have fewer than 220 symbols.
The code to break up a syllable into the consonant-vowel parts does
not exist as of Aspell 0.60. However, it will be fairly easy to add
it as part of the Unicode normalization process once that is written.
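To make the bit layout concrete, here is a small Python sketch. It
assumes that the "matrix" is the layout of the Unicode Ethiopic block
(U+1200..U+137F), where each row of eight code points shares a
consonant; the function name is hypothetical, and the block also holds a
few punctuation marks and digits that a real implementation would have
to treat separately.

```python
ETHIOPIC_BASE = 0x1200   # start of the Unicode Ethiopic block
ETHIOPIC_SIZE = 0x180    # U+1200..U+137F


def split_syllable(ch):
    """Return the two 8-bit characters `11?????? 10000???' for one symbol."""
    offset = ord(ch) - ETHIOPIC_BASE
    if not 0 <= offset < ETHIOPIC_SIZE:
        raise ValueError("not in the Ethiopic block")
    upper = offset >> 3      # 6 bits: row of the matrix (the consonant)
    lower = offset & 0b111   # 3 bits: column of the matrix (the vowel form)
    return bytes([0b11000000 | upper, 0b10000000 | lower])
```

Two syllables from the same row then share their first byte and differ
only in the second, which is what makes the consonant-vowel structure
visible to the suggestion code.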
The Yi Syllabary
----------------
A very large syllabary with 819 distinct symbols. However, like
Ethiopic, it should be possible to support this script by breaking it
up.
The Unified Canadian Aboriginal Syllabics
-----------------------------------------
Another very large syllabary.
The Ojibwe Syllabary
--------------------
With only 120 distinct symbols, Aspell can actually support this one as
is. However, as previously mentioned, it may be beneficial to break it
up into the consonant-vowel representation anyway.
Unsupported
===========
These languages, when written in the given script, are currently
unsupported by Aspell for one reason or another.
Code Language Name Script
ja Japanese Japanese
km Khmer Khmer
ko Korean Hanja + Hangeul
pi Pali Thai
th Thai Thai
zh Chinese Hànzi
The Thai and Khmer Scripts
--------------------------
The Thai and Khmer scripts present a different problem for Aspell. The
problem is not that there are more than 220 unique symbols, but that
there are no spaces between words. This means that there is no easy way
to split a sentence into individual words. However, it is still
possible to spell check these scripts; it is just a lot more difficult.
I will be happy to work with someone who is interested in adding Thai
or Khmer support to Aspell, but it is not likely something I will do in
the foreseeable future.
Languages which use Hànzi Characters
------------------------------------
Hànzi characters are used to write Chinese, Japanese, and Korean, and were
once used to write Vietnamese. Each hànzi character represents a
syllable of a spoken word and also has a meaning. Since there are
around 3,000 of them in common usage it is unlikely that Aspell will
ever be able to support spell checking languages written using hànzi.
However, I am not even sure if these languages need spell checking since
hànzi characters are generally not entered in directly. Furthermore,
even if Aspell could spell check hànzi the existing suggestion strategy
will not work well at all, and thus a completely new strategy will need
to be developed.
Japanese
--------
Modern Japanese is written in a mixture of "hiragana", "katakana",
"kanji", and sometimes "romaji". "Hiragana" and "katakana" are both
syllabaries unique to Japan, "kanji" is a modified form of hànzi, and
"romaji" uses the Latin alphabet. With some work, Aspell should be
able to check the non-kanji part of Japanese text. However, based on
my limited understanding of Japanese, hiragana is often used at the end
of kanji. Thus if Aspell were to simply separate out the hiragana from
kanji it would end up with a lot of word endings which are not proper
words and will thus be flagged as misspellings. However, this can be
fairly easily rectified as text is tokenized into words before it is
converted into Aspell's internal encoding. In fact, some Japanese text
is written entirely in one script. For example, books for children
and foreigners are sometimes written entirely in hiragana. Thus,
Aspell could prove at least somewhat useful for spell checking Japanese.
Languages Written in Multiple Scripts
=====================================
Aspell should be able to check text written in the same language, but in
multiple scripts, with some work. If the number of unique symbols in
both scripts is less than 220 then a special character set can be used
to allow both scripts to be encoded in the same dictionary. However
this may not be the most efficient solution. An alternate solution is
to store each script in its own dictionary and allow Aspell to choose
the correct dictionary based on which script the given word is written
in. Aspell currently does not support this mode of spell checking;
however, it is something that I hope to eventually support.
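The alternate solution might look roughly like the Python sketch below.
Aspell has no such mode, so everything here is hypothetical: the function
names, the use of the first token of a character's Unicode name as a
stand-in for its script, and the representation of a dictionary as a
plain set of words.

```python
import unicodedata


def script_of(word):
    """Guess a word's script from the Unicode name of its first letter."""
    for ch in word:
        if ch.isalpha():
            # Names look like "CYRILLIC SMALL LETTER ES"; the first token
            # is used here as a rough approximation of the script.
            return unicodedata.name(ch, "UNKNOWN").split()[0]
    return "UNKNOWN"


def pick_dictionary(word, dictionaries):
    """Return the word set registered for the word's script, or None."""
    return dictionaries.get(script_of(word))
```

A caller would keep one word set per script, for example
`{"LATIN": ..., "CYRILLIC": ...}' for Serbian, and look each word up only
in the set that `pick_dictionary' returns.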
Notes on Planned Dictionaries
=============================
be Belarusian Ispell Dictionary available
bn Bengali Unofficial Aspell Dictionary available
`http://www.bengalinux.org/downloads/'
et Estonian Ispell Dictionary available
fi Finnish Ispell Dictionary available
gd Scottish Ispell Dictionary available.
Gaelic `http://packages.debian.org/unstable/text/igaelic'
gv Manx Ispell Dictionary available.
`http://packages.debian.org/unstable/text/imanx'
he Hebrew Ispell Dictionary available
hu Hungarian MySpell dictionary expanded to over 500 MB. Will add
once affix support is worked into the dictionary
package system.
lb Luxembourgish MySpell dictionary planned.
lt Lithuanian MySpell dictionary expanded to over 500 MB. Will add
once affix support is worked into the dictionary
package system.
mt Maltese Unofficial Aspell Dictionary available, but broken
link to source.
`http://linux.org.mt/article/spellcheck'
sq Albanian Ispell Dictionary available
sw Swahili Available at
`http://sourceforge.net/projects/translate'. Official
version coming soon.
ta Tamil Word list available at
`http://www.developer.thamizha.com/spellchecker/index.html'.
Working with them to create an Aspell dictionary.
wa Walloon Ispell Dictionary available
zu Zulu Available at
`http://sourceforge.net/projects/translate'. Official
version coming soon.
References
==========
The information in this chapter was gathered from numerous sources,
including:
* ISO 639-2 Registration Authority,
`http://www.loc.gov/standards/iso639-2/'
* Languages and Scripts (Official Unicode Site),
`http://www.unicode.org/onlinedat/languages-scripts.html'
* Omniglot - a guide to written language, `http://www.omniglot.com/'
* Wikipedia - The Free Encyclopedia, `http://wikipedia.org/'
* Ethnologue - Languages of the World, `http://www.ethnologue.com/'
* World Languages - The Ultimate Language Store,
`http://www.worldlanguage.com/'
* South African Languages Web, `http://www.languages.web.za/'
* The Languages and Writing Systems of Africa (Global Advisor
Newsletter), `http://www.intersolinc.com/newsletters/africa.htm'
Special thanks go to Era Eriksson for helping me with the information in
this chapter.
Language Related Issues
***********************
Here are some language related issues that a good spell checker needs to
handle. If you have any more information about any of these issues, or
of a new issue not discussed here, please email me at <kevina at gnu.org>.
German Sharp S
==============
The German Sharp S or Eszett does not have an uppercase equivalent.
Instead, when uppercased, `ß' is converted to `SS'. The conversion of
`ß' to `SS' requires a special rule, and increases the length of a word,
thus disallowing in-place case conversion. Furthermore, my general rule
of converting all words to lowercase before looking them up in the
dictionary won't work because the conversion of `SS' to lowercase is
ambiguous; it can be `ss' or `ß'. I do plan on dealing with this
eventually, however.
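Both problems are easy to reproduce with Python's built-in string
methods, which apply the full Unicode case mapping. The helper
`lowercase_candidates' below is only one conceivable way of coping with
the ambiguity, sketched for illustration; it is not something Aspell
does.

```python
word = "stra\u00dfe"                  # "strasse" written with a sharp s
upper = word.upper()                  # full case mapping: sharp s -> SS
assert upper == "STRASSE"
assert len(upper) == len(word) + 1    # longer, so no in-place conversion

# Lowercasing cannot know whether SS came from ss or from a sharp s.
assert upper.lower() == "strasse"


def lowercase_candidates(word):
    """Return every lowercase spelling an all-caps word might stand for."""
    results = [""]
    i = 0
    while i < len(word):
        if word.startswith("SS", i):
            # Either a plain "ss" or a sharp s.
            results = [r + s for r in results for s in ("ss", "\u00df")]
            i += 2
        else:
            results = [r + word[i].lower() for r in results]
            i += 1
    return results
```

A lookup could then accept an all-caps word if any of the candidates is
in the dictionary, at the cost of extra lookups for words containing
`SS'.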
Compound Words
==============
In some languages, such as German, it is acceptable to string two words
together, thus forming a compound word. However, there are rules for
when this can be done. Furthermore, it is not always sufficient to
simply concatenate the two words. For example, sometimes a letter is
inserted between the two words. I tried implementing support for
compound words in Aspell but it was too limiting and no one used it.
Before I try implementing it again I want to know all the issues
involved.
Context Sensitive Spelling
==========================
In some languages, such as Luxembourgish, the spelling of a word depends
on which words surround it. For example, the letter `n' at the end
of a word will disappear if it is followed by another word starting
with a certain letter such as an `s'. However, it can probably get
more complicated than that. I would like to know how complicated before
I attempt to implement support for context sensitive spelling.
Unicode Normalization
=====================
Because Unicode contains a large number of precomposed characters there
are multiple ways a character can be represented. For example, the letter
å can either be represented as
U+00E5 LATIN SMALL LETTER A WITH RING ABOVE
or
U+0061 LATIN SMALL LETTER A + U+030A COMBINING RING ABOVE
By performing normalization first Aspell will only see one of these
representations. The exact form of normalization depends on the
language. Given the choice of
1. Precomposed character
2. Base letter + combining character(s)
3. Base letter only
if the precomposed character is in the target character set then (1), if
both the base and combining characters are present then (2), otherwise (3).
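The three-way choice can be sketched in Python as follows. This is an
illustration only: the target character set is modelled as a plain set
of characters, and the function name is made up. A character that is not
in the set and has no decomposition is passed through unchanged, which a
real implementation would need to handle.

```python
import unicodedata


def normalize_for(word, charset):
    """Apply choices (1)-(3) per character; charset is a set of characters."""
    out = []
    for ch in unicodedata.normalize("NFC", word):
        parts = unicodedata.normalize("NFD", ch)
        if ch in charset:
            out.append(ch)          # (1) precomposed character
        elif all(p in charset for p in parts):
            out.append(parts)       # (2) base letter + combining character(s)
        else:
            out.append(parts[0])    # (3) base letter only
    return "".join(out)
```

Given a set that contains a-with-ring, both input forms come out as the
single precomposed character; given a set with only `a' and the
combining ring they come out decomposed; with plain ASCII they come out
as `a'.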
Words With Spaces or other Symbols in Them
==========================================
Many languages, including English, have words with non-letter symbols in
them, for example the apostrophe. These symbols generally appear in
the middle of a word, but they can also appear at the end, such as in an
abbreviation. If a symbol can _only_ appear as part of a word then
Aspell can treat it as if it were a letter.
However, the problem is most of these symbols have other uses. For
example, the apostrophe is often used as a single quote and the
abbreviation marker is also used as a period. Thus, Aspell cannot
blindly treat them as if they were letters.
Aspell currently handles the case where the symbol can only appear in
the middle of the word fairly well. It simply assumes that if there is
a letter both before and after the symbol then it is part of the word.
This works most of the time but it is not foolproof. For example,
suppose the user forgot to leave a space after the period:
... and the dog went up the tree.Then the cat ...
Aspell would think "tree.Then" is one word. A better solution might be
to then try to check "tree" and "Then" separately. But what if one of
them is not in the dictionary? Should Aspell assume "tree.Then" is one
word?
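The "letter on both sides" rule, together with one possible answer to
the question above, can be sketched in Python. The regular expression,
the function names, and the decision to report the whole token when any
piece is unknown are all assumptions of this example, not a description
of Aspell's tokenizer.

```python
import re

# A word is a run of letters, optionally joined by a single apostrophe or
# period that has a letter on both sides.
WORD = re.compile(r"[^\W\d_]+(?:['.][^\W\d_]+)*")


def tokenize(text):
    """Split text into words using the letter-on-both-sides rule."""
    return WORD.findall(text)


def check(token, dictionary):
    """Return the misspelled tokens found in one token (empty if fine)."""
    if token.lower() in dictionary:
        return []
    pieces = re.split(r"['.]", token)
    bad = [p for p in pieces if p.lower() not in dictionary]
    # If some piece is unknown, report the whole token. This is just one
    # possible policy for the open question in the text.
    return [token] if bad else []
```

With this rule "tree.Then" is still extracted as one token, but it is
accepted when both "tree" and "then" are in the dictionary.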
The case where the symbol can appear at the beginning or end of the
word is more difficult to deal with. The symbol may or may not
actually be part of the word. Aspell currently handles this case by
first trying to spell check the word with the symbol and if that fails,
try it without. The problem is, if the word is misspelled, should
Aspell assume the symbol belongs with the word or not? Currently
Aspell assumes it does, which is not always the correct thing to do.
Numbers in words present a different challenge to Aspell. If Aspell
treats numbers as letters then every possible number a user might write
in a document must be specified in the dictionary. This could
easily be solved by having special code to assume all numbers are
correctly spelled. But what about something like "4th"? Since the
"th" suffix can appear after any number we are left with the same
problem. The solution would be to have a special symbol for "any
number".
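One way to picture the "any number" symbol is the Python sketch below,
where every run of digits is replaced by a placeholder before the
dictionary lookup. The placeholder `#' and the function name are
hypothetical choices for this example.

```python
import re

ANY_NUMBER = "#"   # hypothetical placeholder meaning "any number"


def number_key(word):
    """Replace each run of digits with the placeholder before lookup."""
    return re.sub(r"\d+", ANY_NUMBER, word)


# The dictionary then only needs one entry per suffix.
dictionary = {"#", "#st", "#nd", "#rd", "#th"}
```

Note that this simple version would also accept "1th", so the suffix
would still need to be checked against the actual digits if that level
of accuracy is wanted.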
Words with spaces in them, such as foreign phrases, are even more
trouble to deal with. The basic problem is that when tokenizing a
string there is no good way to keep phrases together. One solution is to
use trial and error. If a word is not in the dictionary try grouping it
with the previous or next word and see if the combined word is in the
dictionary. But what if the combined word is not, should the misspelled
word be grouped when looking for suggestions? One solution is to also
store each part of the phrase in the dictionary, but tag it as part of a
phrase and not an independent word.
To further complicate things, most applications that use spell
checkers are accustomed to parsing the document themselves and sending it
to the spell checker a word at a time. In order to support words with
spaces in them a more complicated interface will be required.
Notes on 8-bit Characters
*************************
There is a very good reason I use 8-bit characters in Aspell: speed and
simplicity. While many parts of my code can fairly easily be
converted to some sort of wide character, as my code is clean, other
parts cannot be.
One of the reasons is that in many, many places I use a direct lookup
to find out various information about characters. With 8-bit characters
this is very feasible because there are only 256 of them. With 16-bit
wide characters this will waste a LOT of space. With 32-bit characters
this is just plain impossible. Converting the lookup tables to some
other form, while certainly possible, will degrade performance
significantly.
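For readers unfamiliar with the technique, a direct lookup table can be
sketched in Python as below. The flag names and the choice of
ISO-8859-1 are assumptions for the example; Aspell's real tables are
written in C++ and hold different information.

```python
# One byte of properties per character, indexed directly by character code.
IS_LETTER = 0x01
IS_UPPER = 0x02

table = bytearray(256)
for code in range(256):
    ch = bytes([code]).decode("iso-8859-1")
    if ch.isalpha():
        table[code] |= IS_LETTER
    if ch.isupper():
        table[code] |= IS_UPPER


def is_letter(byte):
    """Constant-time lookup: a single index into the table."""
    return bool(table[byte] & IS_LETTER)


# Number of entries the same table needs for wider characters.
sizes = {bits: 2 ** bits for bits in (8, 16, 32)}
```

At one byte per entry the table takes 256 bytes for 8-bit characters,
64 KiB for 16-bit characters, and 4 GiB for 32-bit characters, and that
is for each such table.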
Furthermore, some of my algorithms rely on words consisting only of
a small number of distinct characters (often around 30 when case and
accents are not considered). When a word can consist of
any Unicode character this number becomes several thousand, if not more.
In order for these algorithms to still be used some sort of limit will
need to be placed on the possible characters the word can contain. If I
impose that limit, I might as well use some sort of 8-bit character
set which will automatically place the limit on what the characters can
be.
There is also the issue of how I should store the word lists in
memory. As a string of 32-bit wide characters? Now that is using up 4
times more memory than 8-bit characters would and for languages that can
fit within an 8-bit character set that is, in my view, a gross waste of
memory. So maybe I should store them in some variable width format such as
UTF-8. Unfortunately, way, way too many of my algorithms will simply
not work with variable width characters without significant
modification which will very likely degrade performance. So the
solution is to work with the characters as 32-bit wide characters and
then convert them to a shorter representation when storing them in the
lookup tables. Now that can lead to an inefficiency. I could also use
16-bit wide characters; however, that may not be good enough to hold all
of future versions of Unicode and it has the same problems.
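The size difference is easy to measure. The Python sketch below encodes
a tiny, made-up word list three ways; the words and the helper name are
only for illustration.

```python
words = ["caf\u00e9", "na\u00efve", "stra\u00dfe", "word"]


def stored_size(words, encoding):
    """Total bytes needed to store the words back to back in one encoding."""
    return sum(len(w.encode(encoding)) for w in words)


eight_bit = stored_size(words, "iso-8859-1")   # one byte per character
utf8 = stored_size(words, "utf-8")             # variable width
utf32 = stored_size(words, "utf-32-le")        # four bytes per character
```

For this list the 32-bit form is exactly four times the size of the
8-bit form, while UTF-8 sits in between but no longer has one byte per
character.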
As a response to the space waste used by storing word lists in some
sort of wide format some one asked:
Since hard drive are cheaper and cheaper, you could store
dictionary in a usable (uncompressed) form and use it directly
with memory mapping. Then the efficiency would directly depend on
the disk caching method, and only the used part of the
dictionaries would really be loaded into memory. You would no more
have to load plain dictionaries into main memory, you'll just want
to compute some indexes (or something like that) after mapping.
However, the fact of the matter is that most of the dictionary will
be read into memory anyway if it is available. If it is not available
then there would be a good deal of disk swaps. Making characters 32-bit
wide will increase the chance that there are more disk swaps. So the
bottom line is that it will be cheaper to convert the characters from
something like UTF-8 into some sort of wide character. I could also use
some sort of disk-based lookup table such as the Berkeley Database.
However this will *definitely* degrade performance.
The bottom line is that keeping Aspell 8-bit internally is a very
well thought out decision that is not likely to change any time soon.
Feel free to challenge me on it, but don't expect me to change my mind
unless you can bring up some point that I have not thought of before
and quite possibly a patch to cleanly convert Aspell to Unicode
internally without a serious performance loss OR serious memory usage
increase.
--
http://kevin.atkinson.dhs.org