[utf-8] Language Info Needed for GNU Aspell

Kevin Atkinson kevina at gnu.org
Tue Mar 23 12:43:30 PST 2004


[Please distribute this document as widely as possible.]

GNU Aspell 0.60 should be able to support most of the world's languages.
This includes languages written in Arabic and other scripts
not well supported by an existing 8-bit character set.  Eventually
Aspell should be able to support any current language not based on the
Chinese writing system.

GNU Aspell is a spell checker designed to eventually replace Ispell.
Its main feature is that it does a much better job of coming up with
possible suggestions than just about any other spell checker out there
for the English language, including Ispell and Microsoft Word.
However, starting with Aspell 0.60 it should also be the only Free (as
in Freedom) spell checker that can support most languages not written
in the Latin or Cyrillic scripts.

However, I, the author of Aspell, know very little about foreign
languages (i.e. non-English) and what it takes to correctly spell check
them.  Thus, I need other people to educate me.

If you speak a foreign language I would appreciate it if you would take
the time to look over the following material and email me with any
additional information you may have.

The first part gives a thorough analysis of the languages which Aspell
can and cannot support.  If you find any of this information is
incorrect please inform me at kevina at gnu.org.

When Aspell 0.60 is released I would like to have dictionaries
available for as many languages as possible.

Therefore, if you know of a Free word list available for a language that
is not currently listed as having a dictionary available I would
appreciate hearing from you.  I am especially interested in working
with someone to add support for languages written in the Arabic
script.  The encoding of Arabic is quite complicated and I want to
be sure that Aspell can correctly handle it.

I would also appreciate some help converting Ispell dictionaries to
Aspell.  So, if you would like to help convert some of the dictionaries
listed as being available for Ispell please contact me.

The second part lists language-related issues involved in correctly
spell checking a document.  If you can offer any additional insight on
any of the issues discussed, or know of any additional complications
when spell checking a given language, I would appreciate hearing from
you.

For your reading pleasure, the last part discusses why Aspell uses
8-bit characters internally.

All of this material is also included in the Aspell 0.60 manual which
you can find at http://aspell.net/devel-doc/man.


Languages Which Aspell can Support
**********************************

Even though Aspell will remain 8-bit internally it should still be
able to support any written language not based on a logographic
script.  The only logographic writing systems in current use are those
based on hànzi, which include Chinese, Japanese, and sometimes Korean.

Supported
=========

Aspell 0.60 should be able to support the following languages as, to the
best of my knowledge, they all contain 220 or fewer symbols and can
thus fit within an 8-bit character set.  If an existing character set
does not exist then a new one can be invented.  This is true even if the
script is not yet supported by Unicode as the private use area can be
used.

Code   Language Name             Script            Dictionary   Gettext
                                                   Available    Translation

aa     Afar                      Latin             -            -
ab     Abkhazian                 Cyrillic          -            -
ae     Avestan                   Avestan           -            -
af     Afrikaans                 Latin             Yes          -
ak     Akan                      Latin             -            -
an     Aragonese                 Latin             -            -
ar     Arabic                    Arabic            -            -
as     Assamese                  Bengali           -            -
av     Avar                      Cyrillic          -            -
ay     Aymara                    Latin             -            -
az     Azerbaijani               Cyrillic          -            -
az                               Latin             -            -

ba     Bashkir                   Cyrillic          -            -
be     Belarusian                Cyrillic          Planned      Yes
bg     Bulgarian                 Cyrillic          Yes          -
bh     Bihari                    Devanagari        -            -
bi     Bislama                   Latin             -            -
bm     Bambara                   Latin             -            -
bn     Bengali                   Bengali           Planned      -
bo     Tibetan                   Tibetan           -            -
br     Breton                    Latin             Yes          -
bs     Bosnian                   Latin             -            -

ca     Catalan/Valencian         Latin             Yes          -
ce     Chechen                   Cyrillic          -            -
ch     Chamorro                  Latin             -            -
co     Corsican                  Latin             -            -
cr     Cree                      Latin             -            -
cs     Czech                     Latin             Yes          Yes
cu     Old Slavonic              Cyrillic          -            -
cv     Chuvash                   Cyrillic          -            -
cy     Welsh                     Latin             Yes          -

da     Danish                    Latin             Yes          -
de     German                    Latin             Yes          Yes
dv     Divehi                    Thaana            -            -
dz     Dzongkha                  Tibetan           -            -

ee     Ewe                       Latin             -            -
el     Greek                     Greek             Yes          -
en     English                   Latin             Yes          Yes
eo     Esperanto                 Latin             Yes          -
es     Spanish                   Latin             Yes          Incomplete
et     Estonian                  Latin             Planned      -
eu     Basque                    Latin             -            -

fa     Persian                   Arabic            -            -
ff     Fulah                     Latin             -            -
fi     Finnish                   Latin             Planned      -
fj     Fijian                    Latin             -            -
fo     Faroese                   Latin             Yes          -
fr     French                    Latin             Yes          Yes
fy     Frisian                   Latin             -            -

ga     Irish                     Latin             Yes          Yes
gd     Scottish Gaelic           Latin             Planned      -
gl     Gallegan                  Latin             Yes          -
gn     Guarani                   Latin             -            -
gu     Gujarati                  Gujarati          -            -
gv     Manx                      Latin             Planned      -

ha     Hausa                     Latin             -            -
he     Hebrew                    Hebrew            Planned      -
hi     Hindi                     Devanagari        -            -
ho     Hiri Motu                 Latin             -            -
hr     Croatian                  Latin             Yes          -
ht     Haitian Creole            Latin             -            -
hu     Hungarian                 Latin             Planned      -
hy     Armenian                  Armenian          -            -
hz     Herero                    Latin             -            -

ia     Interlingua (IALA)        Latin             Yes          -
id     Indonesian                Latin             Yes          -
ie     Interlingue               Latin             -            -
ig     Igbo                      Latin             -            -
ik     Inupiaq                   Latin             -            -
io     Ido                       Latin             -            -
is     Icelandic                 Latin             Yes          -
it     Italian                   Latin             Yes          -
iu     Inuktitut                 Latin             -            -

jv     Javanese                  Javanese          -            -
jv                               Latin             -            -

ka     Georgian                  Georgian          -            -
kg     Kongo                     Latin             -            -
ki     Kikuyu/Gikuyu             Latin             -            -
kj     Kwanyama                  Latin             -            -
kk     Kazakh                    Cyrillic          -            -
kl     Kalaallisut/Greenlandic   Latin             -            -
kn     Kannada                   Kannada           -            -
ko     Korean                    Hangeul           -            -
kr     Kanuri                    Latin             -            -
ks     Kashmiri                  Arabic            -            -
ks                               Devanagari        -            -
ku     Kurdish                   Arabic            -            -
ku                               Cyrillic          -            -
ku                               Latin             -            -
kv     Komi                      Cyrillic          -            -
kw     Cornish                   Latin             -            -
ky     Kirghiz                   Cyrillic          -            -
ky                               Latin             -            -

la     Latin                     Latin             -            -
lb     Luxembourgish             Latin             Planned      -
lg     Ganda                     Latin             -            -
li     Limburgan                 Latin             -            -
ln     Lingala                   Latin             -            -
lo     Lao                       Lao               -            -
lt     Lithuanian                Latin             Planned      -
lu     Luba-Katanga              Latin             -            -
lv     Latvian                   Latin             -            -

mg     Malagasy                  Latin             -            -
mh     Marshallese               Latin             -            -
mi     Maori                     Latin             Yes          -
mk     Makasar                   Lontara/Makasar   -            -
ml     Malayalam                 Latin             -            -
ml                               Malayalam         -            -
mn     Mongolian                 Cyrillic          -            -
mn                               Mongolian         -            -
mo     Moldavian                 Cyrillic          -            -
mr     Marathi                   Devanagari        -            -
ms     Malay                     Latin             Yes          -
mt     Maltese                   Latin             Planned      -
my     Burmese                   Myanmar           -            -

na     Nauruan                   Latin             -            -
nb     Norwegian Bokmal          Latin             Yes          -
nd     North Ndebele             Latin             -            -
ne     Nepali                    Devanagari        -            -
ng     Ndonga                    Latin             -            -
nl     Dutch                     Latin             Yes          Yes
nn     Norwegian Nynorsk         Latin             Yes          -
nr     South Ndebele             Latin             -            -
nv     Navajo                    Latin             -            -
ny     Nyanja                    Latin             -            -

oc     Occitan/Provencal         Latin             -            -
or     Oriya                     Oriya             -            -
os     Ossetic                   Cyrillic          -            -

pa     Punjabi                   Gurmukhi          -            -
pi     Pali                      Devanagari        -            -
pi                               Sinhala           -            -
pl     Polish                    Latin             Yes          -
ps     Pushto                    Arabic            -            -
pt     Portuguese                Latin             Yes          Yes

qu     Quechua                   Latin             -            -

rm     Raeto-Romance             Latin             -            -
rn     Rundi                     Latin             -            -
ro     Romanian                  Latin             Yes          Yes
ru     Russian                   Cyrillic          Yes          Yes
rw     Kinyarwanda               Latin             -            -

sa     Sanskrit                  Devanagari        -            -
sc     Sardinian                 Latin             -            -
sd     Sindhi                    Arabic            -            -
se     Northern Sami             Latin             -            -
sg     Sango                     Latin             -            -
si     Sinhalese                 Sinhala           -            -
sk     Slovak                    Latin             Yes          -
sl     Slovenian                 Latin             Yes          -
sm     Samoan                    Latin             -            -
sn     Shona                     Latin             -            -
so     Somali                    Latin             -            -
sq     Albanian                  Latin             Planned      -
sr     Serbian                   Cyrillic          -            Yes
sr                               Latin             -            -
ss     Swati                     Latin             -            -
st     Southern Sotho            Latin             -            -
su     Sundanese                 Latin             -            -
sv     Swedish                   Latin             Yes          -
sw     Swahili                   Latin             Planned      -

ta     Tamil                     Tamil             Planned      -
te     Telugu                    Telugu            -            -
tg     Tajik                     Latin             -            -
tk     Turkmen                   Latin             -            -
tl     Tagalog                   Latin             -            -
tl                               Tagalog           -            -
tn     Tswana                    Latin             -            -
to     Tonga                     Latin             -            -
tr     Turkish                   Latin             -            -
ts     Tsonga                    Latin             -            -
tt     Tatar                     Cyrillic          -            -
tw     Twi                       Latin             -            -
ty     Tahitian                  Latin             -            -

ug     Uighur                    Arabic            -            -
ug                               Cyrillic          -            -
ug                               Latin             -            -
uk     Ukrainian                 Cyrillic          Yes          -
ur     Urdu                      Arabic            -            -
uz     Uzbek                     Cyrillic          -            -
uz                               Latin             -            -

ve     Venda                     Latin             -            -
vi     Vietnamese                Latin             -            -
vo     Volapuk                   Latin             -            -

wa     Walloon                   Latin             Planned      Incomplete
wo     Wolof                     Latin             -            -

xh     Xhosa                     Latin             -            -

yi     Yiddish                   Hebrew            -            -
yo     Yoruba                    Latin             -            -

za     Zhuang                    Latin             -            -
zu     Zulu                      Latin             Planned      -

Notes on Latin Languages
------------------------

Any word that can be written using one of the Latin ISO-8859 character
sets (ISO-8859-1,2,3,4,9,10,13,14,15,16) can be written, in decomposed
form, using the ASCII characters, the 23 additional letters:

     U+00C6 LATIN CAPITAL LETTER AE
     U+00D0 LATIN CAPITAL LETTER ETH
     U+00D8 LATIN CAPITAL LETTER O WITH STROKE
     U+00DE LATIN CAPITAL LETTER THORN
     U+00DF LATIN SMALL LETTER SHARP S
     U+00E6 LATIN SMALL LETTER AE
     U+00F0 LATIN SMALL LETTER ETH
     U+00F8 LATIN SMALL LETTER O WITH STROKE
     U+00FE LATIN SMALL LETTER THORN
     U+0110 LATIN CAPITAL LETTER D WITH STROKE
     U+0111 LATIN SMALL LETTER D WITH STROKE
     U+0126 LATIN CAPITAL LETTER H WITH STROKE
     U+0127 LATIN SMALL LETTER H WITH STROKE
     U+0131 LATIN SMALL LETTER DOTLESS I
     U+0138 LATIN SMALL LETTER KRA
     U+0141 LATIN CAPITAL LETTER L WITH STROKE
     U+0142 LATIN SMALL LETTER L WITH STROKE
     U+014A LATIN CAPITAL LETTER ENG
     U+014B LATIN SMALL LETTER ENG
     U+0152 LATIN CAPITAL LIGATURE OE
     U+0153 LATIN SMALL LIGATURE OE
     U+0166 LATIN CAPITAL LETTER T WITH STROKE
     U+0167 LATIN SMALL LETTER T WITH STROKE

   and the 14 modifiers:

     U+0300 COMBINING GRAVE ACCENT
     U+0301 COMBINING ACUTE ACCENT
     U+0302 COMBINING CIRCUMFLEX ACCENT
     U+0303 COMBINING TILDE
     U+0304 COMBINING MACRON
     U+0306 COMBINING BREVE
     U+0307 COMBINING DOT ABOVE
     U+0308 COMBINING DIAERESIS
     U+030A COMBINING RING ABOVE
     U+030B COMBINING DOUBLE ACUTE ACCENT
     U+030C COMBINING CARON
     U+0326 COMBINING COMMA BELOW
     U+0327 COMBINING CEDILLA
     U+0328 COMBINING OGONEK

   Which is a total of 37 additional Unicode code points.

   All ISO-8859 character sets leave the characters 0x00 - 0x1F and
0x80 - 0x9F unmapped as they are generally used as control characters.
Of those, 0x02 - 0x1F and 0x80 - 0x9F may be mapped to anything in
Aspell.  This is a total of 62 characters which can be remapped in any
ISO-8859 character set.  Thus, by remapping 37 of the 62 characters to
the previously specified Unicode code points, any modified ISO-8859
character set can be used for any Latin language covered by ISO-8859.
Of course decomposing every single accented character wastes a lot of
space, so only characters that cannot be represented in the precomposed
form should be broken up.  By using this trick it is possible to store
foreign words in the correctly accented form in the dictionary even if
the precomposed character is not in the current character set.
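
   The remapping can be illustrated with a short Python sketch.  This
is not Aspell code: the names `build_charset' and `encode', and the
order in which free slots are handed out, are assumptions made purely
for the example.  It keeps a character precomposed when the base
character set already contains it and decomposes it otherwise.

```python
import unicodedata

# The 23 additional letters and 14 combining modifiers listed above.
EXTRA_LETTERS = [
    0x00C6, 0x00D0, 0x00D8, 0x00DE, 0x00DF, 0x00E6, 0x00F0, 0x00F8,
    0x00FE, 0x0110, 0x0111, 0x0126, 0x0127, 0x0131, 0x0138, 0x0141,
    0x0142, 0x014A, 0x014B, 0x0152, 0x0153, 0x0166, 0x0167,
]
MODIFIERS = [
    0x0300, 0x0301, 0x0302, 0x0303, 0x0304, 0x0306, 0x0307,
    0x0308, 0x030A, 0x030B, 0x030C, 0x0326, 0x0327, 0x0328,
]
# The 62 control-character positions that may be remapped.
FREE_SLOTS = list(range(0x02, 0x20)) + list(range(0x80, 0xA0))


def build_charset(base="iso8859_1"):
    """Return a char -> byte table for a modified ISO-8859 set."""
    table = {chr(b): b for b in range(0x20, 0x7F)}  # printable ASCII
    for b in range(0xA0, 0x100):
        try:
            table[bytes([b]).decode(base)] = b
        except UnicodeDecodeError:
            pass  # position unassigned in this ISO-8859 part
    slots = iter(FREE_SLOTS)
    for cp in EXTRA_LETTERS + MODIFIERS:
        if chr(cp) not in table:
            # Slot assignment order is arbitrary (an assumption).
            table[chr(cp)] = next(slots)
    return table


def encode(word, table):
    """Encode a word, decomposing only where necessary."""
    out = bytearray()
    for ch in unicodedata.normalize("NFC", word):
        parts = ch if ch in table else unicodedata.normalize("NFD", ch)
        for part in parts:
            out.append(table[part])  # KeyError: not representable
    return bytes(out)
```

   With ISO-8859-1 as the base, `encode("Łódź", build_charset())'
yields five bytes: a remapped slot for the L with stroke, the ordinary
Latin-1 byte for the precomposed o with acute, and `dz' followed by a
remapped slot for the combining acute accent.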

   Any letter in the Unicode ranges U+0000 - U+024F and U+1E00 - U+1EFF
(Basic Latin, Latin-1 Supplement, Latin Extended-A, Latin Extended-B,
and Latin Extended Additional) can be represented using around 175
basic letters and 25 modifiers, which is less than 220 and can thus fit
in an Aspell 8-bit character set.  Since this Unicode range covers any
possible Latin language this special character set can be used to
represent any word written using the Latin script if so desired.

Hangeul
-------

Korean is generally written in hangeul or a mixture of hanja and
hangeul.  Aspell should be able to spell check the hangeul part of the
writing.  In hangeul, individual letters, known as jamo, are grouped
together in syllable blocks.  Unicode provides code points for both jamo
and the combined syllable blocks.  The syllable blocks will need to be
decomposed into jamo in order for Aspell to spell check them.
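
   The decomposition itself is purely arithmetic.  The following Python
sketch uses the constants of the standard Unicode Hangul syllable
decomposition; the function name `decompose_hangul' is only an
illustration and is not part of Aspell.

```python
# Constants of the Unicode Hangul syllable decomposition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = 19 * N_COUNT        # 11172 precomposed syllable blocks


def decompose_hangul(text):
    """Replace each precomposed syllable block with its jamo."""
    out = []
    for ch in text:
        index = ord(ch) - S_BASE
        if not 0 <= index < S_COUNT:
            out.append(ch)  # not a syllable block, keep unchanged
            continue
        out.append(chr(L_BASE + index // N_COUNT))              # leading consonant
        out.append(chr(V_BASE + (index % N_COUNT) // T_COUNT))  # vowel
        trailing = index % T_COUNT
        if trailing:                                            # optional final consonant
            out.append(chr(T_BASE + trailing))
    return "".join(out)
```

   The result is the same sequence of conjoining jamo that canonical
decomposition (NFD) produces, so a normalization library could be used
instead of hand-written arithmetic.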

Syllabic
========

Syllabic languages use a separate symbol for each syllable of the
language.  Since most of them have more than 220 distinct characters
Aspell cannot support them as is.  However, all hope is not lost as
Aspell will most likely be able to support them in the future.

Code   Language Name   Script
am     Amharic         Ethiopic
cr     Cree            Canadian Syllabics
ii     Sichuan Yi      Yi
iu     Inuktitut       Canadian Syllabics
oj     Ojibwa          Ojibwe
om     Oromo           Ethiopic
ti     Tigrinya        Ethiopic

The Ethiopic Syllabary
----------------------

Even though the Ethiopic script has more than 220 distinct characters,
with a little work Aspell can still handle it.  The idea is to split
each character into two parts based on the matrix representation.  The
low 3 bits will be the first part and could be mapped to `10000???'.
The next 6 bits will be the second part and could be mapped to
`11??????'.  The combined character will then be stored with the upper
bits coming first.  Thus each Ethiopic syllable will have the form
`11?????? 10000???'.  By mapping the first and second parts to separate
8-bit characters it is easy to tell which part represents the consonant
and which part represents the vowel of the syllable.  This encoding of
the syllabary is far more useful to Aspell than if it were stored in
UTF-8 or UTF-16.  In fact, the existing suggestion strategy of Aspell
will work well with this encoding without any additional
modifications.  However, additional improvements may be possible by
taking advantage of the consonant-vowel structure of this encoding.

   In fact, the split consonant-vowel representation may prove to be so
useful that it may be beneficial to encode other syllabaries in this
fashion, even if they have fewer than 220 symbols.

   The code to break up a syllabary into the consonant-vowel parts does
not exist as of Aspell 0.60.  However, it will be fairly easy to add
it as part of the Unicode normalization process once that is written.
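
   As a rough illustration of the split, here is a Python sketch.  It
assumes that the matrix is the one given by the Unicode Ethiopic block
starting at U+1200, with eight columns per row, so that the low 3 bits
of the offset select the column and the next 6 bits select the row.
The function names are hypothetical and this is not code from Aspell.

```python
ETHIOPIC_BASE = 0x1200  # assumed origin of the syllable matrix


def split_syllable(ch):
    """Map one Ethiopic character to the two bytes 11?????? 10000???."""
    offset = ord(ch) - ETHIOPIC_BASE
    if not 0 <= offset < 0x200:
        raise ValueError("outside the 9-bit range handled by this sketch")
    upper = 0b11000000 | (offset >> 3)     # 6 bits: matrix row
    lower = 0b10000000 | (offset & 0b111)  # 3 bits: matrix column
    return bytes([upper, lower])


def join_syllable(pair):
    """Inverse of split_syllable."""
    upper, lower = pair
    return chr(ETHIOPIC_BASE + ((upper & 0x3F) << 3) + (lower & 0x07))
```

   Two syllables from the same row of the matrix share their first
byte and differ only in the second, which is what lets an edit-distance
based suggestion strategy treat a wrong vowel as a single-character
substitution.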

The Yi Syllabary
----------------

A very large syllabary with 819 distinct symbols.  However, like
Ethiopic, it should be possible to support this script by breaking it
up.

The Unified Canadian Aboriginal Syllabics
-----------------------------------------

Another very large syllabary.

The Ojibwe Syllabary
--------------------

With only 120 distinct symbols, Aspell can actually support this one as
is.  However, as previously mentioned, it may be beneficial to break it
up into the consonant-vowel representation anyway.

Unsupported
===========

These languages, when written in the given script, are currently
unsupported by Aspell for one reason or another.

Code   Language Name   Script
ja     Japanese        Japanese
km     Khmer           Khmer
ko     Korean          Hanja + Hangeul
pi     Pali            Thai
th     Thai            Thai
zh     Chinese         Hànzi

The Thai and Khmer Scripts
--------------------------

The Thai and Khmer scripts present a different problem for Aspell.  The
problem is not that there are more than 220 unique symbols, but that
there are no spaces between words.  This means that there is no easy way
to split a sentence into individual words.  However, it is still
possible to spell check these scripts; it is just a lot more difficult.
I will be happy to work with someone who is interested in adding Thai
or Khmer support to Aspell, but it is not likely something I will do in
the foreseeable future.

Languages which use Hànzi Characters
------------------------------------

Hànzi characters are used to write Chinese, Japanese, and Korean, and
were once used to write Vietnamese.  Each hànzi character represents a
syllable of a spoken word and also has a meaning.  Since there are
around 3,000 of them in common usage it is unlikely that Aspell will
ever be able to support spell checking languages written using hànzi.
However, I am not even sure if these languages need spell checking since
hànzi characters are generally not entered directly.  Furthermore,
even if Aspell could spell check hànzi the existing suggestion strategy
will not work well at all, and thus a completely new strategy will need
to be developed.

Japanese
--------

Modern Japanese is written in a mixture of "hiragana", "katakana",
"kanji", and sometimes "romaji".  "Hiragana" and "katakana" are both
syllabaries unique to Japan, "kanji" is a modified form of hànzi, and
"romaji" uses the Latin alphabet.  With some work, Aspell should be
able to check the non-kanji part of Japanese text.  However, based on
my limited understanding of Japanese, hiragana is often used at the end
of kanji.  Thus if Aspell were to simply separate out the hiragana from
kanji it would end up with a lot of word endings which are not proper
words and will thus be flagged as misspellings.  However, this can be
fairly easily rectified as text is tokenized into words before it is
converted into Aspell's internal encoding.  In fact, some Japanese text
is written entirely in one script.  For example, books for children
and foreigners are sometimes written entirely in hiragana.  Thus,
Aspell could prove at least somewhat useful for spell checking Japanese.

Languages Written in Multiple Scripts
=====================================

Aspell should be able to check text written in the same language, but in
multiple scripts, with some work.  If the number of unique symbols in
both scripts is less than 220 then a special character set can be used
to allow both scripts to be encoded in the same dictionary.  However,
this may not be the most efficient solution.  An alternate solution is
to store each script in its own dictionary and allow Aspell to choose
the correct dictionary based on which script the given word is written
in.  Aspell currently does not support this mode of spell checking;
however, it is something that I hope to eventually support.
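
   One way the per-script dictionary choice could work is sketched
below in Python.  This is only an illustration, not an Aspell feature:
the helper names are hypothetical, and guessing the script from the
first word of each character's Unicode name is a shortcut chosen for
brevity.

```python
import unicodedata


def script_of(word):
    """Guess a word's script from the Unicode names of its letters.

    Returns None when the letters do not all share one script.
    """
    scripts = set()
    for ch in word:
        if ch.isalpha():
            # e.g. "CYRILLIC SMALL LETTER ES" -> "CYRILLIC"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts.pop() if len(scripts) == 1 else None


def pick_dictionary(word, dictionaries):
    """Select the dictionary registered for the word's script, if any."""
    return dictionaries.get(script_of(word))
```

   For Serbian, for instance, `dictionaries' could map "CYRILLIC" and
"LATIN" to two separate word lists, and a word mixing both scripts
would match neither.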

Notes on Planned Dictionaries
=============================

be   Belarusian     Ispell Dictionary available
bn   Bengali        Unofficial Aspell Dictionary available
                    `http://www.bengalinux.org/downloads/'
et   Estonian       Ispell Dictionary available
fi   Finnish        Ispell Dictionary available
gd   Scottish       Ispell Dictionary available.
     Gaelic         `http://packages.debian.org/unstable/text/igaelic'
gv   Manx           Ispell Dictionary available.
                    `http://packages.debian.org/unstable/text/imanx'
he   Hebrew         Ispell Dictionary available
hu   Hungarian      MySpell dictionary expanded to over 500 MB.  Will add
                    once affix support is worked into the dictionary
                    package system.
lb   Luxembourgish  MySpell dictionary planned.
lt   Lithuanian     MySpell dictionary expanded to over 500 MB.  Will add
                    once affix support is worked into the dictionary
                    package system.
mt   Maltese        Unofficial Aspell Dictionary available, but broken
                    link to source.
                    `http://linux.org.mt/article/spellcheck'
sq   Albanian       Ispell Dictionary available
sw   Swahili        Available at
                    `http://sourceforge.net/projects/translate'.  Official
                    version coming soon.
ta   Tamil          Word list available at
                    `http://www.developer.thamizha.com/spellchecker/index.html'.
                    Working with them to create an Aspell dictionary.
wa   Walloon        Ispell Dictionary available
zu   Zulu           Available at
                    `http://sourceforge.net/projects/translate'.  Official
                    version coming soon.

References
==========

The information in this chapter was gathered from numerous sources,
including:

   * ISO 639-2 Registration Authority,
     `http://www.loc.gov/standards/iso639-2/'

   * Languages and Scripts (Official Unicode Site),
     `http://www.unicode.org/onlinedat/languages-scripts.html'

   * Omniglot - a guide to written language, `http://www.omniglot.com/'

   * Wikipedia - The Free Encyclopedia, `http://wikipedia.org/'

   * Ethnologue - Languages of the World, `http://www.ethnologue.com/'

   * World Languages - The Ultimate Language Store,
     `http://www.worldlanguage.com/'

   * South African Languages Web, `http://www.languages.web.za/'

   * The Languages and Writing Systems of Africa (Global Advisor
     Newsletter), `http://www.intersolinc.com/newsletters/africa.htm'


   Special thanks go to Era Eriksson for helping me with the information
in this chapter.


Language Related Issues
***********************

Here are some language related issues that a good spell checker needs to
handle.  If you have any more information about any of these issues, or
of a new issue not discussed here, please email me at <kevina at gnu.org>.

German Sharp S
==============

The German Sharp S or Eszett does not have an uppercase equivalent.
Instead, `ß' becomes `SS' when converted to uppercase.  The conversion
of `ß' to `SS' requires a special rule, and increases the length of a
word, thus disallowing in-place case conversion.  Furthermore, my
general rule of converting all words to lowercase before looking them up
in the dictionary won't work because the conversion of `SS' to lowercase
is ambiguous; it can be `ss' or `ß'.  I do plan on dealing with this
eventually, however.
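
   The ambiguity can be made concrete with a small Python sketch.  One
possible approach, shown here only as an assumption and not as what
Aspell does, is to generate every lowercase form a word could stand for
and look each of them up.  The function names are hypothetical.

```python
def lowercase_candidates(word):
    """Return every lowercase spelling the word could stand for,
    treating each `ss' pair as either `ss' or `ß'."""
    lower = word.lower()

    def expand(i):
        if i == len(lower):
            return {""}
        # Keep the current letter as it is ...
        results = {lower[i] + rest for rest in expand(i + 1)}
        # ... or, at an `ss' pair, read the pair as a sharp s.
        if lower.startswith("ss", i):
            results |= {"ß" + rest for rest in expand(i + 2)}
        return results

    return expand(0)


def lookup(word, dictionary):
    """Return the candidate spellings that are in the dictionary."""
    return sorted(c for c in lowercase_candidates(word) if c in dictionary)
```

   For `MASSE' this yields both `masse' and `maße', two different
German words, so the spell checker cannot tell from the uppercase form
alone which one was meant.  The number of candidates doubles with every
`ss' pair, which is acceptable for ordinary words but is a cost to keep
in mind.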

Compound Words
==============

In some languages, such as German, it is acceptable to string two words
together, thus forming a compound word.  However, there are rules for
when this can be done.  Furthermore, it is not always sufficient to
simply concatenate the two words.  For example, sometimes a letter is
inserted between the two words.  I tried implementing support for
compound words in Aspell but it was too limiting and no one used it.
Before I try implementing it again I want to know all the issues
involved.

Context Sensitive Spelling
==========================

In some languages, such as Luxembourgish, the spelling of a word depends
on which words surround it.  For example, the letter `n' at the end
of a word will disappear if it is followed by another word starting
with a certain letter such as an `s'.  However, it can probably get
more complicated than that.  I would like to know how complicated before
I attempt to implement support for context sensitive spelling.

Unicode Normalization
=====================

Because Unicode contains a large number of precomposed characters there
are multiple ways a character can be represented.  For example, the
letter å can be represented either as

     U+00E5 LATIN SMALL LETTER A WITH RING ABOVE
or
     U+0061 LATIN SMALL LETTER A + U+030A COMBINING RING ABOVE

   By performing normalization first Aspell will only see one of these
representations.  The exact form of normalization depends on the
language.  Given the choice of

  1. Precomposed character

  2. Base letter + combining character(s)

  3. Base letter only

if the precomposed character is in the target character set then (1), if
both the base and combining characters are present then (2), otherwise (3).
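
   The three-way choice can be sketched in Python with the standard
`unicodedata' module.  This is an illustration only: the function names
are hypothetical, and the target character set is modelled simply as a
Python set of characters.

```python
import unicodedata


def clusters(text):
    """Yield each base letter together with its combining marks."""
    current = ""
    for ch in unicodedata.normalize("NFD", text):
        if current and not unicodedata.combining(ch):
            yield current
            current = ""
        current += ch
    if current:
        yield current


def normalize_for(text, charset):
    """Normalize text for a target character set using rules (1)-(3)."""
    out = []
    for cluster in clusters(text):
        composed = unicodedata.normalize("NFC", cluster)
        if len(composed) == 1 and composed in charset:
            out.append(composed)    # (1) precomposed character
        elif all(c in charset for c in cluster):
            out.append(cluster)     # (2) base letter + combining character(s)
        else:
            out.append(cluster[0])  # (3) base letter only
    return "".join(out)
```

   Whichever way the letter å arrives, the output depends only on the
target character set: it comes out as U+00E5 if that is available, as
`a' plus U+030A if only those two are, and as a plain `a' otherwise.
The sketch does not check that the base letter itself is in the target
set.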

Words With Spaces or other Symbols in Them
==========================================

Many languages, including English, have words with non-letter symbols in
them, for example the apostrophe.  These symbols generally appear in
the middle of a word, but they can also appear at the end, such as in an
abbreviation.  If a symbol can _only_ appear as part of a word then
Aspell can treat it as if it were a letter.

   However, the problem is most of these symbols have other uses.  For
example, the apostrophe is often used as a single quote and the
abbreviation marker is also used as a period.  Thus, Aspell cannot
blindly treat them as if they were letters.

   Aspell currently handles the case where the symbol can only appear in
the middle of the word fairly well.  It simply assumes that if there is
a letter both before and after the symbol then it is part of the word.
This works most of the time but it is not foolproof.  For example,
suppose the user forgot to leave a space after the period:

       ... and the dog went up the tree.Then the cat ...

Aspell would think "tree.Then" is one word.  A better solution might be
to then try to check "tree" and "Then" separately.  But what if one of
them is not in the dictionary?  Should Aspell assume "tree.Then" is one
word?
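
   The "letter on both sides" rule, together with one possible answer
to the question above, can be sketched in Python.  This is not Aspell's
tokenizer: the function names are hypothetical, and reporting only the
unknown parts is just one of the choices discussed.

```python
import re


def tokenize(text, middle_chars="'."):
    """Split text into words, keeping a middle symbol inside a word
    only when there is a letter both before and after it."""
    words, current = [], ""
    for i, ch in enumerate(text):
        next_is_letter = i + 1 < len(text) and text[i + 1].isalpha()
        if ch.isalpha():
            current += ch
        elif ch in middle_chars and current and next_is_letter:
            current += ch
        else:
            if current:
                words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


def misspelled_parts(token, dictionary, middle_chars="'."):
    """Check the whole token first, then its parts separately.

    Returns the parts that are not in the dictionary."""
    if token in dictionary:
        return []
    parts = re.split("[" + re.escape(middle_chars) + "]", token)
    return [p for p in parts if p not in dictionary]
```

   With this sketch `tree.Then' is one token, but it is accepted when
both `tree' and `Then' are known, and only the unknown half is reported
otherwise.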

   The case where the symbol can appear at the beginning or end of the
word is more difficult to deal with.  The symbol may or may not
actually be part of the word.  Aspell currently handles this case by
first trying to spell check the word with the symbol and, if that fails,
trying it without.  The problem is, if the word is misspelled, should
Aspell assume the symbol belongs with the word or not?  Currently
Aspell assumes it does, which is not always the correct thing to do.
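
   A minimal sketch of this "with the symbol first, then without"
strategy is given below.  The function name `check_with_edge_symbol`
and the use of a plain Python set as the dictionary are assumptions
made for the example.

```python
def check_with_edge_symbol(token, dictionary, edge_symbols="."):
    """Return (word_as_checked, is_correct) for a token that may start or
    end with a symbol such as an abbreviation marker."""
    if token in dictionary:
        return token, True                 # the symbol is part of the word
    stripped = token.strip(edge_symbols)
    if stripped != token and stripped in dictionary:
        return stripped, True              # the symbol was only punctuation
    # Misspelled either way: assume the symbol belongs with the word,
    # which is the questionable default described above.
    return token, False
```

   With a dictionary containing "etc." and "tree", the token "etc." is
accepted as is, "tree." is accepted as "tree", and "tre." is reported
as misspelled with the period still attached.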

   Numbers in words present a different challenge to Aspell.  If Aspell
treats numbers as letters then every possible number a user might write
in a document must be specified in the dictionary.  This could easily
be solved by having special code to assume all numbers are correctly
spelled.  But what about something like "4th"?  Since the "th" suffix
can appear after any number we are left with the same problem.  The
solution would be to have a special symbol for "any number".
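
   One way to picture an "any number" symbol is to collapse every run
of digits into a placeholder before the dictionary lookup.  The
placeholder "#" and the function names below are hypothetical choices
for this sketch, not an existing Aspell feature.

```python
import re

ANY_NUMBER = "#"  # hypothetical placeholder meaning "any number"

def number_key(word):
    """Replace each run of digits with the placeholder symbol."""
    return re.sub(r"[0-9]+", ANY_NUMBER, word)

def check(word, dictionary):
    """Look the word up after collapsing its numbers."""
    return number_key(word) in dictionary
```

   A dictionary holding only "#" and "#th" then accepts "1234", "4th"
and "100th" without listing any number explicitly.  Note that it would
accept "1th" as well, so the suffix rules still need care.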

   Words with spaces in them, such as foreign phrases, are even more
trouble to deal with.  The basic problem is that when tokenizing a
string there is no good way to keep phrases together.  One solution is to
use trial and error.  If a word is not in the dictionary try grouping it
with the previous or next word and see if the combined word is in the
dictionary.  But what if the combined word is not?  Should the misspelled
word be grouped when looking for suggestions?  One solution is to also
store each part of the phrase in the dictionary, but tag it as part of a
phrase and not an independent word.
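
   The trial and error approach could look roughly like the sketch
below.  It assumes the text has already been split on spaces and that
phrases are stored in the dictionary with a single space between their
parts; the function name `check_tokens` is made up for the example.

```python
def check_tokens(tokens, dictionary):
    """Check tokens one at a time, trying to group an unknown token with
    its next or previous neighbour.  Returns (text, is_correct) pairs."""
    results = []
    i = 0
    while i < len(tokens):
        word = tokens[i]
        if word in dictionary:
            results.append((word, True))
            i += 1
            continue
        # Unknown word: try grouping it with the next token.
        if i + 1 < len(tokens) and f"{word} {tokens[i + 1]}" in dictionary:
            results.append((f"{word} {tokens[i + 1]}", True))
            i += 2
            continue
        # Otherwise try grouping it with whatever was emitted just before.
        if results and f"{results[-1][0]} {word}" in dictionary:
            results[-1] = (f"{results[-1][0]} {word}", True)
            i += 1
            continue
        results.append((word, False))  # no grouping helped
        i += 1
    return results
```

   Given a dictionary containing "an", "ad", "ad hoc" and "committee",
the tokens of "an ad hoc committee" come back with "ad hoc" rejoined
into one entry.  The open question from the text remains: when no
grouping matches, the sketch simply reports each token on its own.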

   To further complicate things, most applications that use spell
checkers are accustomed to parsing the document themselves and sending it
to the spell checker a word at a time.  In order to support words with
spaces in them a more complicated interface will be required.


Notes on 8-bit Characters
*************************

There is a very good reason I use 8-bit characters in Aspell: speed and
simplicity. Because my code is clean, many parts of it can fairly
easily be converted to some sort of wide character. Other parts
cannot be.

   One of the reasons is that in many, many places I use a direct lookup
to find out various information about characters. With 8-bit characters
this is very feasible because there are only 256 of them. With 16-bit
wide characters this will waste a LOT of space. With 32-bit characters
this is just plain impossible. Converting the lookup tables to some
other form, while certainly possible, will degrade performance
significantly.
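
   The arithmetic behind this can be sketched as follows. The table
`IS_LETTER`, the choice of ISO-8859-1 as the 8-bit character set and
the one-byte-per-entry size are assumptions made for illustration.

```python
# A direct lookup table: one byte of information per character, indexed
# by the character code itself (codes interpreted here as ISO-8859-1).
IS_LETTER = bytes(1 if chr(code).isalpha() else 0 for code in range(256))

def table_bytes(bits_per_char, bytes_per_entry=1):
    """Size in bytes of one direct lookup table covering every code."""
    return (2 ** bits_per_char) * bytes_per_entry
```

   A lookup such as `IS_LETTER[0xE5]` is a single array index. One
such table costs 256 bytes for 8-bit characters, 65536 bytes for 16-bit
characters, and 4294967296 bytes (4 GiB) for 32-bit characters, and
that is for every table of this kind.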

   Furthermore, some of my algorithms rely on words consisting only of
a small number of distinct characters (often around 30 when case and
accents are not considered). When a word can consist of any Unicode
character this number becomes several thousand, if not more. In
order for these algorithms to still be used some sort of limit will
need to be placed on the possible characters the word can contain. If I
impose that limit, I might as well use some sort of 8-bit character
set which will automatically place the limit on what the characters can
be.

   There is also the issue of how I should store the word lists in
memory. As a string of 32-bit wide characters? That uses up 4
times more memory than 8-bit characters would, and for languages that
can fit within an 8-bit character set that is, in my view, a gross
waste of memory. So maybe I should store them in some variable width
format such as UTF-8. Unfortunately, way, way too many of my algorithms
will simply not work with variable width characters without significant
modification, which will very likely degrade performance. So the
solution is to work with the characters as 32-bit wide characters and
then convert them to a shorter representation when storing them in the
lookup tables. Now that can lead to an inefficiency. I could also use
16-bit wide characters; however, that may not be good enough to hold all
of future versions of Unicode and it has the same problems.
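
   The size difference is easy to measure with Python's built-in
codecs. The function name `storage_sizes` and the use of ISO-8859-1
("latin-1") as the 8-bit character set are assumptions made for this
sketch.

```python
def storage_sizes(words):
    """Bytes needed to store the words' characters in three encodings
    (terminators and any per-word overhead are not counted)."""
    return {
        "8-bit": sum(len(w.encode("latin-1")) for w in words),
        "utf-8": sum(len(w.encode("utf-8")) for w in words),
        # "utf-32-le" writes no byte order mark, so it is exactly
        # 4 bytes per character.
        "utf-32": sum(len(w.encode("utf-32-le")) for w in words),
    }
```

   For the two words "tree" and "går" this gives 7 bytes in the 8-bit
set, 8 bytes in UTF-8 (where å takes two bytes, which is what makes the
format variable width) and 28 bytes as 32-bit characters.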

   In response to the space wasted by storing word lists in some
sort of wide format someone asked:

     Since hard drives are cheaper and cheaper, you could store the
     dictionary in a usable (uncompressed) form and use it directly
     with memory mapping. Then the efficiency would directly depend on
     the disk caching method, and only the used part of the
     dictionaries would really be loaded into memory. You would no longer
     have to load plain dictionaries into main memory, you'll just want
     to compute some indexes (or something like that) after mapping.

   However, the fact of the matter is that most of the dictionary will
be read into memory anyway if it is available. If it is not available
then there would be a good deal of disk swapping. Making characters
32-bit wide will increase the chance of even more disk swapping. So the
bottom line is that it will be cheaper to convert the characters from
something like UTF-8 into some sort of wide character. I could also use
some sort of disk-based lookup table such as the Berkeley Database.
However this will *definitely* degrade performance.

   The bottom line is that keeping Aspell 8-bit internally is a very
well thought out decision that is not likely to change any time soon.
Feel free to challenge me on it, but don't expect me to change my mind
unless you can bring up some point that I have not thought of before
and quite possibly a patch that cleanly converts Aspell to Unicode
internally without a serious performance loss OR a serious increase in
memory usage.

-- 
http://kevin.atkinson.dhs.org