[UTF-8] Aspell and UTF-8/Unicode

Elias Martenson elias-m@algonet.se
Sun, 15 Feb 2004 18:22:41 +0100


On Sun 2004-02-15 at 16:56, Kevin Atkinson wrote:

First of all, I understand what your stance is, and I can understand
that you feel the potential gain from using Unicode is not worth the
(major) undertaking in changing the code. I'm just trying to explain
myself here. :-)  A lot of people use Aspell and like it, so you
obviously have something good going here. Note that "bad software"
refers to "bad UTF-8 support", not "this application sucks". In general,
only good applications (apart from the problems with UTF-8) are on the
list.

> Well if implemented carefully probably not.  But it will be far from a
> trivial task.  And I just don't have the time.  Most all languages that
> can be spell checked fit inside an 8-bit character set.

Not really. There are three different cases where 8 bits are not
sufficient:

1) The language in question has more than 256 characters, which is all
an 8-bit character set can hold. One such example is Ethiopian.

2) There exists a need to combine multiple alphabets. Japanese is one
example, and as Danilo mentioned, Serbian uses both Latin and Cyrillic
characters.

3) There is a need for combining marks. There is a set of characters
that are used in various languages but which do not have a code point
of their own in Unicode. These are built up using combining marks.
Again, according to Danilo, these are used in Serbian.

There are also other languages which use combining marks heavily, for
example Thai. I believe pretty much every character in Thai text is
followed by one or more combining marks.
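
To make cases 2 and 3 a bit more concrete, here is a small Python
sketch. It is only an illustration and has nothing to do with Aspell's
own code; the helper names fits() and count_marks() are made up for
this example, and the sample strings are mine, not Danilo's.

```python
import unicodedata

def fits(text, encoding):
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

def count_marks(text):
    """Count code points whose general category is a mark (Mn, Mc, Me)."""
    return sum(1 for ch in text if unicodedata.category(ch).startswith("M"))

# Case 2: one text with both Latin (with diacritics) and Cyrillic letters.
# ISO-8859-2 lacks the Cyrillic letters, ISO-8859-5 lacks S-with-caron.
mixed = "\u0160abac \u0428\u0430\u0431\u0430\u0446"
print(fits(mixed, "iso-8859-2"), fits(mixed, "iso-8859-5"), fits(mixed, "utf-8"))

# Case 3: NFC can compose e + combining acute into a single code point,
# but q + combining tilde has no precomposed form and stays two code points.
composed = unicodedata.normalize("NFC", "e\u0301")
uncomposed = unicodedata.normalize("NFC", "q\u0303")
print(len(composed), len(uncomposed))

# A Thai syllable: one consonant followed by a vowel sign and a tone mark.
thai = "\u0e17\u0e35\u0e48"
print(len(thai), count_marks(thai))
```

The point is simply that a checker working on one byte per character
has no way to represent the mixed string at all, and one working on one
code point per character still has to decide how to treat the marks.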

Thai, however, brings us to another problem which is not really Unicode
related. Many languages don't use spaces to separate words. Thai is one
such language. My native language, Swedish, does use spaces but is still
terribly hard to spell check well, since we combine several words into
new ones on the fly, and most of these combinations couldn't be put in a
word list since it would simply be too large.
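
As a rough sketch of what I mean for Swedish (again Python, again not
how Aspell does or should do it): instead of listing every compound,
a checker could try to split an unknown word into parts that are in the
word list. The function split_compound() and the three-word lexicon are
hypothetical, and real Swedish compounding also has linking letters and
other rules that this ignores.

```python
def split_compound(word, lexicon, min_part=2):
    """Return one way to split word into lexicon entries, or None."""
    if word in lexicon:
        return [word]
    # Try every split point that leaves at least min_part letters on each side.
    for i in range(min_part, len(word) - min_part + 1):
        head = word[:i]
        if head in lexicon:
            rest = split_compound(word[i:], lexicon, min_part)
            if rest is not None:
                return [head] + rest
    return None

# Tiny made-up word list; a real one would have many thousands of entries.
lexicon = {"sjuk", "hus", "personal"}

print(split_compound("sjukhuspersonal", lexicon))
print(split_compound("sjukhux", lexicon))
```

With this lexicon the first call finds the parts sjuk, hus and
personal, and the second returns None, so the misspelling is still
caught without "sjukhuspersonal" ever being in the list.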

Am I correct in the assertion that doing something to deal with both of
these issues would take rewriting the core entirely?

Regards

Elias Mårtenson