[UTF-8] Aspell and UTF-8/Unicode

Kevin Atkinson kevin@atkinson.dhs.org
Sun, 15 Feb 2004 12:57:57 -0500 (EST)


On Sun, 15 Feb 2004, Elias Martenson wrote:

> On 2004-02-15 at 16:56, Kevin Atkinson wrote:
> 
> First of all, I understand what your stance is, and I can understand
> that you feel the potential gain from using Unicode is not worth the
> (major) undertaking in changing the code. I'm just trying to explain
> myself here. :-)  A lot of people use Aspell and like it, so you
> obviously have something good going here. Note that "bad software"
> refers to "bad UTF-8 support", not "this application sucks". In general,
> only good applications (apart from the problems with UTF-8) are on the
> list.
> 
> > Well, if implemented carefully, probably not.  But it will be far from a
> > trivial task.  And I just don't have the time.  Almost all languages that
> > can be spell checked fit inside an 8-bit character set.
> 
> Not really. There are three different cases where 8 bits are not
> sufficient:
> 
> 1) The language in question has more than 255 characters. One such
> example is Ethiopian.

In that case a conversion to use Unicode would be sufficient.

> 2) There exists a need to combine multiple alphabets. Japanese is one
> example, and as Danilo mentioned, Serbian uses both Latin and Cyrillic
> characters.

Well, one approach is to put them into the same dictionary, but that is not
necessarily the best approach.  An alternative, more efficient approach is
to use one dictionary for each alphabet and pick which dictionary to
use based on the alphabet of the word in question.

> 3) There is a need for combining marks. There is a set of characters
> that are used in various languages but which do not have a code point
> in Unicode. These are built up using combining marks. Again, according
> to Danilo, these are used in Serbian.

Aspell is not designed to handle this situation at the moment, with or 
without Unicode support.
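
To make the combining-mark case concrete, here is a small sketch using
Python's standard unicodedata module (the helper name code_points and the
two sample strings are my own choices for illustration).  NFC
normalization folds "e" plus a combining acute accent into one
precomposed code point, but Cyrillic "а" plus a combining grave accent
has, as far as I know, no precomposed form, so it stays as two code
points even after normalization:

```python
import unicodedata

def code_points(text):
    """NFC-normalize text and list (code point, is_combining_mark) pairs."""
    nfc = unicodedata.normalize("NFC", text)
    return [(f"U+{ord(ch):04X}", unicodedata.combining(ch) != 0) for ch in nfc]

# Latin e + COMBINING ACUTE ACCENT: a precomposed character exists.
latin = code_points("e\u0301")

# Cyrillic a + COMBINING GRAVE ACCENT: assumed to have no precomposed form.
cyrillic = code_points("\u0430\u0300")

print(latin)
print(cyrillic)
```

So a checker that assumes one code point per letter cannot represent the
second case at all, no matter which encoding it uses internally.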

> Thai, however, brings us to another problem which is not really Unicode
> related. Many languages don't use spaces to separate words.

Spell checking this requires a completely different approach.

> Thai is one
> such language, but also my native language, Swedish, is terribly hard to
> spell check well, since we combine several words into new ones on the fly
> and most of these combinations couldn't be put in a word list since it
> would simply be too large.

Well, Aspell might be able to handle this in the future.  See my to-do list at
http://aspell.net
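
For what it's worth, one simple way to handle run-together words can be
sketched as follows.  This is only an illustration, not how Aspell works
or will work; WORDS, accepts, and min_part are hypothetical names, and
the sketch ignores linking letters between parts, which real Swedish
compounds sometimes use.  A word is accepted if it can be split entirely
into words from the list:

```python
from functools import lru_cache

# Hypothetical base word list; real compounds would draw on a full dictionary.
WORDS = frozenset({"sjuk", "hus", "läkare", "barn"})

def accepts(word, words=WORDS, min_part=3):
    """Return True if word is a listed word or a concatenation of listed words."""
    @lru_cache(maxsize=None)
    def splits_from(i):
        # True if word[i:] can be covered completely by listed words.
        if i == len(word):
            return True
        for j in range(i + min_part, len(word) + 1):
            if word[i:j] in words and splits_from(j):
                return True
        return False

    return len(word) > 0 and splits_from(0)
```

With the list above, "sjukhus" and "barnsjukhus" are accepted without
being listed themselves, while "sjukhux" is not.  The min_part limit is
there because very short parts make almost any string look like a valid
compound.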

-- 
http://kevin.atkinson.dhs.org