[UTF-8] Aspell and UTF-8/Unicode

Kevin Atkinson kevin@atkinson.dhs.org
Sun, 15 Feb 2004 04:54:48 -0500 (EST)


Since my program Aspell made it onto the "Bad Software" list, I thought
I would clarify the situation.

The Aspell _library_ supports Unicode fairly well.  All one has to do
is set the encoding and everything is converted on the fly.
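
   For example, with the C API the encoding can be set on the speller's
configuration before the speller is created.  This is a minimal sketch;
the language and the word being checked are only placeholders:

     #include <stdio.h>
     #include <string.h>
     #include <aspell.h>

     int main(void) {
       AspellConfig *config = new_aspell_config();
       aspell_config_replace(config, "lang", "en_US");     /* placeholder */
       aspell_config_replace(config, "encoding", "utf-8"); /* I/O in UTF-8 */

       AspellCanHaveError *ret = new_aspell_speller(config);
       if (aspell_error_number(ret) != 0) {
         fprintf(stderr, "%s\n", aspell_error_message(ret));
         return 1;
       }
       AspellSpeller *speller = to_aspell_speller(ret);

       const char *word = "h\xC3\xA9llo";   /* "héllo" encoded as UTF-8 */
       int ok = aspell_speller_check(speller, word, (int)strlen(word));
       printf("%s is %s\n", word, ok == 1 ? "correct" : "not correct");

       delete_aspell_speller(speller);
       delete_aspell_config(config);
       return 0;
     }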

   The Aspell _utility_ doesn't really support Unicode at this time,
although if you set the encoding to UTF-8 the 'PIPE' mode will probably
work and the 'CHECK' mode will sort of work.  The issue with 'check'
mode is getting Unicode to work correctly with the curses library.
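
   For instance, something along these lines should drive the 'PIPE'
mode with UTF-8 input (a rough sketch for a POSIX system; it assumes
the `--encoding' option and lets Aspell's replies go straight to
stdout):

     #include <stdio.h>

     int main(void) {
       /* Start the Aspell utility in Ispell-compatible pipe mode with
          the encoding set to UTF-8; its replies print to our stdout. */
       FILE *p = popen("aspell -a --encoding=utf-8", "w");
       if (!p) { perror("popen"); return 1; }
       fputs("caf\xC3\xA9 misspeled\n", p);  /* "café misspeled" in UTF-8 */
       return pclose(p);
     }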

   The _rest_ of the Aspell utility does not support Unicode.  The
encoding used is whatever the dictionary is encoded in.

   _Internally_ the core Aspell library (everything in
modules/speller/default) is 8-bit and is very likely to stay that way.
*Note Notes on 8-bit Characters::.

   However, that _doesn't_ mean that the end user has to know this.
Everything read in by Aspell and everything printed out can be in UTF-8
or some other encoding.  The only people who _really_ have to be aware
of the 8-bit issue are the dictionary maintainers, as they must choose
an 8-bit character set for Aspell to use.  Even the word list can be in
Unicode, as Aspell can convert it when creating the dictionary.

   But Aspell is not there yet.  There are several issues in supporting
Unicode.  The main issue is converting everything read in and printed
out from the internal 8-bit encoding to Unicode.  Another issue is
knowing what encoding the various data files are in.  There is no
reason they can't be in Unicode, but Aspell has to know to convert
them, so they need to be tagged in some way.  By data files I mean any
human-readable files that Aspell uses.
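
   On POSIX systems the conversion at the input/output boundary could
be done with something like `iconv'.  A rough sketch, where ISO-8859-1
stands in for whatever 8-bit encoding the dictionary happens to use:

     #include <iconv.h>
     #include <string.h>

     /* Convert a UTF-8 string into the dictionary's 8-bit encoding
        (ISO-8859-1 here is only an example).  Returns 0 on success. */
     int to_internal(const char *utf8, char *out, size_t outsize) {
       iconv_t cd = iconv_open("ISO-8859-1", "UTF-8");
       if (cd == (iconv_t)-1) return -1;
       char *inp = (char *)utf8;
       char *outp = out;
       size_t inleft = strlen(utf8), outleft = outsize - 1;
       size_t r = iconv(cd, &inp, &inleft, &outp, &outleft);
       iconv_close(cd);
       if (r == (size_t)-1) return -1;  /* e.g. no 8-bit equivalent */
       *outp = '\0';
       return 0;
     }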

   Ideally the Aspell utility would use `LC_CTYPE' for the encoding;
however, using `LC_CTYPE' to figure out how the data files are encoded
will lead to nothing but trouble.
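
   For the utility's own terminal input and output, the locale can at
least be queried.  A sketch of what using `LC_CTYPE' might look like:

     #include <locale.h>
     #include <langinfo.h>
     #include <stdio.h>

     int main(void) {
       /* Honor the user's locale, then ask what encoding it implies. */
       setlocale(LC_CTYPE, "");
       printf("terminal encoding: %s\n", nl_langinfo(CODESET));
       /* Reasonable for terminal I/O, but it says nothing about how a
          given data file on disk is encoded -- hence the need to tag
          the files themselves. */
       return 0;
     }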

   So, I _hope_ to get all this done by Aspell 0.51, but I don't know
if I will.  If someone would like to assist me in this task, I would
really appreciate it.

Notes on 8-bit Characters
=========================

There is a very good reason I use 8-bit characters in Aspell: speed and
simplicity.  While many parts of my code could fairly easily be
converted to some sort of wide character, as my code is clean, other
parts cannot be.

   One of the reasons is that in many, many places I use a direct
lookup to find out various information about characters.  With 8-bit
characters this is very feasible because there are only 256 of them.
With 16-bit wide characters this would waste a LOT of space.  With
32-bit characters it is just plain impossible.  Converting the lookup
tables to some other form, while certainly possible, would degrade
performance significantly.
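
   As an illustration of the kind of table involved (a generic sketch,
not Aspell's actual tables):

     /* One byte of classification data per character: 256 entries in
        total, so the whole table is tiny and a lookup is one array
        index with no searching. */
     static unsigned char char_info[256];

     enum { IS_LETTER = 0x01, IS_UPPER = 0x02 };

     static int is_letter(unsigned char c) {
       return char_info[c] & IS_LETTER;
     }

     /* The same direct-lookup idea with 32-bit characters would need
        2^32 entries (4 GiB per table), or a slower multi-level or
        hashed scheme instead. */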

   Furthermore, some of my algorithms rely on words consisting only of
a small number of distinct characters (often around 30 when case and
accents are not considered).  When a character can be any Unicode
character, this number becomes several thousand, if not more.  In order
for these algorithms to still be used, some sort of limit would need to
be placed on the possible characters a word can contain.  If I impose
that limit, I might as well use some sort of 8-bit character set, which
automatically places a limit on what the characters can be.
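
   A generic sketch of why the small alphabet matters (again, not
Aspell's actual data structures): with roughly 30 distinct letters,
per-character and per-pair data fit in tiny tables.

     #define ALPHABET 30   /* hypothetical size of a language's alphabet */

     /* Some score kept for every pair of letters: 900 bytes in all. */
     static unsigned char pair_data[ALPHABET][ALPHABET];

     /* With thousands of possible Unicode letters the same table would
        need millions of entries, so some limit on the character set --
        in effect an 8-bit character set -- is needed anyway. */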

   There is also the issue of how I should store the word lists in
memory.  As strings of 32-bit wide characters?  That uses up 4 times
more memory than 8-bit characters would, and for languages that can fit
within an 8-bit character set that is, in my view, a gross waste of
memory.  So maybe I should store them in some variable-width format
such as UTF-8.  Unfortunately, way, way too many of my algorithms will
simply not work with variable-width characters without significant
modification, which would very likely degrade performance.  So the
solution is to work with the characters as 32-bit wide characters and
then convert them to a shorter representation when storing them in the
lookup tables.  Now that can lead to an inefficiency.  I could also use
16-bit wide characters; however, that may not be good enough to hold
all future versions of Unicode, and it has the same problems.
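
   A sketch of that approach: decode UTF-8 into 32-bit code points for
processing, then map each code point into the internal 8-bit set when
storing the word.  The decoder below does no validation and the mapping
function is hypothetical:

     #include <stdint.h>
     #include <stddef.h>

     /* Hypothetical mapping to the dictionary's 8-bit set; Latin-1
        happens to match the first 256 code points, 0 = unmappable. */
     static unsigned char to_8bit(uint32_t cp) {
       return cp < 0x100 ? (unsigned char)cp : 0;
     }

     /* Decode one UTF-8 sequence and return its length in bytes. */
     static size_t decode_utf8(const unsigned char *s, uint32_t *out) {
       if (s[0] < 0x80) { *out = s[0]; return 1; }
       if (s[0] < 0xE0) { *out = ((uint32_t)(s[0] & 0x1F) << 6)
                                 | (s[1] & 0x3F); return 2; }
       if (s[0] < 0xF0) { *out = ((uint32_t)(s[0] & 0x0F) << 12)
                                 | ((uint32_t)(s[1] & 0x3F) << 6)
                                 | (s[2] & 0x3F); return 3; }
       *out = ((uint32_t)(s[0] & 0x07) << 18)
              | ((uint32_t)(s[1] & 0x3F) << 12)
              | ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
       return 4;
     }

     /* Convert a UTF-8 word to the internal 8-bit form for storage. */
     static void store_word(const unsigned char *utf8, unsigned char *out) {
       uint32_t cp;
       while (*utf8) {
         utf8 += decode_utf8(utf8, &cp);
         *out++ = to_8bit(cp);
       }
       *out = '\0';
     }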

   In response to the space wasted by storing word lists in some sort
of wide format, someone asked:

     Since hard drives are cheaper and cheaper, you could store the
     dictionary in a usable (uncompressed) form and use it directly
     with memory mapping.  Then the efficiency would directly depend on
     the disk caching method, and only the used parts of the
     dictionaries would really be loaded into memory.  You would no
     longer have to load plain dictionaries into main memory; you would
     just want to compute some indexes (or something like that) after
     mapping.

   However, the fact of the matter is that most of the dictionary will
be read into memory anyway if the memory is available.  If it is not
available, then there would be a good deal of disk swapping.  Making
characters 32 bits wide will increase the chance of more disk swaps.
So the bottom line is that it will be cheaper to convert the characters
from something like UTF-8 into some sort of wide character.  I could
also use some sort of on-disk lookup table such as the Berkeley
Database; however, this would *definitely* degrade performance.
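
   For reference, the memory-mapping approach being discussed looks
roughly like this on a POSIX system (the file name is a placeholder):

     #include <sys/mman.h>
     #include <sys/stat.h>
     #include <fcntl.h>
     #include <stdio.h>
     #include <unistd.h>

     int main(void) {
       /* Map a precompiled dictionary read-only; pages are then loaded
          on demand by the operating system. */
       int fd = open("dictionary.rws", O_RDONLY);   /* placeholder name */
       if (fd < 0) { perror("open"); return 1; }
       struct stat st;
       if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
       const char *base =
           mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
       if (base == MAP_FAILED) { perror("mmap"); return 1; }
       /* ... look up words through `base' here; wider characters mean
          more pages touched per lookup, and thus more paging when
          memory is tight ... */
       munmap((void *)base, st.st_size);
       close(fd);
       return 0;
     }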

   The bottom line is that keeping Aspell 8-bit internally is a very
well thought out decision that is not likely to change any time soon.
Feel free to challenge me on it, but don't expect me to change my mind
unless you can bring up some point that I have not thought of before,
and quite possibly a patch that cleanly converts Aspell to Unicode
internally without a serious performance loss OR a serious memory
usage increase.

--- 
http://kevin.atkinson.dhs.org