[UTF-8] Aspell and UTF-8/Unicode

Danilo Segan dsegan@gmx.net
Sun, 15 Feb 2004 16:45:59 +0100


Hi Kevin,
Nice to see the "Bad Software" page really has some effect :)

[I'm not the authority there, I just have write access to the wiki
page, and will express my *personal* stance below]

Kevin Atkinson <kevin@atkinson.dhs.org> writes:

> Since my program Aspell made it onto the "Bad Software" list, I
> thought I'd clarify the situation.

Since I think it was me who added it to the list, I thought I'd
clarify the reason. :)

[I snipped most of the text that I basically agree with or didn't
want to comment on further -- I didn't do it to take your words out
of context, so if it appears that way anywhere, please excuse me]

Basically, not using UTF-8 is asking for trouble.  I'll describe
the problem I ran into while trying to create a Serbian dictionary
(not yet done; there are many words still to be looked over, and
some things to think over).

Also note that we're not insisting on "Unicode" in general, but
rather on UTF-8 (the UCS/Unicode Transformation Format in 8-bit
units).  Algorithms for doing all sorts of weird things with one
predetermined transformation format like UTF-8 are well established,
and while it isn't simpler for the programmer than e.g. UCS-4, it's
very usable.
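
Just to illustrate how simple it can get: decoding one UTF-8 code
point takes only a few branches and bit masks.  A minimal sketch in C
(mine, not from any particular library; it assumes well-formed input
and skips validation of overlong forms):

    #include <stddef.h>
    #include <stdint.h>

    /* Decode one code point from s; return how many bytes it used. */
    size_t utf8_decode(const unsigned char *s, uint32_t *cp)
    {
        if (s[0] < 0x80) {                       /* 1 byte: ASCII */
            *cp = s[0];
            return 1;
        }
        if ((s[0] & 0xE0) == 0xC0) {             /* 2 bytes */
            *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
            return 2;
        }
        if ((s[0] & 0xF0) == 0xE0) {             /* 3 bytes */
            *cp = ((uint32_t)(s[0] & 0x0F) << 12)
                | ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
            return 3;
        }
        /* 4 bytes */
        *cp = ((uint32_t)(s[0] & 0x07) << 18)
            | ((uint32_t)(s[1] & 0x3F) << 12)
            | ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
        return 4;
    }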

The idea behind "UTF-8 everywhere" is that there's no need for
conversion at all (die, you 8-bit character sets! :), so those who
choose it as a base will surely benefit; it was chosen over the other
transformation formats for its many advantages.  Of course, nobody is
forced to follow this same policy: software may support UTF-8 well
without using it internally, and it wouldn't be put on the Bad
Software page just for that (for instance, grep is listed there only
because it's very slow in UTF-8 locales, and a fix is posted there
as well).


After all, some big collections of software, like everything based on
Gtk+ (Gnome, XFCE, ROX; perhaps KDE/Qt as well?), use UTF-8
internally, and they perform fairly well.  Yes, I understand that a
library has its own separate demands, but there are a lot of
libraries doing this already.

>    However, that _doesn't_ mean that the end user has to know this.
> Everything read in by Aspell and everything printed out can be in UTF-8
> or some other encoding.  The only people that _really_ have to be aware
> of the 8-bit issue are the dictionary maintainers, as they must choose
> an 8-bit character set for Aspell to use.  Even the word list can be in
> Unicode, as Aspell can convert it when creating the dictionary.

Which is exactly the issue -- I, for one, don't know of a single
8-bit character set which supports both Serbian Cyrillic and Serbian
Latin.  And creating a phonetic equivalence between the two scripts
is something users would surely appreciate.  There are some
interesting properties of this biscript situation I was hoping to
make use of (e.g. detecting common misspellings across the Latin and
Cyrillic scripts, and using a single dictionary for both).
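
To make the problem concrete, here is a hypothetical sketch of such
an equivalence table (my illustration, not an actual dictionary
table).  The pairs below cannot coexist in any single 8-bit set --
the Latin side needs ISO-8859-2, the Cyrillic side ISO-8859-5 --
while in UTF-8 they are just strings:

    /* Serbian Latin <-> Cyrillic equivalences (illustrative only). */
    static const char *pairs[][2] = {
        { "š",  "ш" }, { "đ",  "ђ" }, { "č",  "ч" }, { "ć",  "ћ" },
        { "ž",  "ж" }, { "lj", "љ" }, { "nj", "њ" }, { "dž", "џ" },
    };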

The other thing is that I want to work with composite characters:
accented Cyrillic letters, which are not available precomposed in
UCS, nor in any of the 8-bit sets.
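
For example, an accented е can only be written as the base letter
followed by U+0301 COMBINING ACUTE ACCENT, which in UTF-8 is simply
two code points in a row:

    /* U+0435 CYRILLIC SMALL LETTER IE + U+0301 COMBINING ACUTE ACCENT */
    static const char e_acute[] = "\xD0\xB5\xCC\x81";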

So, I figured using UTF-8 was the best choice.  Unfortunately, aspell
didn't let that happen.  Which leads me back to my premise: not using
UTF-8 is asking for trouble, because it's bad to make so many
assumptions TODAY [it was probably a good choice a couple of
years back, but today it is not].

>    One of the reasons is that in many, many places I use a direct
> lookup to find out various information about characters.  With 8-bit
> characters this is very feasible because there are only 256 of them.
> With 16-bit-wide characters this would waste a LOT of space.  With
> 32-bit characters it is just plain impossible.  Converting the lookup
> tables to some other form, while certainly possible, would degrade
> performance significantly.
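
For concreteness, here is roughly what that direct lookup looks like,
next to one possible "other form" for the full Unicode range -- a
two-level, demand-allocated table [my sketch, not Aspell's actual
data structures]:

    #include <stdint.h>

    /* 8-bit internal encoding: one flat array, one memory access. */
    static uint8_t char_info_8bit[256];

    /* Full Unicode: a two-level table.  Pages are allocated only for
       the blocks a dictionary actually uses, so memory stays modest,
       at the cost of one extra indirection per lookup. */
    static uint8_t *pages[0x110000 / 0x100];

    static uint8_t char_info(uint32_t cp)
    {
        uint8_t *page = pages[cp >> 8];
        return page ? page[cp & 0xFF] : 0;   /* 0 = default flags */
    }

The indirection costs one extra memory access per character, which
doesn't strike me as obviously "significant".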

Conversion from any other encoding to an internal 8-bit encoding is
just as slow -- each character in, e.g., UTF-8 encoded text has to be
decoded and then looked up in the chosen 8-bit table (which usually
means scanning most of the table until a match is found).  So, if
your tools are able to make use of iconv() or something similar to
convert input into the internal 8-bit set, the "significant
performance degradation" is probably going to be at most a factor of
two -- all that table lookup takes roughly as much time as iconv()
does.  Unless, of course, you get passed a string which is already in
UTF-8, so no iconv() is necessary, and you get the same execution
time as currently :)
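
Roughly what I mean by that conversion step, as a sketch [error
handling trimmed, and ISO-8859-2 picked arbitrarily as the internal
8-bit set -- this is not aspell's actual code]:

    #include <iconv.h>
    #include <string.h>

    /* Convert a UTF-8 string into an 8-bit set; returns bytes written. */
    static size_t utf8_to_8bit(const char *in, char *out, size_t outsz)
    {
        iconv_t cd = iconv_open("ISO-8859-2", "UTF-8");
        char  *inp = (char *)in, *outp = out;
        size_t inleft = strlen(in), outleft = outsz;
        iconv(cd, &inp, &inleft, &outp, &outleft);
        iconv_close(cd);
        return outsz - outleft;
    }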

[Of course, this is just a rough estimate and a lot of guesswork -- I
have never profiled or even looked at the aspell code, so I may be
entirely wrong.  I agree this is not simpler, but performance can
then be optimized for the case where the input text is in UTF-8
itself, which is the ideal IMO]

> Unfortunately, way, way too many of my algorithms will simply
> not work with variable-width characters without significant
> modification, which will very likely degrade performance.

I understand that you lack the time to do such significant
modifications.  Yet that means aspell stays on the "Bad Software"
list.

Perhaps we should add your comments to the Wiki Page somewhere? 
Noah, what do you think -- should I simply add a link to the message
in the archives?

>    The bottom line is that keeping Aspell 8-bit internally is a very
> well thought out decision that is not likely to change any time soon.
> Feel free to challenge me on it, but don't expect me to change my mind
> unless you can bring up some point that I have not thought of before,
> and quite possibly a patch to cleanly convert Aspell to Unicode
> internally without a serious performance loss OR a serious memory
> usage increase.

I'm sorry to say that I lack the time to actually provide even a
start for the patch.  Yet, I hope my points above will make you at
least reconsider your stance.

I really appreciate your taking the time to respond here -- thank you
very much for all the explanations you've posted.

Cheers,
Danilo