[UTF-8] Aspell and UTF-8/Unicode

Kevin Atkinson kevin@atkinson.dhs.org
Sun, 15 Feb 2004 12:02:52 -0500 (EST)


On Sun, 15 Feb 2004, Danilo Segan wrote:

> Which is exactly the issue -- I, for one, don't know of a single
> 8-bit character set which supports both Serbian Cyrillic and Serbian
> Latin.  And creating a phonetic equivalence is something users would
> surely appreciate.  There are some interesting properties of this 
> biscript I was hoping to make use of (i.e. detect common misspellings
> in Latin script vs. Cyrillic, and use a single dictionary for both of
> these).

Can you please email me the specifics of exactly what you want to do?

> Conversion from any other encoding to internal 8-bit encoding is just
> as slow -- each character in e.g. UTF-8 encoded text has to be
> decoded, looked up in the chosen 8-bit table (which usually means
> going through most of the table, until a match is found).  

The text is converted ONCE as it is read in.  If you knew how Aspell 
checks documents you would know that this one-time conversion is 
definitely not a bottleneck.
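
To illustrate the point, here is a minimal sketch of a one-time conversion
at read-in.  It is NOT Aspell's actual code: the function names and the
choice of ISO-8859-5 are assumptions made for the example.  It only shows
that the reverse table can be built once as a hash map, so each character
costs a single lookup rather than a scan through the table.

```python
# Hypothetical sketch, not Aspell's real conversion routine.

def build_encode_table(charset):
    """Build a character -> byte map for an 8-bit charset, once, up front."""
    table = {}
    for byte in range(256):
        try:
            ch = bytes([byte]).decode(charset)
        except UnicodeDecodeError:
            continue  # this byte value is unassigned in the charset
        table[ch] = byte
    return table


def to_8bit(text, table, fallback=ord("?")):
    """Convert decoded text to the 8-bit charset, one hash lookup per character."""
    return bytes(table.get(ch, fallback) for ch in text)


# Assumed charset for the example: ISO-8859-5 (Cyrillic).
ISO_8859_5 = build_encode_table("iso8859_5")

# The UTF-8 input is decoded and converted once, as it is read in.
utf8_input = "Привет".encode("utf-8")
internal = to_8bit(utf8_input.decode("utf-8"), ISO_8859_5)
```

Characters that have no slot in the 8-bit charset fall back to "?" here,
which is just one possible policy.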

> So, if
> your tools are able to make use of iconv() or something to convert
> input into internal 8-bit, the "significant performance degradation"
> is probably going to be at most double -- all that looking up a table
> is going to take as much time as iconv().  Unless, of course, you get
> passed a string which is in UTF-8, so no iconv() necessary, and you
> get the same execution time as currently :)
> 
> [Of course, this is just rough estimate and a lot of guesswork -- I
> never profiled or looked at the aspell code, so I may be entirely
> wrong -- I agree this is not simpler, but performance can be optimized
> for another use case of input text in UTF-8 itself, which is the
> ideal IMO]

Yes it is.  Aspell does more than merely look for a word in some hash 
table.  For one thing it has to deal with capitalization.
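
As a rough illustration of why capitalization makes the lookup more than
a single hash probe, here is a sketch of one common policy.  It is an
assumption for the example, not a description of Aspell's actual rules:
a lowercase entry also matches its Capitalized and ALL-CAPS forms, while
a capitalized entry (a proper noun) does not match its lowercase form.

```python
# Hypothetical sketch of a case-aware lookup; the policy is assumed.

def is_known(word, dictionary):
    """Return True if `word` is acceptable under the assumed case policy."""
    if word in dictionary:
        return True  # exact match, whatever its case
    if word == word.capitalize():
        # "Hello" is fine if "hello" is a dictionary word.
        return word.lower() in dictionary
    if word.isupper():
        # "HELLO" matches "hello"; "PARIS" matches "Paris".
        return word.lower() in dictionary or word.capitalize() in dictionary
    return False  # mixed case such as "hELLO", or lowercase proper noun


words = {"hello", "Paris"}
results = {w: is_known(w, words) for w in ["hello", "Hello", "HELLO", "Paris", "PARIS", "paris", "hELLO"]}
```

Even this simplified version needs up to three probes per word, and each
one relies on case mapping for the character set in use.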

Also, only the Aspell CORE is 8-bit (well, other parts may be, but 
converting them will not be difficult at all), in particular the stuff in 
modules/speller/default/.  The core consists of checking if a word is in 
the dictionary and coming up with suggestions.  In principle other spell 
checker engines can be used, perhaps one that is fully Unicode but 
doesn't have all the features that the core Aspell speller has.
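
To make the two jobs of the core concrete, here is a minimal sketch of a
swappable engine with exactly those two operations.  The class and method
names are hypothetical and do not correspond to Aspell's real module
interface.  The suggestion logic simply borrows Python's difflib and is
far cruder than what the core Aspell speller does.

```python
# Hypothetical sketch of a pluggable, fully Unicode engine.
import difflib


class SimpleUnicodeEngine:
    """Toy engine: works on Unicode strings, has none of the core's features."""

    def __init__(self, words):
        self.words = set(words)

    def check(self, word):
        """Job 1: is the word in the dictionary?"""
        return word in self.words

    def suggest(self, word, limit=3):
        """Job 2: propose near matches for a misspelled word."""
        return difflib.get_close_matches(word, sorted(self.words), n=limit, cutoff=0.6)


engine = SimpleUnicodeEngine(["hello", "help", "Београд", "Beograd"])
ok = engine.check("Београд")
guesses = engine.suggest("helo")
```

Anything that provides the same two operations could stand in for the
8-bit core, at the cost of whatever features it lacks.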



-- 
http://kevin.atkinson.dhs.org