[utf-8] Language Info Needed for GNU Aspell
elylevy at cs.huji.ac.il
Tue Mar 23 12:44:29 PST 2004
well, for Hebrew I suggest taking a look at hspell,
which also has a GPLed Hebrew word list and some interesting features
for Semitic languages
On Tue, 23 Mar 2004, Kevin Atkinson wrote:
> [Please distribute this document as widely as possible.]
> GNU Aspell 0.60 should be able to support most of the world's languages.
> This includes languages written in Arabic and other scripts
> not well supported by any existing 8-bit character set. Eventually
> Aspell should be able to support any current language not based on the
> Chinese writing system.
> GNU Aspell is a spell checker designed to eventually replace Ispell.
> Its main feature is that it does a much better job of coming up with
> possible suggestions than just about any other spell checker out there
> for the English language, including Ispell and Microsoft Word.
> However, starting with Aspell 0.60 it should also be the only Free (as
> in Freedom) spell checker that can support most languages not written in
> the Latin or Cyrillic scripts.
> However I, the author of Aspell, know very little about foreign
> languages (i.e. non-English ones) and what it takes to correctly spell
> check them. Thus, I need other people to educate me.
> If you speak a foreign language I would appreciate it if you would take
> the time to look over the following material and email me with any
> additional information you may have.
> The first part gives a thorough analysis of the languages which Aspell
> can and cannot support. If you find any of this information is
> incorrect please inform me at kevina at gnu.org.
> When Aspell 0.60 is released I would like to have dictionaries
> available for as many languages as possible.
> Therefore, if you know of a Free word list available for a language that
> is not currently listed as having a dictionary available, I would
> appreciate hearing from you. I am especially interested in working
> with someone to add support for languages written in the Arabic
> script. The encoding of Arabic is quite complicated and I want to
> be sure that Aspell can correctly handle it.
> I would also appreciate some help converting Ispell dictionaries to
> Aspell. So, if you would like to help convert some of the dictionaries
> listed as being available for Ispell please contact me.
> The second part lists language-related issues involved in correctly
> spell checking a document. If you can offer any additional insight on
> any of the issues discussed, or know of any additional complications
> when spell checking a given language, I would appreciate hearing from
> you. The last part discusses, for your reading pleasure, why Aspell
> uses 8-bit characters internally.
> All of this material is also included in the Aspell 0.60 manual which
> you can find at http://aspell.net/devel-doc/man.
> Languages Which Aspell can Support
> Even though Aspell will remain 8-bit internally, it should still be
> able to support any written language not based on a logographic
> script. The only logographic writing systems in current use are those
> based on hànzi, which include Chinese, Japanese, and sometimes Korean.
> Aspell 0.60 should be able to support the following languages as, to the
> best of my knowledge, they all contain 220 or fewer symbols and can
> thus fit within an 8-bit character set. If an existing character set
> does not exist, a new one can be invented. This is true even if the
> script is not yet supported by Unicode, as the private use area can be
> used.
> Code Language Name Script Dictionary Gettext
> Available Translation
> aa Afar Latin - -
> ab Abkhazian Cyrillic - -
> ae Avestan Avestan - -
> af Afrikaans Latin Yes -
> ak Akan Latin - -
> an Aragonese Latin - -
> ar Arabic Arabic - -
> as Assamese Bengali - -
> av Avar Cyrillic - -
> ay Aymara Latin - -
> az Azerbaijani Cyrillic - -
> az Latin - -
> ba Bashkir Cyrillic - -
> be Belarusian Cyrillic Planned Yes
> bg Bulgarian Cyrillic Yes -
> bh Bihari Devanagari - -
> bi Bislama Latin - -
> bm Bambara Latin - -
> bn Bengali Bengali Planned -
> bo Tibetan Tibetan - -
> br Breton Latin Yes -
> bs Bosnian Latin - -
> ca Catalan/Valencian Latin Yes -
> ce Chechen Cyrillic - -
> ch Chamorro Latin - -
> co Corsican Latin - -
> cr Cree Latin - -
> cs Czech Latin Yes Yes
> cu Old Slavonic Cyrillic - -
> cv Chuvash Cyrillic - -
> cy Welsh Latin Yes -
> da Danish Latin Yes -
> de German Latin Yes Yes
> dv Divehi Thaana - -
> dz Dzongkha Tibetan - -
> ee Ewe Latin - -
> el Greek Greek Yes -
> en English Latin Yes Yes
> eo Esperanto Latin Yes -
> es Spanish Latin Yes Incomplete
> et Estonian Latin Planned -
> eu Basque Latin - -
> fa Persian Arabic - -
> ff Fulah Latin - -
> fi Finnish Latin Planned -
> fj Fijian Latin - -
> fo Faroese Latin Yes -
> fr French Latin Yes Yes
> fy Frisian Latin - -
> ga Irish Latin Yes Yes
> gd Scottish Gaelic Latin Planned -
> gl Gallegan Latin Yes -
> gn Guarani Latin - -
> gu Gujarati Gujarati - -
> gv Manx Latin Planned -
> ha Hausa Latin - -
> he Hebrew Hebrew Planned -
> hi Hindi Devanagari - -
> ho Hiri Motu Latin - -
> hr Croatian Latin Yes -
> ht Haitian Creole Latin - -
> hu Hungarian Latin Planned -
> hy Armenian Armenian - -
> hz Herero Latin - -
> ia Interlingua (IALA) Latin Yes -
> id Indonesian Latin Yes -
> ie Interlingue Latin - -
> ig Igbo Latin - -
> ik Inupiaq Latin - -
> io Ido Latin - -
> is Icelandic Latin Yes -
> it Italian Latin Yes -
> iu Inuktitut Latin - -
> jv Javanese Javanese - -
> jv Latin - -
> ka Georgian Georgian - -
> kg Kongo Latin - -
> ki Kikuyu/Gikuyu Latin - -
> kj Kwanyama Latin - -
> kk Kazakh Cyrillic - -
> kl Kalaallisut/Greenlandic Latin - -
> kn Kannada Kannada - -
> ko Korean Hangeul - -
> kr Kanuri Latin - -
> ks Kashmiri Arabic - -
> ks Devanagari - -
> ku Kurdish Arabic - -
> ku Cyrillic - -
> ku Latin - -
> kv Komi Cyrillic - -
> kw Cornish Latin - -
> ky Kirghiz Cyrillic - -
> ky Latin - -
> la Latin Latin - -
> lb Luxembourgish Latin Planned -
> lg Ganda Latin - -
> li Limburgan Latin - -
> ln Lingala Latin - -
> lo Lao Lao - -
> lt Lithuanian Latin Planned -
> lu Luba-Katanga Latin - -
> lv Latvian Latin - -
> mg Malagasy Latin - -
> mh Marshallese Latin - -
> mi Maori Latin Yes -
> mk Macedonian Cyrillic - -
> ml Malayalam Latin - -
> ml Malayalam - -
> mn Mongolian Cyrillic - -
> mn Mongolian - -
> mo Moldavian Cyrillic - -
> mr Marathi Devanagari - -
> ms Malay Latin Yes -
> mt Maltese Latin Planned -
> my Burmese Myanmar - -
> na Nauruan Latin - -
> nb Norwegian Bokmal Latin Yes -
> nd North Ndebele Latin - -
> ne Nepali Devanagari - -
> ng Ndonga Latin - -
> nl Dutch Latin Yes Yes
> nn Norwegian Nynorsk Latin Yes -
> nr South Ndebele Latin - -
> nv Navajo Latin - -
> ny Nyanja Latin - -
> oc Occitan/Provencal Latin - -
> or Oriya Oriya - -
> os Ossetic Cyrillic - -
> pa Punjabi Gurmukhi - -
> pi Pali Devanagari - -
> pi Sinhala - -
> pl Polish Latin Yes -
> ps Pushto Arabic - -
> pt Portuguese Latin Yes Yes
> qu Quechua Latin - -
> rm Raeto-Romance Latin - -
> rn Rundi Latin - -
> ro Romanian Latin Yes Yes
> ru Russian Cyrillic Yes Yes
> rw Kinyarwanda Latin - -
> sa Sanskrit Devanagari - -
> sc Sardinian Latin - -
> sd Sindhi Arabic - -
> se Northern Sami Latin - -
> sg Sango Latin - -
> si Sinhalese Sinhala - -
> sk Slovak Latin Yes -
> sl Slovenian Latin Yes -
> sm Samoan Latin - -
> sn Shona Latin - -
> so Somali Latin - -
> sq Albanian Latin Planned -
> sr Serbian Cyrillic - Yes
> sr Latin - -
> ss Swati Latin - -
> st Southern Sotho Latin - -
> su Sundanese Latin - -
> sv Swedish Latin Yes -
> sw Swahili Latin Planned -
> ta Tamil Tamil Planned -
> te Telugu Telugu - -
> tg Tajik Latin - -
> tk Turkmen Latin - -
> tl Tagalog Latin - -
> tl Tagalog - -
> tn Tswana Latin - -
> to Tonga Latin - -
> tr Turkish Latin - -
> ts Tsonga Latin - -
> tt Tatar Cyrillic - -
> tw Twi Latin - -
> ty Tahitian Latin - -
> ug Uighur Arabic - -
> ug Cyrillic - -
> ug Latin - -
> uk Ukrainian Cyrillic Yes -
> ur Urdu Arabic - -
> uz Uzbek Cyrillic - -
> uz Latin - -
> ve Venda Latin - -
> vi Vietnamese Latin - -
> vo Volapuk Latin - -
> wa Walloon Latin Planned Incomplete
> wo Wolof Latin - -
> xh Xhosa Latin - -
> yi Yiddish Hebrew - -
> yo Yoruba Latin - -
> za Zhuang Latin - -
> zu Zulu Latin Planned -
> Notes on Latin Languages
> Any word that can be written using one of the Latin ISO-8859 character
> sets (ISO-8859-1,2,3,4,9,10,13,14,15,16) can be written, in decomposed
> form, using the ASCII characters, the 23 additional letters:
> U+00C6 LATIN CAPITAL LETTER AE
> U+00D0 LATIN CAPITAL LETTER ETH
> U+00D8 LATIN CAPITAL LETTER O WITH STROKE
> U+00DE LATIN CAPITAL LETTER THORN
> U+00DF LATIN SMALL LETTER SHARP S
> U+00E6 LATIN SMALL LETTER AE
> U+00F0 LATIN SMALL LETTER ETH
> U+00F8 LATIN SMALL LETTER O WITH STROKE
> U+00FE LATIN SMALL LETTER THORN
> U+0110 LATIN CAPITAL LETTER D WITH STROKE
> U+0111 LATIN SMALL LETTER D WITH STROKE
> U+0126 LATIN CAPITAL LETTER H WITH STROKE
> U+0127 LATIN SMALL LETTER H WITH STROKE
> U+0131 LATIN SMALL LETTER DOTLESS I
> U+0138 LATIN SMALL LETTER KRA
> U+0141 LATIN CAPITAL LETTER L WITH STROKE
> U+0142 LATIN SMALL LETTER L WITH STROKE
> U+014A LATIN CAPITAL LETTER ENG
> U+014B LATIN SMALL LETTER ENG
> U+0152 LATIN CAPITAL LIGATURE OE
> U+0153 LATIN SMALL LIGATURE OE
> U+0166 LATIN CAPITAL LETTER T WITH STROKE
> U+0167 LATIN SMALL LETTER T WITH STROKE
> and the 14 modifiers:
> U+0300 COMBINING GRAVE ACCENT
> U+0301 COMBINING ACUTE ACCENT
> U+0302 COMBINING CIRCUMFLEX ACCENT
> U+0303 COMBINING TILDE
> U+0304 COMBINING MACRON
> U+0306 COMBINING BREVE
> U+0307 COMBINING DOT ABOVE
> U+0308 COMBINING DIAERESIS
> U+030A COMBINING RING ABOVE
> U+030B COMBINING DOUBLE ACUTE ACCENT
> U+030C COMBINING CARON
> U+0326 COMBINING COMMA BELOW
> U+0327 COMBINING CEDILLA
> U+0328 COMBINING OGONEK
> This is a total of 37 additional Unicode code points.
> All ISO-8859 character sets leave the characters 0x00 - 0x1F and 0x80 -
> 0x9F unmapped, as they are generally used as control characters. Of
> those, 0x02 - 0x1F and 0x80 - 0x9F may be mapped to anything in Aspell.
> This is a total of 62 characters which can be remapped in any ISO-8859
> character set. Thus, by remapping 37 of the 62 characters to the
> previously specified Unicode code points, any modified ISO-8859 character
> set can be used for any Latin language covered by ISO-8859. Of course,
> decomposing every single accented character wastes a lot of space, so
> only characters that cannot be represented in precomposed form
> should be broken up. By using this trick it is possible to store
> foreign words in their correctly accented form in the dictionary even if
> the precomposed character is not in the current character set.
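As an illustration of this trick (a sketch in Python, not Aspell's actual code; the charset below is hypothetical):

```python
import unicodedata

def decompose_if_missing(word, charset):
    """Keep precomposed letters the 8-bit charset has; decompose the
    rest into base letter + combining mark via canonical (NFD) form."""
    out = []
    for ch in word:
        if ch in charset:
            out.append(ch)  # precomposed form fits the charset
        else:
            # NFD splits e.g. U+00E5 into U+0061 + U+030A
            out.append(unicodedata.normalize("NFD", ch))
    return "".join(out)

# A hypothetical remapped charset: printable ASCII plus the combining
# ring above, but without the precomposed U+00E5.
charset = set(map(chr, range(0x20, 0x7F))) | {"\u030a"}
print(len(decompose_if_missing("\u00e5", charset)))  # 2 code points
```

Only the letters the charset lacks grow to two code points; everything else stays in its compact precomposed form.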
> Any letter in the Unicode range U+0000 - U+024F, U+1E00 - U+1EFF
> (Basic Latin, Latin-1 Supplement, Latin Extended-A, Latin Extended-B,
> and Latin Extended Additional) can be represented using around 175
> basic letters and 25 modifiers, which is fewer than 220 and can thus fit
> in an Aspell 8-bit character set. Since this Unicode range covers every
> Latin-script language, this special character set can be used to
> represent any word written in the Latin script, if so desired.
> Korean is generally written in hangeul, or a mixture of hanja and hangeul.
> Aspell should be able to spell check the hangeul part of the writing.
> In hangeul, individual letters, known as jamo, are grouped
> together into syllable blocks. Unicode provides code points for both the
> jamo and the combined syllable blocks. The syllable blocks will need to be
> decomposed into jamo in order for Aspell to spell check them.
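The decomposition is pure arithmetic, fixed by the Unicode standard; a sketch (not Aspell code):

```python
import unicodedata

S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28  # vowel and trailing-consonant counts

def decompose_hangeul(ch):
    """Split one precomposed hangeul syllable block into its jamo."""
    s = ord(ch) - S_BASE
    if not 0 <= s < 11172:
        return ch  # not a precomposed syllable block
    lead = chr(L_BASE + s // (V_COUNT * T_COUNT))
    vowel = chr(V_BASE + (s % (V_COUNT * T_COUNT)) // T_COUNT)
    trail = s % T_COUNT
    return lead + vowel + (chr(T_BASE + trail) if trail else "")

# The arithmetic agrees with Unicode's canonical decomposition:
word = "\ud55c"  # HAN, one syllable block
assert decompose_hangeul(word) == unicodedata.normalize("NFD", word)
```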
> Syllabic languages use a separate symbol for each syllable of the
> language. Since most of them have more than 220 distinct characters,
> Aspell cannot support them as is. However, all hope is not lost, as
> Aspell will most likely be able to support them in the future.
> Code Language Name Script
> am Amharic Ethiopic
> cr Cree Canadian Syllabics
> ii Sichuan Yi Yi
> iu Inuktitut Canadian Syllabics
> oj Ojibwa Ojibwe
> om Oromo Ethiopic
> ti Tigrinya Ethiopic
> The Ethiopic Syllabary
> Even though the Ethiopic script has more than 220 distinct characters,
> with a little work Aspell can still handle it. The idea is to split
> each character into two parts based on its matrix representation. The
> first 3 bits will be the first part and could be mapped to `10000???'.
> The next 6 bits will be the second part and could be mapped to
> `11??????'. The combined character will then be mapped with the upper
> bits coming first. Thus each Ethiopic syllable will have the form
> `11?????? 10000???'. By mapping the first and second parts to separate
> 8-bit characters it is easy to tell which part represents the consonant
> and which part represents the vowel of the syllable. This encoding of
> the syllables is far more useful to Aspell than if they were stored in
> UTF-8 or UTF-16. In fact, the existing suggestion strategy of Aspell
> will work well with this encoding without any additional
> modifications. However, additional improvements may be possible by
> taking advantage of the consonant-vowel structure of this encoding.
> In fact, the split consonant-vowel representation may prove to be so
> useful that it may be beneficial to encode other syllabaries in this
> fashion, even when they have fewer than 220 symbols.
> The code to break a syllable up into its consonant and vowel parts does
> not exist as of Aspell 0.60. However, it will be fairly easy to add
> it as part of the Unicode normalization process once that is written.
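A sketch of the scheme (my reading of the description above; the exact bit layout in a future Aspell may differ). Characters in the basic Ethiopic block start at U+1200, with eight vowel forms per consonant row:

```python
def encode_ethiopic(ch):
    """Split an Ethiopic syllable into a consonant byte `11??????'
    followed by a vowel byte `10000???'."""
    off = ord(ch) - 0x1200
    if not 0 <= off < 0x180:
        raise ValueError("not in the basic Ethiopic block")
    consonant, vowel = off >> 3, off & 0x7  # 6-bit row, 3-bit column
    return bytes([0xC0 | consonant, 0x80 | vowel])

# U+1218 ETHIOPIC SYLLABLE MA: consonant row 3, first vowel form.
assert encode_ethiopic("\u1218") == bytes([0xC3, 0x80])
```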
> The Yi Syllabary
> A very large syllabary with 819 distinct symbols. However, like
> Ethiopic, it should be possible to support this script by breaking it up.
> The Unified Canadian Aboriginal Syllabics
> Another very large syllabary.
> The Ojibwe Syllabary
> With only 120 distinct symbols, Aspell can actually support this one as
> is. However, as previously mentioned, it may be beneficial to break it
> up into the consonant-vowel representation anyway.
> These languages, when written in the given script, are currently
> unsupported by Aspell for one reason or another.
> Code Language Name Script
> ja Japanese Japanese
> km Khmer Khmer
> ko Korean Hanja + Hangeul
> pi Pali Thai
> th Thai Thai
> zh Chinese Hànzi
> The Thai and Khmer Scripts
> The Thai and Khmer scripts present a different problem for Aspell. The
> problem is not that there are more than 220 unique symbols, but that
> there are no spaces between words. This means that there is no easy way
> to split a sentence into individual words. However, it is still
> possible to spell check these scripts; it is just a lot more difficult.
> I will be happy to work with someone who is interested in adding Thai
> or Khmer support to Aspell, but it is not likely something I will do in
> the foreseeable future.
> Languages which use Hànzi Characters
> Hànzi characters are used to write Chinese, Japanese, and Korean, and were
> once used to write Vietnamese. Each hànzi character represents a
> syllable of a spoken word and also has a meaning. Since there are
> around 3,000 of them in common usage, it is unlikely that Aspell will
> ever be able to support spell checking languages written using hànzi.
> However, I am not even sure these languages need spell checking, since
> hànzi characters are generally not entered directly. Furthermore,
> even if Aspell could spell check hànzi, the existing suggestion strategy
> would not work well at all, and thus a completely new strategy would need
> to be developed.
> Modern Japanese is written in a mixture of "hiragana", "katakana",
> "kanji", and sometimes "romaji". "Hiragana" and "katakana" are both
> syllabaries unique to Japan, "kanji" is a modified form of hànzi, and
> "romaji" uses the Latin alphabet. With some work, Aspell should be
> able to check the non-kanji parts of Japanese text. However, based on
> my limited understanding of Japanese, hiragana is often attached to the
> end of kanji words. Thus if Aspell were to simply separate out the
> hiragana from the kanji, it would end up with a lot of word endings which
> are not proper words and would thus be flagged as misspellings. However,
> this can be fairly easily rectified, as text is tokenized into words
> before it is converted into Aspell's internal encoding. In fact, some
> Japanese text is written entirely in one script. For example, books for
> children and foreigners are sometimes written entirely in hiragana. Thus,
> Aspell could prove at least somewhat useful for spell checking Japanese.
> Languages Written in Multiple Scripts
> Aspell should be able to check text written in the same language but in
> multiple scripts, with some work. If the number of unique symbols in
> both scripts is fewer than 220, a special character set can be used
> to allow both scripts to be encoded in the same dictionary. However,
> this may not be the most efficient solution. An alternative solution is
> to store each script in its own dictionary and allow Aspell to choose
> the correct dictionary based on which script the given word is written
> in. Aspell currently does not support this mode of spell checking;
> however, it is something that I hope to eventually support.
> Notes on Planned Dictionaries
> be Belarusian Ispell Dictionary available
> bn Bengali Unofficial Aspell Dictionary available
> et Estonian Ispell Dictionary available
> fi Finnish Ispell Dictionary available
> gd Scottish Ispell Dictionary available.
> Gaelic `http://packages.debian.org/unstable/text/igaelic'
> gv Manx Ispell Dictionary available.
> he Hebrew Ispell Dictionary available
> hu Hungarian MySpell dictionary expanded to over 500 MB. Will add
> once affix support is worked into the dictionary
> package system.
> lb Luxembourgish MySpell dictionary planned.
> lt Lithuanian MySpell dictionary expanded to over 500 MB. Will add
> once affix support is worked into the dictionary
> package system.
> mt Maltese Unofficial Aspell Dictionary available, but broken
> link to source.
> sq Albanian Ispell Dictionary available
> sw Swahili Available at
> `http://sourceforge.net/projects/translate'. Official
> version coming soon.
> ta Tamil Word list available at
> Working with them to create an Aspell dictionary.
> wa Walloon Ispell Dictionary available
> zu Zulu Available at
> `http://sourceforge.net/projects/translate'. Official
> version coming soon.
> The information in this chapter was gathered from numerous sources,
> including:
> * ISO 639-2 Registration Authority,
> * Languages and Scripts (Official Unicode Site),
> * Omniglot - a guide to written language, `http://www.omniglot.com/'
> * Wikipedia - The Free Encyclopedia, `http://wikipedia.org/'
> * Ethnologue - Languages of the World, `http://www.ethnologue.com/'
> * World Languages - The Ultimate Language Store,
> * South African Languages Web, `http://www.languages.web.za/'
> * The Languages and Writing Systems of Africa (Global Advisor
> Newsletter), `http://www.intersolinc.com/newsletters/africa.htm'
> Special thanks goes to Era Eriksson for helping me gather the information
> in this chapter.
> Language Related Issues
> Here are some language-related issues that a good spell checker needs to
> handle. If you have any more information about any of these issues, or
> know of a new issue not discussed here, please email me at <kevina at gnu.org>.
> German Sharp S
> The German sharp s, or eszett, does not have an uppercase equivalent.
> Instead, `ß' is converted to `SS'. The conversion of `ß' to `SS'
> requires a special rule, and it increases the length of a word, thus
> disallowing in-place case conversion. Furthermore, my general rule of
> converting all words to lowercase before looking them up in the
> dictionary won't work, because the conversion of `SS' to lowercase is
> ambiguous; it can be either `ss' or `ß'. I do plan on dealing with this
> eventually, however.
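Python's own case conversion exhibits both halves of the problem:

```python
word = "stra\u00dfe"  # straße
upper = word.upper()

print(upper)          # STRASSE: one letter became two
print(upper.lower())  # strasse: the eszett is not recovered

# So lowercasing before dictionary lookup loses information: a German
# dictionary would need to try both the `ss' and `ß' readings of `SS'.
assert upper.lower() != word
```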
> Compound Words
> In some languages, such as German, it is acceptable to string two words
> together, thus forming a compound word. However, there are rules about
> when this can be done. Furthermore, it is not always sufficient to
> simply concatenate the two words. For example, sometimes a letter is
> inserted between them. I tried implementing support for
> compound words in Aspell, but it was too limiting and no one used it.
> Before I try implementing it again I want to know all the issues involved.
> Context Sensitive Spelling
> In some languages, such as Luxembourgish, the spelling of a word depends
> on which words surround it. For example, the letter `n' at the end
> of a word will disappear if the word is followed by another word starting
> with a certain letter, such as an `s'. However, it can probably get
> more complicated than that. I would like to know how complicated before
> I attempt to implement support for context-sensitive spelling.
> Unicode Normalization
> Because Unicode contains a large number of precomposed characters, there
> are multiple ways a character can be represented. For example, the letter
> `å' can be represented either as
> U+00E5 LATIN SMALL LETTER A WITH RING ABOVE
> or as
> U+0061 LATIN SMALL LETTER A + U+030A COMBINING RING ABOVE
> By performing normalization first, Aspell will only see one of these
> representations. The exact form of normalization depends on the
> language. Given the choice of
> 1. Precomposed character
> 2. Base letter + combining character(s)
> 3. Base letter only
> the rule is: if the precomposed character is in the target character set,
> use (1); if both the base and combining characters are present, use (2);
> otherwise use (3).
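That preference order could be sketched as follows (illustrative code, not Aspell's actual normalization; the character sets are hypothetical):

```python
import unicodedata

def best_form(ch, charset):
    """Pick the best representation of a character for a given
    8-bit character set, per the three-way preference above."""
    pre = unicodedata.normalize("NFC", ch)
    if all(c in charset for c in pre):
        return pre   # 1. precomposed character
    dec = unicodedata.normalize("NFD", ch)
    if all(c in charset for c in dec):
        return dec   # 2. base letter + combining character(s)
    return dec[0]    # 3. base letter only

ascii_set = set(map(chr, range(128)))
latin1_like = ascii_set | {"\u00e5"}   # has precomposed å
decomposing = ascii_set | {"\u030a"}   # has the combining ring only

assert best_form("a\u030a", latin1_like) == "\u00e5"
assert best_form("a\u030a", decomposing) == "a\u030a"
assert best_form("a\u030a", ascii_set) == "a"
```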
> Words With Spaces or other Symbols in Them
> Many languages, including English, have words with non-letter symbols in
> them. For example, the apostrophe. These symbols generally appear in
> the middle of a word, but they can also appear at the end, such as in an
> abbreviation. If a symbol can _only_ appear as part of a word, then
> Aspell can treat it as if it were a letter.
> However, the problem is that most of these symbols have other uses. For
> example, the apostrophe is often used as a single quote, and the
> abbreviation marker is also used as a period. Thus, Aspell cannot
> blindly treat them as if they were letters.
> Aspell currently handles the case where the symbol can only appear in
> the middle of a word fairly well. It simply assumes that if there is
> a letter both before and after the symbol, then the symbol is part of
> the word. This works most of the time, but it is not foolproof. For
> example, suppose the user forgot to leave a space after the period:
> ... and the dog went up the tree.Then the cat ...
> Aspell would think "tree.Then" is one word. A better solution might be
> to then try to check "tree" and "Then" separately. But what if one of
> them is not in the dictionary? Should Aspell assume "tree.Then" is one
> word anyway?
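A toy version of the mid-word rule and the retry idea (not Aspell's implementation):

```python
import re

# An apostrophe or period counts as part of a word only when letters
# appear on both sides of it.
WORD = re.compile(r"[A-Za-z]+(?:['.][A-Za-z]+)*")

def check(token, dictionary):
    """If a token with an embedded symbol is unknown, retry its parts."""
    if token.lower() in dictionary:
        return True
    parts = re.split(r"['.]", token)
    return len(parts) > 1 and all(p.lower() in dictionary for p in parts)

words = {"tree", "then", "the", "cat", "don't"}
tokens = WORD.findall("tree.Then the cat")
assert tokens == ["tree.Then", "the", "cat"]
assert check("tree.Then", words)  # both halves are known words
assert check("don't", words)      # found directly
```

The open question from the paragraph above remains: when neither the whole token nor its parts check out, the code cannot tell which grouping to suggest against.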
> The case where the symbol can appear at the beginning or end of the
> word is more difficult to deal with. The symbol may or may not
> actually be part of the word. Aspell currently handles this case by
> first trying to spell check the word with the symbol and if that fails,
> try it without. The problem is, if the word is misspelled, should
> Aspell assume the symbol belongs with the word or not? Currently
> Aspell assumes it does, which is not always the correct thing to do.
> Numbers in words present a different challenge to Aspell. If Aspell
> treats numbers as letters, then every possible number a user might write
> in a document must be specified in the dictionary. This could
> easily be solved by having special code that assumes all numbers are
> correctly spelled. But what about something like "4th"? Since the
> "th" suffix can appear after any number, we are left with the same
> problem. The solution would be to have a special symbol for "any
> number".
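One way to realize that special symbol is a pattern check in the tokenizer rather than dictionary entries (an illustration of the idea, not Aspell's behaviour):

```python
import re

# Accept any integer, optionally followed by an English ordinal
# suffix, without storing every number in the dictionary.
NUMBER = re.compile(r"\d+(?:st|nd|rd|th)?\Z")

def is_number_word(token):
    return NUMBER.match(token) is not None

assert is_number_word("4th")
assert is_number_word("1234")
assert not is_number_word("4x")
```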
> Words with spaces in them, such as foreign phrases, are even more
> trouble to deal with. The basic problem is that when tokenizing a
> string there is no good way to keep phrases together. One solution is to
> use trial and error: if a word is not in the dictionary, try grouping it
> with the previous or next word and see if the combined word is in the
> dictionary. But what if the combined word is not? Should the misspelled
> word be grouped when looking for suggestions? One solution is to also
> store each part of the phrase in the dictionary, but tag it as part of a
> phrase and not an independent word.
> To further complicate things, most applications that use spell
> checkers are accustomed to parsing the document themselves and sending it
> to the spell checker a word at a time. In order to support words with
> spaces in them a more complicated interface will be required.
> Notes on 8-bit Characters
> There is a very good reason I use 8-bit characters in Aspell: speed and
> simplicity. While many parts of my code can fairly easily be
> converted to some sort of wide character, as my code is clean, other
> parts cannot be.
> One reason is that in many, many places I use a direct lookup table
> to find out various information about characters. With 8-bit characters
> this is very feasible because there are only 256 of them. With 16-bit
> wide characters this would waste a LOT of space. With 32-bit characters
> it is just plain impossible. Converting the lookup tables to some
> other form, while certainly possible, would degrade performance.
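The direct-lookup point can be made concrete (a sketch in Python; Aspell itself does this with plain C++ arrays):

```python
# With 8-bit characters, one flat 256-entry table answers a
# per-character question in a single index operation.  The equivalent
# table over all of Unicode would need over a million entries.
is_letter = [False] * 256
for b in range(ord("a"), ord("z") + 1):
    is_letter[b] = is_letter[b - 0x20] = True  # a-z and A-Z
is_letter[0xE5] = is_letter[0xC5] = True       # å / Å in Latin-1

word = "\u00c5sa".encode("latin-1")            # "Åsa"
assert all(is_letter[b] for b in word)
```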
> Furthermore, some of my algorithms rely on words consisting only of
> a small number of distinct characters (often around 30 when case and
> accents are not considered). When a character can be any Unicode
> character, this number becomes several thousand, if not more. In
> order for these algorithms to still be used, some sort of limit would
> need to be placed on the possible characters a word can contain. If I
> impose that limit, I might as well use some sort of 8-bit character
> set, which will automatically place the limit on what the characters
> can be.
> There is also the issue of how I should store the word lists in
> memory. As strings of 32-bit wide characters? That uses up four
> times more memory than 8-bit characters would, and for languages that fit
> within an 8-bit character set that is, in my view, a gross waste of
> memory. So maybe I should store them in some variable-width format such
> as UTF-8. Unfortunately, way, way too many of my algorithms will simply
> not work with variable-width characters without significant
> modification, which would very likely degrade performance. So the
> solution is to work with the characters as 32-bit wide characters and
> then convert them to a shorter representation when storing them in the
> lookup tables. But that can lead to inefficiency. I could also use
> 16-bit wide characters; however, they may not be large enough to hold
> all of future versions of Unicode, and they have the same problems.
> In response to the space wasted by storing word lists in some
> sort of wide format, someone asked:
> Since hard drives are cheaper and cheaper, you could store the
> dictionary in a usable (uncompressed) form and use it directly
> with memory mapping. Then the efficiency would directly depend on
> the disk caching method, and only the used parts of the
> dictionaries would really be loaded into memory. You would no longer
> have to load plain dictionaries into main memory; you'd just want
> to compute some indexes (or something like that) after mapping.
> However, the fact of the matter is that most of the dictionary will
> be read into memory anyway, if the memory is available. If it is not
> available, then there will be a good deal of disk swapping. Making
> characters 32 bits wide increases the chance of more disk swaps. So the
> bottom line is that it will be cheaper to convert the characters from
> something like UTF-8 into some sort of wide character. I could also use
> some sort of on-disk lookup table such as the Berkeley Database;
> however, this would *definitely* degrade performance.
> The bottom line is that keeping Aspell 8-bit internally is a very
> well thought out decision that is not likely to change any time soon.
> Feel free to challenge me on it, but don't expect me to change my mind
> unless you can bring up some point that I have not thought of before,
> and quite possibly a patch that cleanly converts Aspell to Unicode
> internally without a serious performance loss OR serious memory usage
> increase.
> utf-8 mailing list
> utf-8 at freedesktop.org
More information about the utf-8 mailing list