[utf-8] Language Info Needed for GNU Aspell

Ely Levy elylevy at cs.huji.ac.il
Tue Mar 23 12:44:29 PST 2004


Well, for Hebrew I suggest taking a look at hspell
http://www.ivrix.org.il/projects/spell-checker/
which also has a GPLed Hebrew word list and some interesting features
for Semitic languages.

Ely Levy
System group
Hebrew University
Jerusalem Israel



On Tue, 23 Mar 2004, Kevin Atkinson wrote:

>
> [Please distribute this document as widely as possible.]
>
> GNU Aspell 0.60 should be able to support most of the World's Languages.
> This includes languages written in Arabic and other scripts
> not well supported by an existing 8-bit character set.  Eventually
> Aspell should be able to support any current language not based on the
> Chinese writing system.
>
> GNU Aspell is a spell checker designed to eventually replace Ispell.
> Its main feature is that it does a much better job of coming up with
> possible suggestions than just about any other spell checker out there
> for the English language, including Ispell and Microsoft Word.
> However, starting with Aspell 0.60 it should also be the only Free (as
> in Freedom) spell checker that can support most languages not written
> in the Latin or Cyrillic scripts.
>
> However I, the author of Aspell, know very little about foreign
> languages (i.e. non-English) and what it takes to correctly spell check
> them.  Thus, I need other people to educate me.
>
> If you speak a foreign language I would appreciate it if you would take
> the time to look over the following material and email me with any
> additional information you may have.
>
> The first part gives a thorough analysis of the languages which Aspell
> can and cannot support.  If you find any of this information is
> incorrect please inform me at kevina at gnu.org.
>
> When Aspell 0.60 is released I would like to have dictionaries
> available for as many languages as possible.
>
> Therefore, if you know of a Free word list available for a language that
> is not currently listed as having a dictionary available I would
> appreciate hearing from you.  I am especially interested in working
> with someone to add support for languages written in the Arabic
> script.  The encoding of Arabic is quite complicated and I want to
> be sure that Aspell can correctly handle it.
>
> I would also appreciate some help converting Ispell dictionaries to
> Aspell.  So, if you would like to help convert some of the dictionaries
> listed as being available for Ispell please contact me.
>
> The second part lists language-related issues involved in correctly
> spell checking a document.  If you can offer any additional insight on
> any of the issues discussed, or know of any additional complications
> when spell checking a given language, I would appreciate hearing from
> you.
>
> The last part discusses why Aspell uses 8-bit characters internally
> for your reading pleasure.
>
> All of this material is also included in the Aspell 0.60 manual which
> you can find at http://aspell.net/devel-doc/man.
>
>
> Languages Which Aspell can Support
> **********************************
>
> Even though Aspell will remain 8-bit internally it should still be
> able to support any written language not based on a logographic
> script.  The only logographic writing systems in current use are those
> based on hànzi, which include Chinese, Japanese, and sometimes Korean.
>
> Supported
> =========
>
> Aspell 0.60 should be able to support the following languages as, to the
> best of my knowledge, they all contain 220 or fewer symbols and can
> thus fit within an 8-bit character set.  If an existing character set
> does not exist then a new one can be invented.  This is true even if the
> script is not yet supported by Unicode as the private use area can be
> used.
>
> Code   Language Name             Script            Dictionary   Gettext
>                                                    Available    Translation
>
> aa     Afar                      Latin             -            -
> ab     Abkhazian                 Cyrillic          -            -
> ae     Avestan                   Avestan           -            -
> af     Afrikaans                 Latin             Yes          -
> ak     Akan                      Latin             -            -
> an     Aragonese                 Latin             -            -
> ar     Arabic                    Arabic            -            -
> as     Assamese                  Bengali           -            -
> av     Avar                      Cyrillic          -            -
> ay     Aymara                    Latin             -            -
> az     Azerbaijani               Cyrillic          -            -
> az                               Latin             -            -
>
> ba     Bashkir                   Cyrillic          -            -
> be     Belarusian                Cyrillic          Planned      Yes
> bg     Bulgarian                 Cyrillic          Yes          -
> bh     Bihari                    Devanagari        -            -
> bi     Bislama                   Latin             -            -
> bm     Bambara                   Latin             -            -
> bn     Bengali                   Bengali           Planned      -
> bo     Tibetan                   Tibetan           -            -
> br     Breton                    Latin             Yes          -
> bs     Bosnian                   Latin             -            -
>
> ca     Catalan/Valencian         Latin             Yes          -
> ce     Chechen                   Cyrillic          -            -
> ch     Chamorro                  Latin             -            -
> co     Corsican                  Latin             -            -
> cr     Cree                      Latin             -            -
> cs     Czech                     Latin             Yes          Yes
> cu     Old Slavonic              Cyrillic          -            -
> cv     Chuvash                   Cyrillic          -            -
> cy     Welsh                     Latin             Yes          -
>
> da     Danish                    Latin             Yes          -
> de     German                    Latin             Yes          Yes
> dv     Divehi                    Thaana            -            -
> dz     Dzongkha                  Tibetan           -            -
>
> ee     Ewe                       Latin             -            -
> el     Greek                     Greek             Yes          -
> en     English                   Latin             Yes          Yes
> eo     Esperanto                 Latin             Yes          -
> es     Spanish                   Latin             Yes          Incomplete
> et     Estonian                  Latin             Planned      -
> eu     Basque                    Latin             -            -
>
> fa     Persian                   Arabic            -            -
> ff     Fulah                     Latin             -            -
> fi     Finnish                   Latin             Planned      -
> fj     Fijian                    Latin             -            -
> fo     Faroese                   Latin             Yes          -
> fr     French                    Latin             Yes          Yes
> fy     Frisian                   Latin             -            -
>
> ga     Irish                     Latin             Yes          Yes
> gd     Scottish Gaelic           Latin             Planned      -
> gl     Gallegan                  Latin             Yes          -
> gn     Guarani                   Latin             -            -
> gu     Gujarati                  Gujarati          -            -
> gv     Manx                      Latin             Planned      -
>
> ha     Hausa                     Latin             -            -
> he     Hebrew                    Hebrew            Planned      -
> hi     Hindi                     Devanagari        -            -
> ho     Hiri Motu                 Latin             -            -
> hr     Croatian                  Latin             Yes          -
> ht     Haitian Creole            Latin             -            -
> hu     Hungarian                 Latin             Planned      -
> hy     Armenian                  Armenian          -            -
> hz     Herero                    Latin             -            -
>
> ia     Interlingua (IALA)        Latin             Yes          -
> id     Indonesian                Latin             Yes          -
> ie     Interlingue               Latin             -            -
> ig     Igbo                      Latin             -            -
> ik     Inupiaq                   Latin             -            -
> io     Ido                       Latin             -            -
> is     Icelandic                 Latin             Yes          -
> it     Italian                   Latin             Yes          -
> iu     Inuktitut                 Latin             -            -
>
> jv     Javanese                  Javanese          -            -
> jv                               Latin             -            -
>
> ka     Georgian                  Georgian          -            -
> kg     Kongo                     Latin             -            -
> ki     Kikuyu/Gikuyu             Latin             -            -
> kj     Kwanyama                  Latin             -            -
> kk     Kazakh                    Cyrillic          -            -
> kl     Kalaallisut/Greenlandic   Latin             -            -
> kn     Kannada                   Kannada           -            -
> ko     Korean                    Hangeul           -            -
> kr     Kanuri                    Latin             -            -
> ks     Kashmiri                  Arabic            -            -
> ks                               Devanagari        -            -
> ku     Kurdish                   Arabic            -            -
> ku                               Cyrillic          -            -
> ku                               Latin             -            -
> kv     Komi                      Cyrillic          -            -
> kw     Cornish                   Latin             -            -
> ky     Kirghiz                   Cyrillic          -            -
> ky                               Latin             -            -
>
> la     Latin                     Latin             -            -
> lb     Luxembourgish             Latin             Planned      -
> lg     Ganda                     Latin             -            -
> li     Limburgan                 Latin             -            -
> ln     Lingala                   Latin             -            -
> lo     Lao                       Lao               -            -
> lt     Lithuanian                Latin             Planned      -
> lu     Luba-Katanga              Latin             -            -
> lv     Latvian                   Latin             -            -
>
> mg     Malagasy                  Latin             -            -
> mh     Marshallese               Latin             -            -
> mi     Maori                     Latin             Yes          -
> mk     Makasar                   Lontara/Makasar   -            -
> ml     Malayalam                 Latin             -            -
> ml                               Malayalam         -            -
> mn     Mongolian                 Cyrillic          -            -
> mn                               Mongolian         -            -
> mo     Moldavian                 Cyrillic          -            -
> mr     Marathi                   Devanagari        -            -
> ms     Malay                     Latin             Yes          -
> mt     Maltese                   Latin             Planned      -
> my     Burmese                   Myanmar           -            -
>
> na     Nauruan                   Latin             -            -
> nb     Norwegian Bokmal          Latin             Yes          -
> nd     North Ndebele             Latin             -            -
> ne     Nepali                    Devanagari        -            -
> ng     Ndonga                    Latin             -            -
> nl     Dutch                     Latin             Yes          Yes
> nn     Norwegian Nynorsk         Latin             Yes          -
> nr     South Ndebele             Latin             -            -
> nv     Navajo                    Latin             -            -
> ny     Nyanja                    Latin             -            -
>
> oc     Occitan/Provencal         Latin             -            -
> or     Oriya                     Oriya             -            -
> os     Ossetic                   Cyrillic          -            -
>
> pa     Punjabi                   Gurmukhi          -            -
> pi     Pali                      Devanagari        -            -
> pi                               Sinhala           -            -
> pl     Polish                    Latin             Yes          -
> ps     Pushto                    Arabic            -            -
> pt     Portuguese                Latin             Yes          Yes
>
> qu     Quechua                   Latin             -            -
>
> rm     Raeto-Romance             Latin             -            -
> rn     Rundi                     Latin             -            -
> ro     Romanian                  Latin             Yes          Yes
> ru     Russian                   Cyrillic          Yes          Yes
> rw     Kinyarwanda               Latin             -            -
>
> sa     Sanskrit                  Devanagari        -            -
> sc     Sardinian                 Latin             -            -
> sd     Sindhi                    Arabic            -            -
> se     Northern Sami             Latin             -            -
> sg     Sango                     Latin             -            -
> si     Sinhalese                 Sinhala           -            -
> sk     Slovak                    Latin             Yes          -
> sl     Slovenian                 Latin             Yes          -
> sm     Samoan                    Latin             -            -
> sn     Shona                     Latin             -            -
> so     Somali                    Latin             -            -
> sq     Albanian                  Latin             Planned      -
> sr     Serbian                   Cyrillic          -            Yes
> sr                               Latin             -            -
> ss     Swati                     Latin             -            -
> st     Southern Sotho            Latin             -            -
> su     Sundanese                 Latin             -            -
> sv     Swedish                   Latin             Yes          -
> sw     Swahili                   Latin             Planned      -
>
> ta     Tamil                     Tamil             Planned      -
> te     Telugu                    Telugu            -            -
> tg     Tajik                     Latin             -            -
> tk     Turkmen                   Latin             -            -
> tl     Tagalog                   Latin             -            -
> tl                               Tagalog           -            -
> tn     Tswana                    Latin             -            -
> to     Tonga                     Latin             -            -
> tr     Turkish                   Latin             -            -
> ts     Tsonga                    Latin             -            -
> tt     Tatar                     Cyrillic          -            -
> tw     Twi                       Latin             -            -
> ty     Tahitian                  Latin             -            -
>
> ug     Uighur                    Arabic            -            -
> ug                               Cyrillic          -            -
> ug                               Latin             -            -
> uk     Ukrainian                 Cyrillic          Yes          -
> ur     Urdu                      Arabic            -            -
> uz     Uzbek                     Cyrillic          -            -
> uz                               Latin             -            -
>
> ve     Venda                     Latin             -            -
> vi     Vietnamese                Latin             -            -
> vo     Volapuk                   Latin             -            -
>
> wa     Walloon                   Latin             Planned      Incomplete
> wo     Wolof                     Latin             -            -
>
> xh     Xhosa                     Latin             -            -
>
> yi     Yiddish                   Hebrew            -            -
> yo     Yoruba                    Latin             -            -
>
> za     Zhuang                    Latin             -            -
> zu     Zulu                      Latin             Planned      -
>
> Notes on Latin Languages
> ------------------------
>
> Any word that can be written using one of the Latin ISO-8859 character
> sets (ISO-8859-1,2,3,4,9,10,13,14,15,16) can be written, in decomposed
> form, using the ASCII characters, the 23 additional letters:
>
>      U+00C6 LATIN CAPITAL LETTER AE
>      U+00D0 LATIN CAPITAL LETTER ETH
>      U+00D8 LATIN CAPITAL LETTER O WITH STROKE
>      U+00DE LATIN CAPITAL LETTER THORN
>      U+00FE LATIN SMALL LETTER THORN
>      U+00DF LATIN SMALL LETTER SHARP S
>      U+00E6 LATIN SMALL LETTER AE
>      U+00F0 LATIN SMALL LETTER ETH
>      U+00F8 LATIN SMALL LETTER O WITH STROKE
>      U+0110 LATIN CAPITAL LETTER D WITH STROKE
>      U+0111 LATIN SMALL LETTER D WITH STROKE
>      U+0126 LATIN CAPITAL LETTER H WITH STROKE
>      U+0127 LATIN SMALL LETTER H WITH STROKE
>      U+0131 LATIN SMALL LETTER DOTLESS I
>      U+0138 LATIN SMALL LETTER KRA
>      U+0141 LATIN CAPITAL LETTER L WITH STROKE
>      U+0142 LATIN SMALL LETTER L WITH STROKE
>      U+014A LATIN CAPITAL LETTER ENG
>      U+014B LATIN SMALL LETTER ENG
>      U+0152 LATIN CAPITAL LIGATURE OE
>      U+0153 LATIN SMALL LIGATURE OE
>      U+0166 LATIN CAPITAL LETTER T WITH STROKE
>      U+0167 LATIN SMALL LETTER T WITH STROKE
>
>    and the 14 modifiers:
>
>      U+0300 COMBINING GRAVE ACCENT
>      U+0301 COMBINING ACUTE ACCENT
>      U+0302 COMBINING CIRCUMFLEX ACCENT
>      U+0303 COMBINING TILDE
>      U+0304 COMBINING MACRON
>      U+0306 COMBINING BREVE
>      U+0307 COMBINING DOT ABOVE
>      U+0308 COMBINING DIAERESIS
>      U+030A COMBINING RING ABOVE
>      U+030B COMBINING DOUBLE ACUTE ACCENT
>      U+030C COMBINING CARON
>      U+0326 COMBINING COMMA BELOW
>      U+0327 COMBINING CEDILLA
>      U+0328 COMBINING OGONEK
>
>    Which is a total of 37 additional Unicode code points.
>
>    All ISO-8859 character sets leave the characters 0x00 - 0x1F and 0x80 -
> 0x9F unmapped as they are generally used as control characters.  Of
> those, 0x02 - 0x1F and 0x80 - 0x9F may be mapped to anything in Aspell.
> This is a total of 62 characters which can be remapped in any ISO-8859
> character set.  Thus, by remapping 37 of the 62 characters to the
> previously specified Unicode code points, any modified ISO-8859 character
> set can be used for any Latin language covered by ISO-8859.  Of course
> decomposing every single accented character wastes a lot of space, so
> only characters that cannot be represented in the precomposed form
> should be broken up.  By using this trick it is possible to store
> foreign words in the correctly accented form in the dictionary even if
> the precomposed character is not in the current character set.
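
The slot arithmetic in this paragraph can be checked with a short sketch. It assumes the standard control ranges 0x00-0x1F and 0x80-0x9F, keeps 0x00 and 0x01 reserved, and hands out the free slots in order. The names `FREE_SLOTS` and `build_remap` are made up for illustration and are not part of Aspell.

```python
# The 23 additional letters and 14 combining modifiers listed above.
EXTRA_LETTERS = [
    0x00C6, 0x00D0, 0x00D8, 0x00DE, 0x00FE, 0x00DF, 0x00E6, 0x00F0,
    0x00F8, 0x0110, 0x0111, 0x0126, 0x0127, 0x0131, 0x0138, 0x0141,
    0x0142, 0x014A, 0x014B, 0x0152, 0x0153, 0x0166, 0x0167,
]
MODIFIERS = [
    0x0300, 0x0301, 0x0302, 0x0303, 0x0304, 0x0306, 0x0307,
    0x0308, 0x030A, 0x030B, 0x030C, 0x0326, 0x0327, 0x0328,
]

# Control positions that may be reused: 0x02-0x1F (30) and 0x80-0x9F (32).
FREE_SLOTS = list(range(0x02, 0x20)) + list(range(0x80, 0xA0))


def build_remap(extra=None):
    """Assign each extra code point to the next free 8-bit slot (hypothetical)."""
    if extra is None:
        extra = EXTRA_LETTERS + MODIFIERS
    if len(extra) > len(FREE_SLOTS):
        raise ValueError("not enough free control slots")
    return dict(zip(FREE_SLOTS, extra))


REMAP = build_remap()
print(len(FREE_SLOTS), len(REMAP))
```

With these assumptions 37 of the 62 free slots are used, which leaves 25 slots spare in every modified ISO-8859 character set.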
>
>    Any letter in the Unicode range U+0000 - U+024F, U+1E00 - U+1EFF
> (Basic Latin, Latin-1 Supplement, Latin Extended-A, Latin Extended-B,
> and Latin Extended Additional) can be represented using around 175
> basic letters and 25 modifiers, which is less than 220 and can thus fit
> in an Aspell 8-bit character set.  Since this Unicode range covers any
> possible Latin language this special character set can be used to
> represent any word written using the Latin script if so desired.
>
> Hangeul
> -------
>
> Korean is generally written in hangeul or a mixture of hanja and hangeul.
> Aspell should be able to spell check the hangeul part of the writing.
> In hangeul, individual letters, known as jamo, are grouped
> together in syllable blocks.  Unicode provides code points for both jamo
> and the combined syllable block.  The syllable blocks will need to be
> decomposed into jamo in order for Aspell to spell check them.
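
One way to see the decomposition is with Unicode canonical decomposition (NFD), which splits a precomposed syllable block into conjoining jamo. This is only a sketch using Python's standard `unicodedata` module, not how Aspell does it; the helper names are made up.

```python
import unicodedata


def to_jamo(text):
    """Split precomposed hangeul syllable blocks into conjoining jamo."""
    return unicodedata.normalize("NFD", text)


def to_syllables(jamo):
    """Recombine conjoining jamo into precomposed syllable blocks."""
    return unicodedata.normalize("NFC", jamo)


word = "\ud55c\uae00"  # the word "hangeul" written as two syllable blocks
jamo = to_jamo(word)
print([f"U+{ord(c):04X}" for c in jamo])
print(to_syllables(jamo) == word)
```

The two syllable blocks become six jamo, so a dictionary stored in jamo form needs only the few dozen jamo symbols rather than thousands of precomposed blocks.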
>
> Syllabic
> ========
>
> Syllabic languages use a separate symbol for each syllable of the
> language.  Since most of them have more than 220 distinct characters
> Aspell cannot support them as is.  However, all hope is not lost as
> Aspell will most likely be able to support them in the future.
>
> Code   Language Name   Script
> am     Amharic         Ethiopic
> cr     Cree            Canadian Syllabics
> ii     Sichuan Yi      Yi
> iu     Inuktitut       Canadian Syllabics
> oj     Ojibwa          Ojibwe
> om     Oromo           Ethiopic
> ti     Tigrinya        Ethiopic
>
> The Ethiopic Syllabary
> ----------------------
>
> Even though the Ethiopic script has more than 220 distinct characters
> with a little work Aspell can still handle it.  The idea is to split
> each character into two parts based on the matrix representation.  The
> first 3 bits will be the first part and could be mapped to `10000???'.
> The next 6 bits will be the second part and could be mapped to
> `11??????'.  The combined character will then be mapped with the upper
> bits coming first.  Thus each Ethiopic syllable will have the form
> `11?????? 10000???'.  By mapping the first and second parts to separate
> 8-bit characters it is easy to tell which part represents the consonant
> and which part represents the vowel of the syllable.  This encoding of
> the syllabary is far more useful to Aspell than if it were stored in
> UTF-8 or UTF-16.  In fact, the existing suggestion strategy of Aspell
> will work well with this encoding without any additional
> modifications.  However, additional improvements may be possible by
> taking advantage of the consonant-vowel structure of this encoding.
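
The split described above might look roughly like this. The sketch assumes the offset is taken from the start of the Unicode Ethiopic block (U+1200) and that the low 3 bits select the vowel column; neither detail is stated in the text, and the function names are invented for illustration.

```python
ETHIOPIC_BASE = 0x1200   # assumption: first code point of the Ethiopic block
ETHIOPIC_SIZE = 0x180    # assumption: U+1200 - U+137F


def split_syllable(ch):
    """Return the two-byte form `11?????? 10000???' for one Ethiopic character."""
    offset = ord(ch) - ETHIOPIC_BASE
    if not 0 <= offset < ETHIOPIC_SIZE:
        raise ValueError("not in the Ethiopic block")
    consonant = 0xC0 | (offset >> 3)    # upper 6 bits -> 11??????
    vowel = 0x80 | (offset & 0b111)     # lower 3 bits -> 10000???
    return bytes([consonant, vowel])


def join_syllable(pair):
    """Inverse of split_syllable."""
    consonant, vowel = pair
    offset = ((consonant & 0x3F) << 3) | (vowel & 0b111)
    return chr(ETHIOPIC_BASE + offset)


print(split_syllable("\u1209").hex())
```

Two syllables that share a consonant then differ only in their second byte, which is the property an edit-distance based suggestion strategy can exploit.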
>
>    In fact, the split consonant-vowel representation may prove to be so
> useful that it may be beneficial to encode other syllabaries in this
> fashion, even if they have fewer than 220 symbols.
>
>    The code to break up a syllabary into the consonant-vowel parts does
> not exist as of Aspell 0.60.  However, it will be fairly easy to add
> it as part of the Unicode normalization process once that is written.
>
> The Yi Syllabary
> ----------------
>
> A very large syllabary with 819 distinct symbols.  However, like
> Ethiopic, it should be possible to support this script by breaking it
> up.
>
> The Unified Canadian Aboriginal Syllabics
> -----------------------------------------
>
> Another very large syllabary.
>
> The Ojibwe Syllabary
> --------------------
>
> With only 120 distinct symbols, Aspell can actually support this one as
> is.  However, as previously mentioned, it may be beneficial to break it
> up into the consonant-vowel representation anyway.
>
> Unsupported
> ===========
>
> These languages, when written in the given script, are currently
> unsupported by Aspell for one reason or another.
>
> Code   Language Name   Script
> ja     Japanese        Japanese
> km     Khmer           Khmer
> ko     Korean          Hanja + Hangeul
> pi     Pali            Thai
> th     Thai            Thai
> zh     Chinese         Hànzi
>
> The Thai and Khmer Scripts
> --------------------------
>
> The Thai and Khmer scripts present a different problem for Aspell.  The
> problem is not that there are more than 220 unique symbols, but that
> there are no spaces between words.  This means that there is no easy way
> to split a sentence into individual words.  However, it is still
> possible to spell check these scripts, it is just a lot more difficult.
> I will be happy to work with someone who is interested in adding Thai
> or Khmer support to Aspell, but it is not likely something I will do in
> the foreseeable future.
>
> Languages which use Hànzi Characters
> ------------------------------------
>
> Hànzi Characters are used to write Chinese, Japanese, Korean, and were
> once used to write Vietnamese.  Each hànzi character represents a
> syllable of a spoken word and also has a meaning.  Since there are
> around 3,000 of them in common usage it is unlikely that Aspell will
> ever be able to support spell checking languages written using hànzi.
> However, I am not even sure if these languages need spell checking since
> hànzi characters are generally not entered in directly.  Furthermore
> even if Aspell could spell check hànzi the existing suggestion strategy
> will not work well at all, and thus a completely new strategy will need
> to be developed.
>
> Japanese
> --------
>
> Modern Japanese is written in a mixture of "hiragana", "katakana",
> "kanji", and sometimes "romaji".  "Hiragana" and "katakana" are both
> syllabaries unique to Japan, "kanji" is a modified form of hànzi, and
> "romaji" uses the Latin alphabet.  With some work, Aspell should be
> able to check the non-kanji part of Japanese text.  However, based on
> my limited understanding of Japanese, hiragana is often used at the end
> of kanji.  Thus if Aspell were to simply separate out the hiragana from
> kanji it would end up with a lot of word endings which are not proper
> words and will thus be flagged as misspellings.  However, this can be
> fairly easily rectified as text is tokenized into words before it is
> converted into Aspell's internal encoding.  In fact, some Japanese text
> is written entirely in one script.  For example books for children
> and foreigners are sometimes written entirely in hiragana.  Thus,
> Aspell could prove at least somewhat useful for spell checking Japanese.
>
> Languages Written in Multiple Scripts
> =====================================
>
> Aspell should be able to check text written in the same language, but in
> multiple scripts, with some work.  If the number of unique symbols in
> both scripts is less than 220 then a special character set can be used
> to allow both scripts to be encoded in the same dictionary.  However
> this may not be the most efficient solution.  An alternate solution is
> to store each script in its own dictionary and allow Aspell to choose
> the correct dictionary based on which script the given word is written
> in.  Aspell currently does not support this mode of spell checking;
> however, it is something that I hope to eventually support.
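
A rough sketch of the per-script dictionary idea follows. It guesses the script from the first word of each character's Unicode name, which is a heuristic chosen for brevity and not anything Aspell provides; the dictionaries are toy sets and every name here is hypothetical.

```python
import unicodedata


def script_of(word):
    """Guess the script from the Unicode name of the first letter (heuristic)."""
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            return name.split()[0] if name else None
    return None


def check(word, dictionaries):
    """Look the word up in the dictionary that matches its script."""
    words = dictionaries.get(script_of(word))
    return words is not None and word.lower() in words


# Toy Serbian word lists, one per script (illustrative only).
serbian = {
    "CYRILLIC": {"\u0437\u0434\u0440\u0430\u0432\u043e"},
    "LATIN": {"zdravo"},
}
print(check("zdravo", serbian), check("\u0417\u0434\u0440\u0430\u0432\u043e", serbian))
```

Keeping one word list per script means each list can stay within its own 8-bit character set, at the cost of a script check on every lookup.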
>
> Notes on Planned Dictionaries
> =============================
>
> be   Belarusian     Ispell Dictionary available
> bn   Bengali        Unofficial Aspell Dictionary available
>                     `http://www.bengalinux.org/downloads/'
> et   Estonian       Ispell Dictionary available
> fi   Finnish        Ispell Dictionary available
> gd   Scottish       Ispell Dictionary available.
>      Gaelic         `http://packages.debian.org/unstable/text/igaelic'
> gv   Manx           Ispell Dictionary available.
>                     `http://packages.debian.org/unstable/text/imanx'
> he   Hebrew         Ispell Dictionary available
> hu   Hungarian      MySpell dictionary expanded to over 500 MB.  Will add
>                     once affix support is worked into the dictionary
>                     package system.
> lb   Luxembourgish  MySpell dictionary planned.
> lt   Lithuanian     MySpell dictionary expanded to over 500 MB.  Will add
>                     once affix support is worked into the dictionary
>                     package system.
> mt   Maltese        Unofficial Aspell Dictionary available, but broken
>                     link to source.
>                     `http://linux.org.mt/article/spellcheck'
> sq   Albanian       Ispell Dictionary available
> sw   Swahili        Available at
>                     `http://sourceforge.net/projects/translate'.  Official
>                     version coming soon.
> ta   Tamil          Word list available at
>                     `http://www.developer.thamizha.com/spellchecker/index.html'.
>                     Working with them to create an Aspell dictionary.
> wa   Walloon        Ispell Dictionary available
> zu   Zulu           Available at
>                     `http://sourceforge.net/projects/translate'.  Official
>                     version coming soon.
>
> References
> ==========
>
> The information in this chapter was gathered from numerous sources,
> including:
>
>    * ISO 639-2 Registration Authority,
>      `http://www.loc.gov/standards/iso639-2/'
>
>    * Languages and Scripts (Official Unicode Site),
>      `http://www.unicode.org/onlinedat/languages-scripts.html'
>
>    * Omniglot - a guide to written language, `http://www.omniglot.com/'
>
>    * Wikipedia - The Free Encyclopedia, `http://wikipedia.org/'
>
>    * Ethnologue - Languages of the World, `http://www.ethnologue.com/'
>
>    * World Languages - The Ultimate Language Store,
>      `http://www.worldlanguage.com/'
>
>    * South African Languages Web, `http://www.languages.web.za/'
>
>    * The Languages and Writing Systems of Africa (Global Advisor
>      Newsletter), `http://www.intersolinc.com/newsletters/africa.htm'
>
>
>    Special thanks go to Era Eriksson for helping me with the information in
> this chapter.
>
>
> Language Related Issues
> ***********************
>
> Here are some language related issues that a good spell checker needs to
> handle.  If you have any more information about any of these issues, or
> of a new issue not discussed here, please email me at <kevina at gnu.org>.
>
> German Sharp S
> ==============
>
> The German Sharp S or Eszett does not have an uppercase equivalent.
> Instead, `ß' is converted to `SS' when uppercased.  The conversion of
> `ß' to `SS' requires a special rule, and increases the length of a word,
> thus disallowing in-place case conversion.  Furthermore, my general rule of
> converting all words to lowercase before looking them up in the
> dictionary won't work because the conversion of `SS' to lowercase is
> ambiguous; it can be `ss' or `ß'.  I do plan on dealing with this
> eventually, however.
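
The ambiguity is easy to reproduce. The sketch below shows the length change on uppercasing and one possible workaround: generate every lowercase candidate in which an `ss' might have been a `ß' and look each one up. The workaround is an illustration only, not a description of what Aspell does, and the function names are made up.

```python
def lowercase_candidates(word):
    """All lowercase spellings where each `ss' may stand for `ß'."""
    def expand(s):
        if not s:
            return {""}
        out = {s[0] + rest for rest in expand(s[1:])}
        if s.startswith("ss"):
            out |= {"\u00df" + rest for rest in expand(s[2:])}
        return out
    return expand(word.lower())


def lookup(word, dictionary):
    """Accept the word if any lowercase candidate is in the dictionary."""
    return any(c in dictionary for c in lowercase_candidates(word))


print("stra\u00dfe".upper())            # one character longer than the input
print(sorted(lowercase_candidates("MASSE")))
```

Note that both candidates for "MASSE" are real German words with different meanings, so the uppercase form alone cannot say which one was intended.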
>
> Compound Words
> ==============
>
> In some languages, such as German, it is acceptable to string two words
> together, thus forming a compound word.  However, there are rules
> governing when this can be done.  Furthermore, it is not always sufficient to
> simply concatenate the two words.  For example, sometimes a letter is
> inserted between the two words.  I tried implementing support for
> compound words in Aspell but it was too limiting and no one used it.
> Before I try implementing it again I want to know all the issues
> involved.
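
To make the inserted-letter case concrete, here is a deliberately naive sketch that accepts a word if it splits into two dictionary words, optionally joined by a linking letter. The linker set `("", "s")` and the toy word list are assumptions for illustration; real compounding rules are more involved, which is exactly the open question above.

```python
def split_compound(word, dictionary, linkers=("", "s")):
    """Return (head, linker, tail) if the word is a two-part compound, else None."""
    w = word.lower()
    for i in range(1, len(w)):
        head, rest = w[:i], w[i:]
        if head not in dictionary:
            continue
        for link in linkers:
            tail = rest[len(link):]
            if rest.startswith(link) and tail in dictionary:
                return (head, link, tail)
    return None


words = {"arbeit", "zimmer", "haus", "t\u00fcr"}
print(split_compound("Arbeitszimmer", words))
print(split_compound("Haust\u00fcr", words))
```

A check this permissive would also accept many nonsense combinations, so it says nothing about when compounding is actually allowed.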
>
> Context Sensitive Spelling
> ==========================
>
> In some languages, such as Luxembourgish, the spelling of a word depends
> on which words surround it.  For example the letter `n' at the end
> of a word will disappear if it is followed by another word starting
> with a certain letter such as an `s'.  However, it can probably get
> more complicated than that.  I would like to know how complicated before
> I attempt to implement support for context sensitive spelling.
>
> Unicode Normalization
> =====================
>
> Because Unicode contains a large number of precomposed characters there
> are multiple ways a character can be represented.  For example the letter
> `å' can either be represented as
>
>      U+00E5 LATIN SMALL LETTER A WITH RING ABOVE
> or
>      U+0061 LATIN SMALL LETTER A + U+030A COMBINING RING ABOVE
>
>    By performing normalization first Aspell will only see one of these
> representations.  The exact form of normalization depends on the
> language.  Given the choice of
>
>   1. Precomposed character
>
>   2. Base letter + combining character(s)
>
>   3. Base letter only
>
> if the precomposed character is in the target character set then (1), if
> both the base and combining characters are present then (2), otherwise (3).
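The three-way choice above can be sketched with Python's standard `unicodedata` module. The function name `normalize_for` and the idea of passing the target character set as a plain Python set are assumptions made for illustration; this is not how Aspell itself is implemented.

```python
import unicodedata


def normalize_for(text, charset):
    """Pick, per character, the best representation available in `charset`.

    Hypothetical helper: `charset` is a set of the characters the target
    8-bit character set can represent.
    """
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if ch in charset:
            out.append(ch)                      # (1) precomposed character
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        if all(c in charset for c in decomposed):
            out.append(decomposed)              # (2) base + combining mark(s)
        else:
            # (3) base letter only: drop every combining mark
            out.append("".join(c for c in decomposed
                               if unicodedata.combining(c) == 0))
    return "".join(out)


a_ring = "a\u030a"  # "a" followed by COMBINING RING ABOVE
print(normalize_for(a_ring, {"\u00e5"}) == "\u00e5")
print(normalize_for(a_ring, {"a", "\u030a"}) == "a\u030a")
print(normalize_for(a_ring, {"a"}) == "a")
```

Whatever form the input arrives in, the checker then sees only one representation per character.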
>
> Words With Spaces or other Symbols in Them
> ==========================================
>
> Many languages, including English, have words with non-letter symbols in
> them, for example the apostrophe.  These symbols generally appear in
> the middle of a word, but they can also appear at the end, such as in an
> abbreviation.  If a symbol can _only_ appear as part of a word then
> Aspell can treat it as if it were a letter.
>
>    However, the problem is most of these symbols have other uses.  For
> example, the apostrophe is often used as a single quote and the
> abbreviation marker is also used as a period.  Thus, Aspell cannot
> blindly treat them as if they were letters.
>
>    Aspell currently handles the case where the symbol can only appear in
> the middle of the word fairly well.  It simply assumes that if there is
> a letter both before and after the symbol then it is part of the word.
> This works most of the time but it is not foolproof.  For example,
> suppose the user forgot to leave a space after the period:
>
>        ... and the dog went up the tree.Then the cat ...
>
> Aspell would think "tree.Then" is one word.  A better solution might be
> to then try to check "tree" and "Then" separately.  But what if one of
> them is not in the dictionary?  Should Aspell assume "tree.Then" is one
> word?
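The "letter on both sides" rule and the proposed fallback can be sketched in Python as follows. The regular expression, the function names, and the restriction to ASCII letters are all simplifying assumptions for illustration, not Aspell's actual tokenizer.

```python
import re

# A word is a run of letters, optionally continued by a symbol (' or .)
# that has a letter on both sides. ASCII-only to keep the sketch short.
WORD = re.compile(r"[A-Za-z]+(?:['.][A-Za-z]+)*")


def tokenize(text):
    """Split text into candidate words using the middle-of-word rule."""
    return WORD.findall(text)


def check(token, dictionary):
    """Check the whole token first, then fall back to its separate parts."""
    if token.lower() in dictionary:
        return "ok"
    parts = re.split(r"['.]", token)
    if len(parts) > 1 and all(p.lower() in dictionary for p in parts):
        return "ok-as-separate-words"
    return "unknown"


dictionary = {"the", "dog", "went", "up", "tree", "then", "cat", "and"}
for tok in tokenize("and the dog went up the tree.Then the cat"):
    print(tok, check(tok, dictionary))
```

The sketch deliberately leaves the open question unanswered: when one of the parts is unknown, it just reports "unknown" for the whole token.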
>
>    The case where the symbol can appear at the beginning or end of the
> word is more difficult to deal with.  The symbol may or may not
> actually be part of the word.  Aspell currently handles this case by
> first trying to spell check the word with the symbol and, if that fails,
> trying it without.  The problem is, if the word is misspelled, should
> Aspell assume the symbol belongs with the word or not?  Currently
> Aspell assumes it does, which is not always the correct thing to do.
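A minimal Python sketch of the "with the symbol first, then without" strategy described above. The function name and the default symbol set are hypothetical; only the order of the two lookups and the fallback of keeping the symbol are taken from the text.

```python
def check_with_edge_symbol(token, dictionary, symbols="'."):
    """Return (form_used, is_known) for a token that may carry edge symbols."""
    if token in dictionary:
        return token, True          # e.g. an abbreviation stored as "etc."
    stripped = token.strip(symbols)
    if stripped in dictionary:
        return stripped, True       # the symbol was punctuation, not spelling
    # Neither form is known.  As the text says, the symbol is then assumed
    # to belong to the word, which is not always the right guess.
    return token, False


dictionary = {"etc.", "tree"}
for tok in ("etc.", "tree.", "tre."):
    print(tok, check_with_edge_symbol(tok, dictionary))
```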
>
>    Numbers in words present a different challenge to Aspell.  If Aspell
> treats numbers as letters then every possible number a user might write
> in a document must be specified in the dictionary.  This could
> easily be solved by having special code to assume all numbers are
> correctly spelled.  But what about something like "4th"?  Since the
> "th" suffix can appear after any number we are left with the same
> problem.  The solution would be to have a special symbol for "any
> number".
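The "any number" symbol could be sketched like this in Python. The placeholder character `#` and the function names are assumptions chosen for illustration; the text does not say which symbol Aspell would use.

```python
import re

ANY_NUMBER = "#"  # hypothetical placeholder meaning "any run of digits"


def number_key(word):
    """Collapse every run of digits into the single placeholder symbol."""
    return re.sub(r"[0-9]+", ANY_NUMBER, word)


def is_known(word, dictionary):
    """Look the word up after replacing its digits with the placeholder."""
    return number_key(word) in dictionary


# One entry per pattern instead of one entry per possible number.
dictionary = {"#", "#th"}
for w in ("4th", "1984", "27th", "4thh"):
    print(w, number_key(w), is_known(w, dictionary))
```

With this scheme the dictionary needs only the entry `#th` to accept "4th", "27th", and so on.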
>
>    Words with spaces in them, such as foreign phrases, are even more
> trouble to deal with.  The basic problem is that when tokenizing a
> string there is no good way to keep phrases together.  One solution is to
> use trial and error.  If a word is not in the dictionary try grouping it
> with the previous or next word and see if the combined word is in the
> dictionary.  But what if the combined word is not?  Should the misspelled
> word be grouped when looking for suggestions?  One solution is to also
> store each part of the phrase in the dictionary, but tag it as part of a
> phrase and not an independent word.
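The trial-and-error grouping can be sketched in Python as below. The function name, the result format, and the choice to try the next word before the previous one are assumptions; the text only says that both neighbours should be tried.

```python
def check_tokens(tokens, dictionary):
    """Return a list of (text, is_known), merging words into known phrases."""
    results = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in dictionary:
            results.append((tok, True))
            i += 1
            continue
        # Unknown word: try grouping it with the next word.
        if i + 1 < len(tokens) and f"{tok} {tokens[i + 1]}" in dictionary:
            results.append((f"{tok} {tokens[i + 1]}", True))
            i += 2
            continue
        # Then try grouping it with the previous word.
        if results and f"{results[-1][0]} {tok}" in dictionary:
            prev_text, _ = results.pop()
            results.append((f"{prev_text} {tok}", True))
            i += 1
            continue
        results.append((tok, False))
        i += 1
    return results


dictionary = {"the", "committee", "ad hoc", "vice", "vice versa"}
print(check_tokens(["the", "ad", "hoc", "committee"], dictionary))
print(check_tokens(["vice", "versa"], dictionary))
```

Note that this only works when the caller hands over neighbouring words, which is exactly the interface problem raised in the next paragraph.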
>
>    To further complicate things, most applications that use spell
> checkers are accustomed to parsing the document themselves and sending it
> to the spell checker a word at a time.  In order to support words with
> spaces in them a more complicated interface will be required.
>
>
> Notes on 8-bit Characters
> *************************
>
> There is a very good reason I use 8-bit characters in Aspell: speed and
> simplicity. While many parts of my code can fairly easily be
> converted to some sort of wide character, as my code is clean, other
> parts cannot be.
>
>    One of the reasons is that in many, many places I use a direct lookup
> to find out various information about characters. With 8-bit characters
> this is very feasible because there are only 256 of them. With 16-bit
> wide characters this will waste a LOT of space. With 32-bit characters
> this is just plain impossible. Converting the lookup tables to some
> other form, while certainly possible, will degrade performance
> significantly.
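To make the table-size argument concrete, here is a small Python sketch of a direct lookup table. Using ISO-8859-1 as the 8-bit character set and "is this byte a letter" as the stored property are assumptions picked for illustration; the names are hypothetical.

```python
# One flag per possible byte value: 256 entries, indexed directly.
# For ISO-8859-1 the byte value equals the Unicode code point, so chr(b)
# gives the character that byte stands for.
IS_LETTER = bytes(1 if chr(b).isalpha() else 0 for b in range(256))


def is_letter(byte_value):
    """Constant-time property lookup: a single array index, no searching."""
    return IS_LETTER[byte_value] == 1


def table_entries(bits):
    """Number of entries a direct lookup table needs for `bits`-wide characters."""
    return 2 ** bits


print(is_letter(ord("a")), is_letter(0xE5), is_letter(ord(".")))
for bits in (8, 16, 32):
    print(bits, table_entries(bits))
```

Every per-character property needs its own table of this size, which is why the jump from 256 entries to 65,536 or 4,294,967,296 matters.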
>
>    Furthermore, some of my algorithms rely on words consisting only of
> a small number of distinct characters (often around 30 when case and
> accents are not considered). When the possible characters can be
> any Unicode character this number becomes several thousand, if that. In
> order for these algorithms to still be used some sort of limit will
> need to be placed on the possible characters the word can contain. If I
> impose that limit, I might as well use some sort of 8-bit character
> set which will automatically place the limit on what the characters can
> be.
>
>    There is also the issue of how I should store the word lists in
> memory. As a string of 32-bit wide characters? That would use 4
> times more memory than 8-bit characters would, and for languages that
> can fit within an 8-bit character set that is, in my view, a gross
> waste of memory. So maybe I should store them in some variable width
> format such as UTF-8. Unfortunately, way, way too many of my algorithms
> will simply not work with variable width characters without significant
> modification which will very likely degrade performance. So the
> solution is to work with the characters as 32-bit wide characters and
> then convert them to a shorter representation when storing them in the
> lookup tables. Now that can lead to inefficiency. I could also use
> 16-bit wide characters; however, that may not be good enough to hold all
> of future versions of Unicode and it has the same problems.
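The storage trade-off can be checked with a few lines of Python. The sample word and the helper name `encoded_sizes` are arbitrary choices for illustration; ISO-8859-1 stands in for "an 8-bit character set".

```python
def encoded_sizes(word):
    """Bytes needed to store `word` in three candidate in-memory formats."""
    return {
        "latin-1": len(word.encode("latin-1")),    # fixed width, 1 byte/char
        "utf-8": len(word.encode("utf-8")),        # variable width
        "utf-32": len(word.encode("utf-32-le")),   # fixed width, 4 bytes/char
    }


word = "Straße"
print(len(word), encoded_sizes(word))

# With a variable-width format, byte offsets no longer match character
# positions, so code that indexes characters directly needs rework.
utf8 = word.encode("utf-8")
print(word.index("e"), utf8.index(b"e"))
```

The 32-bit form is exactly four times the size of the 8-bit form, while UTF-8 stays compact but loses the one-unit-per-character property the algorithms depend on.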
>
>    As a response to the space waste used by storing word lists in some
> sort of wide format someone asked:
>
>      Since hard drives are cheaper and cheaper, you could store
>      dictionary in a usable (uncompressed) form and use it directly
>      with memory mapping. Then the efficiency would directly depend on
>      the disk caching method, and only the used part of the
>      dictionaries would really be loaded into memory. You would no more
>      have to load plain dictionaries into main memory, you'll just want
>      to compute some indexes (or something like that) after mapping.
>
>    However, the fact of the matter is that most of the dictionary will
> be read into memory anyway if it is available. If it is not available
> then there would be a good deal of disk swaps. Making characters 32-bit
> wide will increase the chance of even more disk swaps. So the
> bottom line is that it will be cheaper to convert the characters from
> something like UTF-8 into some sort of wide character. I could also use
> some sort of disk-based lookup table such as the Berkeley Database.
> However this will *definitely* degrade performance.
>
>    The bottom line is that keeping Aspell 8-bit internally is a very
> well thought out decision that is not likely to change any time soon.
> Feel free to challenge me on it, but don't expect me to change my mind
> unless you can bring up some point that I have not thought of before
> and quite possibly a patch to cleanly convert Aspell to Unicode
> internally without a serious performance loss OR serious memory usage
> increase.
>
> --
> http://kevin.atkinson.dhs.org
>