[HarfBuzz] ANN: Wikipedia test data for testing HarfBuzz

Amir E. Aharoni amir.aharoni at mail.huji.ac.il
Wed Jul 4 00:08:19 PDT 2012


2012/7/4 Anish Patil <apatil at redhat.com>:
>>>There are 63 languages included.  Chinese and Japanese (zh and ja) are
>>>intentionally left out as they were too big / not so interesting.  Other than
>>>that, English is particularly large, as expected, and the rest vary in size,
>>>from a few thousand to tens of millions of unique words.
>
> For some of the Indian languages, Wikipedia words contain spelling mistakes; I hope that will not affect your work.
> The Marathi word list contains words like "अ‍ॅक्सेसदिनांक" and "अ‍ॅरिझोना", which are incorrect.

This is true of Wikipedia in all languages. A spelling may be incorrect
with regard to the standard set by a government or a language
academy, but it may be common in real life, so it is still useful for
statistics. It may also point to technical issues with fonts or
keyboards that make people write incorrectly: for example, the right
letter may not appear on the common keyboard layout, or a
transliteration input method may have bugs.

In the particular case of Marathi, I know somebody who is working on
improving the spelling in Wikipedia. I'll gladly connect you, if
you're interested.

--
Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
http://aharoni.wordpress.com
‪“We're living in pieces,
I want to live in peace.” – T. Moore‬


