[HarfBuzz] ANN: Wikipedia test data for testing HarfBuzz

Anish Patil apatil at redhat.com
Tue Jul 3 23:26:31 PDT 2012


Hi Behdad, 

>>As promised, here is the word-list data extracted from various language
>>Wikipedias, ready for public consumption.

Congratulations!

>>There are 63 languages included.  Chinese and Japanese (zh and ja) are
>>intentionally left out as they were too big / not so interesting.  Other than
>>that, English is particularly large, as expected, and the rest vary in size,
>>from a few thousand to tens of millions of unique words.

For some of the Indian languages, the Wikipedia words contain spelling mistakes; I hope that will not affect your work.
The Marathi word list, for example, contains words like "अ‍ॅक्सेसदिनांक" and "अ‍ॅरिझोना", which are incorrect.

Cheers,
Anish P. 


