[HarfBuzz] ANN: Wikipedia test data for testing HarfBuzz

Khaled Hosny khaledhosny at eglug.org
Wed Jul 4 09:17:37 PDT 2012


Very nice :)

I gave it a quick glance and noticed that (some?) punctuation marks
are treated as part of the word, e.g. "ويكيبيديا" and "ويكيبيديا،" are
counted as two different words.
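
One possible way to fold such variants together is to trim punctuation from both ends of each word before counting. This is only a minimal sketch, not the code that produced the data; the helper name `strip_punctuation` is hypothetical, and it assumes that Unicode general category P* is a good enough definition of "punctuation" for this purpose:

```python
import unicodedata


def strip_punctuation(word: str) -> str:
    """Trim leading and trailing characters whose Unicode category is P*.

    Punctuation inside the word (e.g. an apostrophe) is left alone.
    """
    start, end = 0, len(word)
    while start < end and unicodedata.category(word[start]).startswith("P"):
        start += 1
    while end > start and unicodedata.category(word[end - 1]).startswith("P"):
        end -= 1
    return word[start:end]


if __name__ == "__main__":
    # U+060C ARABIC COMMA has category Po, so both forms collapse to one.
    print(strip_punctuation("ويكيبيديا،") == strip_punctuation("ويكيبيديا"))
```

Only the edges are trimmed so that word-internal characters such as apostrophes or hyphens are not touched.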

Also, combining marks are not ignored.  At least in the case of Arabic,
this might affect the results greatly, since vowel marks are not applied
uniformly and the same word might appear several times with different
sets of vowel marks applied to it.
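
A rough sketch of how the counts for differently vowelled spellings could be merged follows. The names `strip_marks` and `merge_counts` are hypothetical, and the (word, count) pairs are an assumed in-memory form, since the layout of the frequency files is not described here. It drops every character of category Mn, which covers the Arabic vowel marks but would be wrong for scripts where nonspacing marks are essential to the word (many Indic scripts, for instance):

```python
import unicodedata
from collections import Counter


def strip_marks(word: str) -> str:
    """Drop nonspacing combining marks (Unicode category Mn).

    No normalization is applied first, so precomposed letters such as
    U+0623 (alef with hamza above) are kept as they are.
    """
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")


def merge_counts(pairs):
    """Sum the counts of words that are identical once marks are removed.

    `pairs` is an iterable of (word, count) tuples; this is an assumed
    input shape, not the format of the published files.
    """
    merged = Counter()
    for word, count in pairs:
        merged[strip_marks(word)] += count
    return merged


if __name__ == "__main__":
    vowelled = "\u0643\u064e\u062a\u064e\u0628\u064e"  # kaf, teh, beh with fatha
    bare = "\u0643\u062a\u0628"
    print(merge_counts([(vowelled, 3), (bare, 5)])[bare])
```

Whether stripping marks is desirable depends on the goal: for shaping tests the marked forms are interesting in their own right, so the merged list would complement the original one rather than replace it.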

Regards,
 Khaled

On Tue, Jul 03, 2012 at 01:28:20PM -0400, Behdad Esfahbod wrote:
> Hi,
> 
> As promised, here is the word-list data extracted from various language
> Wikipedias, ready for public consumption.
> 
> There are 63 languages included.  Chinese and Japanese (zh and ja) are
> intentionally left out as they were too big / not so interesting.  Other than
> that, English is particularly large, as expected, and the rest vary in size,
> from a few thousand to tens of millions of unique words.
> 
> Word frequency data is included in separate files.  The format is bare
> minimum, i.e. there is no format: one word per line, sorted by decreasing
> frequency.  Compressed with bzip2.
> 
> The canonical source of the data is here:
> 
>   http://code.google.com/p/harfbuzz-testing-wikipedia/downloads/list
> 
> With mirrors, including one big bzip2 file, here:
> 
>   http://www.freedesktop.org/software/harfbuzz/testing/texts/wikipedia/
>   http://fedorapeople.org/groups/harfbuzz-testing/texts/wikipedia/
> 
> The license of the data is CC-BY-SA, as is Wikipedia's.  I will publish the
> code generating these at some point.  Thanks to Roozbeh for extracting these.
> 
> Cheers,
> behdad


