[HarfBuzz] ANN: Wikipedia test data for testing HarfBuzz

Rolf Langenhuijzen rolf.langenhuijzen at xs4all.nl
Tue Jul 3 11:17:08 PDT 2012


Very nice!


On Jul 3, 2012, at 7:28 PM, Behdad Esfahbod wrote:

> Hi,
> 
> As promised, here is the word-list data extracted from various language
> Wikipedias, ready for public consumption.
> 
> There are 63 languages included.  Chinese and Japanese (zh and ja) are
> intentionally left out as they were too big / not so interesting.  Other than
> that, English is particularly large, as expected, and the rest vary in size,
> from a few thousand to tens of millions of unique words.
> 
> Word frequency data is included in separate files.  The format is bare
> minimum, i.e. there is no format: one word per line, sorted by decreasing
> frequency, and compressed with bzip2.
> 
> The canonical source of the data is here:
> 
>  http://code.google.com/p/harfbuzz-testing-wikipedia/downloads/list
> 
> With mirrors, including one big bzip2 file, here:
> 
>  http://www.freedesktop.org/software/harfbuzz/testing/texts/wikipedia/
>  http://fedorapeople.org/groups/harfbuzz-testing/texts/wikipedia/
> 
> License of the data is CC-BY-SA, as is Wikipedia.  I will publish the code
> generating these at some point.  Thanks Roozbeh for extracting these.
> 
> Cheers,
> behdad
> _______________________________________________
> HarfBuzz mailing list
> HarfBuzz at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/harfbuzz
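
For anyone who wants to peek at the lists without unpacking them first, here
is a minimal sketch in Python of reading the format described in the quoted
message (one word per line, most frequent first, bzip2-compressed).  The
function name top_words, the UTF-8 encoding, and the command-line usage are
my assumptions, not something the announcement specifies; the separate
frequency files are not handled here since their layout is not described.

```python
import bz2
import sys
from itertools import islice


def top_words(path, n=10, encoding="utf-8"):
    """Return the first n non-empty lines of a bzip2-compressed word list.

    Assumes one word per line, already sorted by decreasing frequency,
    and UTF-8 text (the encoding is an assumption, not stated in the post).
    """
    # bz2.open in text mode decompresses lazily, so large files are not
    # loaded into memory all at once.
    with bz2.open(path, mode="rt", encoding=encoding) as fh:
        words = (line.strip() for line in fh)
        non_empty = (w for w in words if w)
        return list(islice(non_empty, n))


if __name__ == "__main__":
    # Hypothetical usage: python top_words.py <some-list>.txt.bz2
    if len(sys.argv) > 1:
        for word in top_words(sys.argv[1], n=20):
            print(word)
```

Since the files are sorted by frequency, reading only the head of a list is
a cheap way to get the most common words of a language for a quick shaping
test.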



