[HarfBuzz] ANN: Wikipedia test data for testing HarfBuzz
Rolf Langenhuijzen
rolf.langenhuijzen at xs4all.nl
Tue Jul 3 11:17:08 PDT 2012
Very nice!
On Jul 3, 2012, at 7:28 PM, Behdad Esfahbod wrote:
> Hi,
>
> As promised, here is the word-list data extracted from various language
> Wikipedias, ready for public consumption.
>
> There are 63 languages included. Chinese and Japanese (zh and ja) are
> intentionally left out as they were too big / not so interesting. Other than
> that, English is particularly large, as expected, and the rest vary in size,
> from a few thousand to tens of millions of unique words.
>
> Word frequency data is included in separate files. The format is bare
> minimum, i.e. there is no format: one word per line, sorted by decreasing
> frequency. Bzip2ed.
>
> The canonical source of the data is here:
>
> http://code.google.com/p/harfbuzz-testing-wikipedia/downloads/list
>
> With mirrors, including one big bzip2 file, here:
>
> http://www.freedesktop.org/software/harfbuzz/testing/texts/wikipedia/
> http://fedorapeople.org/groups/harfbuzz-testing/texts/wikipedia/
>
> License of the data is CC-BY-SA, as is Wikipedia. I will publish the code
> generating these at some point. Thanks Roozbeh for extracting these.
>
> Cheers,
> behdad
> _______________________________________________
> HarfBuzz mailing list
> HarfBuzz at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/harfbuzz
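
For anyone who wants to feed these lists into a test harness, here is a minimal sketch in Python of reading one of the word-list files as described above (one word per line, sorted by decreasing frequency, bzip2-compressed). The UTF-8 encoding, the function names, and the file name in the usage note are assumptions for illustration, not something stated in the announcement. The separate frequency files are not handled, since their layout is not described.

```python
import bz2
from itertools import islice
from typing import Iterator, List


def iter_words(path: str, encoding: str = "utf-8") -> Iterator[str]:
    """Yield one word per line from a bzip2-compressed word list.

    Assumes the file is UTF-8 text; the announcement does not say so.
    """
    with bz2.open(path, mode="rt", encoding=encoding) as handle:
        for line in handle:
            word = line.rstrip("\n")
            if word:  # skip blank lines, if there are any
                yield word


def most_frequent(path: str, count: int) -> List[str]:
    """Return the first `count` words of the list.

    Because the list is sorted by decreasing frequency, these are the
    most frequent words. The file is streamed, so large lists are not
    loaded into memory in full.
    """
    return list(islice(iter_words(path), count))
```

Usage would look something like most_frequent("en.txt.bz2", 1000), where "en.txt.bz2" is a hypothetical local file name; the real names are whatever the download pages list.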