[HarfBuzz] ANN: Wikipedia test data for testing HarfBuzz
Adam Twardoch (List)
list.adam at twardoch.com
Tue Jul 3 11:08:48 PDT 2012
Brilliant!
I was writing code to do exactly the same thing.
Many thanks to you!
A.
On 12-07-03 19:28, Behdad Esfahbod wrote:
> Hi,
>
> As promised, here is the word-list data extracted from various language
> Wikipedias, ready for public consumption.
>
> There are 63 languages included. Chinese and Japanese (zh and ja) are
> intentionally left out as they were too big / not so interesting. Other than
> that, English is particularly large, as expected, and the rest vary in size,
> from a few thousand to tens of millions of unique words.
>
> Word frequency data is included in separate files. The format is bare
> minimum, i.e. there is no format: one word per line, sorted by decreasing
> frequency. Bzip2ed.
>
> The canonical source of the data is here:
>
> http://code.google.com/p/harfbuzz-testing-wikipedia/downloads/list
>
> With mirrors, including one big bzip2 file, here:
>
> http://www.freedesktop.org/software/harfbuzz/testing/texts/wikipedia/
> http://fedorapeople.org/groups/harfbuzz-testing/texts/wikipedia/
>
> License of the data is CC-BY-SA, as is Wikipedia. I will publish the code
> generating these at some point. Thanks Roozbeh for extracting these.
>
> Cheers,
> behdad
> _______________________________________________
> HarfBuzz mailing list
> HarfBuzz at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/harfbuzz
--
May success attend your efforts,
-- Adam Twardoch
(Remove "list." from e-mail address to contact me directly.)
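
For anyone who wants to consume the word lists described in the quoted
announcement (one word per line, sorted by decreasing frequency, bzip2ed),
here is a minimal Python sketch. It is not the code used to generate or
test the data. The function name read_words, the example file name, and
the UTF-8 encoding are assumptions of mine, not something the announcement
states.

```python
import bz2
from itertools import islice


def read_words(path, limit=None):
    """Yield words from a bzip2-compressed, one-word-per-line text file.

    Assumptions (not confirmed by the announcement): the file is UTF-8
    encoded, and blank lines carry no meaning and can be skipped.
    """
    # mode="rt" makes bz2.open decompress and decode to text on the fly,
    # so even a very large list is streamed rather than loaded whole.
    with bz2.open(path, mode="rt", encoding="utf-8") as f:
        stripped = (line.rstrip("\r\n") for line in f)
        words = (w for w in stripped if w)
        # limit=None means "no limit"; otherwise stop after `limit` words.
        yield from islice(words, limit)


# Hypothetical usage: since the lists are sorted by decreasing frequency,
# the first N lines are the N most frequent words.
#
#     for word in read_words("fa.txt.bz2", limit=1000):
#         ...  # e.g. hand the word to a shaper under test
```

Because the files are already sorted by frequency, taking a prefix is a
cheap way to get a representative sample without reading the separate
frequency files.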