[HarfBuzz] ANN: Wikipedia test data for testing HarfBuzz

Adam Twardoch (List) list.adam at twardoch.com
Tue Jul 3 11:08:48 PDT 2012


Brilliant!

I was writing code to do exactly the same thing.

Many thanks to you!

A.

On 12-07-03 19:28, Behdad Esfahbod wrote:
> Hi,
>
> As promised, here is the word-list data extracted from various language
> Wikipedias, ready for public consumption.
>
> There are 63 languages included.  Chinese and Japanese (zh and ja) are
> intentionally left out as they were too big / not so interesting.  Other than
> that, English is particularly large, as expected, and the rest vary in size,
> from a few thousand to tens of millions of unique words.
>
> Word frequency data is included in separate files.  The format is the bare
> minimum, i.e. there is no format: one word per line, sorted by decreasing
> frequency, compressed with bzip2.
>
> The canonical source of the data is here:
>
>   http://code.google.com/p/harfbuzz-testing-wikipedia/downloads/list
>
> With mirrors, including one big bzip2 file, here:
>
>   http://www.freedesktop.org/software/harfbuzz/testing/texts/wikipedia/
>   http://fedorapeople.org/groups/harfbuzz-testing/texts/wikipedia/
>
> License of the data is CC-BY-SA, as is Wikipedia.  I will publish the code
> generating these at some point.  Thanks Roozbeh for extracting these.
>
> Cheers,
> behdad
> _______________________________________________
> HarfBuzz mailing list
> HarfBuzz at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/harfbuzz
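
For anyone wanting to consume the word lists described above (one word per
line, bzip2-compressed), here is a minimal Python sketch. It is only an
illustration under stated assumptions: the file name in the usage note is
hypothetical, UTF-8 encoding is assumed rather than stated in the
announcement, and the helper names iter_words and first_words are made up
for this example.

```python
import bz2
from itertools import islice


def iter_words(path, encoding="utf-8"):
    """Yield words from a bzip2-compressed list with one word per line.

    The UTF-8 default is an assumption; the announcement does not name
    the encoding of the files.
    """
    # bz2.open in text mode decompresses and decodes on the fly, so the
    # whole file never has to fit in memory.
    with bz2.open(path, mode="rt", encoding=encoding) as handle:
        for line in handle:
            word = line.rstrip("\n")
            if word:  # skip empty lines, if any
                yield word


def first_words(path, count=10):
    """Return the first `count` words, i.e. the most frequent ones if the
    file is sorted by decreasing frequency as described."""
    return list(islice(iter_words(path), count))
```

With a downloaded list saved under a hypothetical name such as
"fa.words.bz2", first_words("fa.words.bz2", 20) would return the first 20
entries, which could then be passed to a shaping test one word at a time.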


-- 

May success attend your efforts,
-- Adam Twardoch
(Remove "list." from e-mail address to contact me directly.)