[HarfBuzz] ANN: Wikipedia test data for testing HarfBuzz

Behdad Esfahbod behdad at behdad.org
Tue Jul 3 10:28:20 PDT 2012


Hi,

As promised, here is the word-list data extracted from various language
Wikipedias, ready for public consumption.

There are 63 languages included.  Chinese and Japanese (zh and ja) are
intentionally left out as they were too big / not so interesting.  Other than
that, English is particularly large, as expected, and the rest vary in size,
from a few thousand to tens of millions of unique words.

Word frequency data is included in separate files.  The format is the bare
minimum, i.e. there is no format: one word per line, sorted by decreasing
frequency, bzip2-compressed.
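
For anyone wanting to feed these lists into a test harness, here is a rough
sketch of reading one of them in Python.  It assumes the files are UTF-8
encoded (not stated above) and the function name read_words and the file
name in the usage note are made up for illustration; check the actual file
names in the download list.

```python
import bz2
from itertools import islice


def read_words(path, limit=None):
    """Return up to `limit` words from a bzip2-compressed word list.

    Assumes one word per line and UTF-8 encoding (the encoding is an
    assumption, not something the announcement specifies).
    """
    with bz2.open(path, "rt", encoding="utf-8") as f:
        # Strip the trailing newline and skip any blank lines.
        words = (line.rstrip("\n") for line in f if line.strip())
        # The lists are sorted by decreasing frequency, so the first
        # `limit` entries are the most frequent words.
        return list(islice(words, limit))
```

Something like read_words("fa.txt.bz2", limit=1000) would then give the 1000
most frequent words of that (hypothetically named) file, which is usually
enough for a quick shaping smoke test.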

The canonical source of the data is here:

  http://code.google.com/p/harfbuzz-testing-wikipedia/downloads/list

With mirrors, including one big bzip2 file, here:

  http://www.freedesktop.org/software/harfbuzz/testing/texts/wikipedia/
  http://fedorapeople.org/groups/harfbuzz-testing/texts/wikipedia/

License of the data is CC-BY-SA, as is Wikipedia.  I will publish the code
generating these at some point.  Thanks Roozbeh for extracting these.

Cheers,
behdad


