[HarfBuzz] Hackfest report
behdad at behdad.org
Sun May 27 18:34:29 PDT 2012
Hello HarfBuzz lists,
I promised to write a short report about the hackfest earlier this month.
Here it is.
Jonathan Kew (Mozilla) and I met at the Google Zurich offices on May 9..11 for
the HarfBuzz Massala Hackfest. We got together for three 12+ hour days of
intense hacking on the new HarfBuzz Indic shaper, using Wikipedia word lists as
our test data.
We started with the Devanagari script, testing against Uniscribe (Windows 7's
implementation). Initially we were failing on 35% of the words in the list.
Three days, 86 commits, and dozens of cups of coffee later, we were down to 0.08%.
Out of the ~700,000 words, we disagree on only 560. Of those 560, many are
invalid or meaningless Devanagari sequences, not character combinations that
ever occur in correctly-spelled words. In these cases we are less concerned
about precisely matching Uniscribe's behavior.
We discovered a number of bugs or peculiarities in Uniscribe. We can do
better in some of those cases (and we do). But for testing purposes, we added
a "uniscribe-bug-compatibility" mode to the Indic shaper. The numbers above
were in that mode.
We then tested Gujarati, a script very similar to Devanagari. Failures were
at a surprisingly low 0.015%!
We then tested Bengali, a script slightly different from Devanagari. Failures
were at 20%,
which was expected since we were not implementing any features of Bengali not
present in Devanagari. Two changes later and Bengali was just under 3%.
We then tested Malayalam, a script with many features not present in
Devanagari. Failures were at 14%.
Last year I conjectured that if we had an extensive word list for each
script, we could test our shaper against Uniscribe and use the number of
failures, and the failing cases themselves, to guide us in perfecting our
shaper.
I asked the internationalization team at Google to create a word list per
language out of Wikipedia content, and I received those lists last month. In
this hackfest we put the idea to the test, and it worked out very well! From
now on, we can take one script at a time, look at the failures, hammer the
number down to sub-0.1%, and move on to the next.
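The comparison loop itself is conceptually simple. A minimal sketch, with stand-in shaping functions (the real harness drives HarfBuzz and Uniscribe and compares their glyph output; the names below are hypothetical):

```python
# Sketch of the word-list comparison harness. shape_a / shape_b stand in
# for the two shaping engines being compared (e.g. HarfBuzz vs Uniscribe).

def failure_rate(words, shape_a, shape_b):
    """Return (percentage of words whose outputs disagree, list of failing words)."""
    failures = [w for w in words if shape_a(w) != shape_b(w)]
    return 100.0 * len(failures) / len(words), failures

# Toy example: two pretend engines that disagree on one word out of four.
words = ["ka", "kka", "kra", "rka"]
engine_a = lambda w: w.upper()
engine_b = lambda w: w.upper() if w != "rka" else "RK_A"
rate, failing = failure_rate(words, engine_a, engine_b)
# rate == 25.0, failing == ["rka"]
```

The failing-word list is the interesting output: each entry is a concrete test case to debug against the reference implementation.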
Note that the percentages here are the fraction of words in the corpus that fail.
When we get sub 0.1%, the remaining failures are mostly obscure and unusual
sequences of letters that are not expected to occur in normal text.
If we take word frequency into account, the failure percentage is well below
this. I will calculate that number later, and set a goal of achieving a
0.0001% failure rate on the frequency-adjusted data. That would mean one
misrendering per million words of real-world text.
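Frequency-adjusting just means weighting each failing word by how often it occurs in the corpus, rather than counting each distinct word once. A minimal sketch with made-up counts:

```python
def weighted_failure_rate(freq, failing):
    """Failure rate weighted by word frequency: failing occurrences as a
    percentage of all word occurrences in the corpus."""
    total = sum(freq.values())
    bad = sum(freq[w] for w in failing)
    return 100.0 * bad / total

# Made-up counts: the one failing word is rare, so the weighted rate is
# far below the unweighted 1-in-3 distinct-word rate.
freq = {"common": 999_998, "alsocommon": 1, "rareword": 1}
rate = weighted_failure_rate(freq, ["rareword"])
# One occurrence in a million total => 0.0001%
```

This is why the frequency-adjusted number is so much lower: the residual failures cluster in rare, obscure sequences.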
Perhaps an even bigger achievement of the hackfest is that, thanks to Jonathan's
help, I now understand how the Indic scripts work, know what the technical
terms mean, and can reason about them!
Good times. We should do this again. Tentatively planning for late July in
Toronto when Mozilla developers will be in town.
PS. I'm leaning towards shutting down the harfbuzz-indic list and using the
main list for all communication. Any objections?