[HarfBuzz] Hackfest report
Pravin Satpute
psatpute at redhat.com
Mon May 28 00:32:31 PDT 2012
On Monday 28 May 2012 07:04 AM, Behdad Esfahbod wrote:
> Hello HarfBuzz lists,
>
> I promised to write a short report about the hackfest earlier this month.
> Here it is.
>
> Jonathan Kew (Mozilla) and I met at the Google Zurich offices on May 9..11 for
> the HarfBuzz Massala Hackfest. We got together for three days of 12+ hours
> intense hacking on the new HarfBuzz Indic shaper, using a Wikipedia word list
> as the test suite.
A few months back I extracted words from a Wikipedia dump for the
indic-typing-booster project and found that most of the words in
Wikipedia are auto-transliterated, which is why there were lots of
invalid combinations.
Recently I had a conversation with the Wikipedia team about the word list,
and the answer I received was to use Wikisource, which provides digitized
books. These contain mostly validated words from books.
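
As a rough illustration of pre-filtering such a list, here is a minimal sketch that drops only one obvious class of invalid entries: words that begin with a combining mark (for example a dependent vowel sign with no base consonant). The function names are hypothetical, and this heuristic is an assumption for illustration; it does not catch every invalid combination produced by auto-transliteration.

```python
import unicodedata

# Unicode general categories for combining marks:
# Mn = nonspacing mark, Mc = spacing combining mark, Me = enclosing mark.
MARK_CATEGORIES = ("Mn", "Mc", "Me")


def starts_with_combining_mark(word):
    """Return True if the first character of the word is a combining mark.

    Hypothetical helper: a word that begins with a mark has no base
    character for that mark to attach to.
    """
    return bool(word) and unicodedata.category(word[0]) in MARK_CATEGORIES


def drop_obviously_invalid(words):
    """Keep only non-empty words that do not start with a combining mark."""
    return [w for w in words if w and not starts_with_combining_mark(w)]
```

For example, a word consisting of U+093E (DEVANAGARI VOWEL SIGN AA) followed by U+0915 (DEVANAGARI LETTER KA) would be dropped, while the reverse order would be kept.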
>
> We started with the Devanagari script, testing against Uniscribe (Windows 7's
> implementation). Initially we were failing on 35% of the words in the list.
> Three days, 86 commits, and dozens of cups of coffee later, we got down to 0.08%.
> Out of the ~700,000 words, we disagree only on 560. Of those 560, many are
> invalid or meaningless Devanagari sequences, not character combinations that
> ever occur in correctly-spelled words. In these cases we are less concerned to
> precisely match Uniscribe's behavior.
Excellent achievement, congrats :)
I will go through the 560 words and will update you if I find anything interesting.
>
> We discovered a number of bugs or peculiarities in Uniscribe. We can do
> better in some of those cases (and we do). But for testing purposes, we added
> a "uniscribe-bug-compatibility" mode to the Indic shaper. The numbers above
> were in that mode.
>
> We then tested Gujarati, a script very similar to Devanagari. Failures were
> at a surprisingly low 0.015%!
I think Punjabi and Tamil should also give good results with the current
fixes for the Devanagari script.
>
> To summarize:
>
> Last year I conjectured that if we have an extensive word list for each
> script, we can test our shaper against Uniscribe, and use the number of
> failures, and the failing cases themselves, to guide us perfecting our shaper.
> I asked the internationalization team at Google to create a word list per
> language out of Wikipedia content, and I received those lists last month. In
> this hackfest we put the idea to test, and it worked out very well! From now
> on, we can take one script at a time, look at the failures, hammer the number
> down to sub 0.1% and move to the next.
I would say this is the best approach. Thinking about all scripts at once is a pain.
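
The word-list comparison described above can be sketched roughly as follows. This is only an illustration, not the tooling used at the hackfest: `shape_candidate` and `shape_reference` are hypothetical callables standing in for the HarfBuzz shaper and Uniscribe, and each is assumed to return a comparable glyph sequence for a word.

```python
def compare_shapers(words, shape_candidate, shape_reference):
    """Shape every word with both shapers and collect the disagreements.

    Both shaper arguments are hypothetical callables (word -> glyph
    sequence). Returns the list of failing words and the failure rate
    as a percentage of all words tested.
    """
    failures = []
    total = 0
    for word in words:
        total += 1
        if shape_candidate(word) != shape_reference(word):
            failures.append(word)
    rate = 100.0 * len(failures) / total if total else 0.0
    return failures, rate


def below_target(rate_percent, target_percent=0.1):
    """True once a script's failure rate is under the target (0.1% above)."""
    return rate_percent < target_percent
```

With the Devanagari figures quoted above, 560 disagreements out of roughly 700,000 words works out to 100 * 560 / 700000 = 0.08%, which is under the 0.1% target. The list of failing words is returned alongside the rate because the failing cases themselves are what guide the fixes.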
>
>
> Good times. We should do this again. Tentatively planning for late July in
> Toronto when Mozilla developers will be in town.
Many people might not be able to join a face-to-face meeting; I will at
least try to be more active on IRC during that time.
>
>
> behdad
>
> PS. I'm leaning towards shutting down the harfbuzz-indic list and using the
> main list for all communication. Any objections?
It might not be required at present, but when we deploy harfbuzz-ng in
various projects, i.e. pango and icu, we might need it then.
Best Regards,
Pravin Satpute