[HarfBuzz] Hackfest report

Mon May 28 00:32:31 PDT 2012

On Monday 28 May 2012 07:04 AM, Behdad Esfahbod wrote:
> Hello HarfBuzz lists,
>
> I promised to write a short report about the hackfest earlier this month.
> Here it is.
>
> Jonathan Kew (Mozilla) and I met at the Google Zurich offices on May 9..11 for
> the HarfBuzz Massala Hackfest.  We got together for three days of 12+ hours of
> intense hacking on the new HarfBuzz Indic shaper, using a Wikipedia word list
> as the test suite.
A few months back I extracted words from a Wikipedia dump for the
indic-typing-booster project and found that most of the words in
Wikipedia are auto-transliterated, which is why there were lots of
invalid combinations.

Recently I had a conversation with the Wikipedia team about the word list,
and the answer I received was to use Wikisource, which provides digitized
books. These contain mostly validated words.

>
> We started with the Devanagari script, testing against Uniscribe (Windows 7's
> implementation).  Initially we were failing on 35% of the words in the list.
> Three days, 86 commits, and dozens of cups of coffee later, we got down to
> 0.08%.  Out of the ~700,000 words, we disagree only on 560.  Of those 560,
> many are invalid or meaningless Devanagari sequences, not character
> combinations that ever occur in correctly-spelled words.  In these cases we
> are less concerned to precisely match Uniscribe's behavior.
Excellent achievement, congrats :)
I will go through the 560 words and update you if I find anything interesting.
>
> We discovered a number of bugs or peculiarities in Uniscribe.  We can do
> better in some of those cases (and we do).  But for testing purposes, we added
> a "uniscribe-bug-compatibility" mode to the Indic shaper.  The numbers above
> were in that mode.
>
> We then tested Gujarati, a script very similar to Devanagari.  Failures were
> at a surprisingly low 0.015%!

I think Punjabi and Tamil should also give good results with the current
fixes for the Devanagari script.
>
> To summarize:
>
> Last year I conjectured that if we have an extensive word list for each
> script, we can test our shaper against Uniscribe, and use the number of
> failures, and the failing cases themselves, to guide us in perfecting our
> shaper.  I asked the internationalization team at Google to create a word
> list per language out of Wikipedia content, and I received those lists last
> month.  In this hackfest we put the idea to the test, and it worked out very
> well!  From now on, we can take one script at a time, look at the failures,
> hammer the number down to sub 0.1% and move to the next.

I would say this is the best approach. Thinking about all scripts at once is a pain.
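
For what it is worth, the word-list workflow quoted above could be sketched
roughly as follows in Python. This is only an illustration under stated
assumptions: compare_shapers, failure_rate, and the two shaping callables are
hypothetical names, not part of HarfBuzz or Uniscribe. In a real run each
callable would wrap an actual shaper and return its glyph output for a word.

```python
from typing import Callable, Iterable, List, Tuple


def failure_rate(failures: int, total: int) -> float:
    """Return the percentage of words on which the two shapers disagree."""
    if total == 0:
        return 0.0
    return 100.0 * failures / total


def compare_shapers(
    words: Iterable[str],
    shape_test: Callable[[str], object],       # hypothetical: shaper under test
    shape_reference: Callable[[str], object],  # hypothetical: reference shaper
) -> Tuple[List[str], float]:
    """Shape every word with both callables and collect the disagreements.

    Returns the list of failing words (to inspect one by one) and the
    failure percentage (to track progress between commits).
    """
    failing: List[str] = []
    total = 0
    for word in words:
        total += 1
        if shape_test(word) != shape_reference(word):
            failing.append(word)
    return failing, failure_rate(len(failing), total)
```

The failing list is what gets reviewed by hand, and the percentage is the
number to hammer down per script. As a check on the figures quoted above,
failure_rate(560, 700000) comes out at about 0.08.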
>
>
> Good times.  We should do this again.  Tentatively planning for late July in
> Toronto when Mozilla developers will be in town.

Many people may not be able to join a face-to-face meeting; at least I will
try to be more active on IRC during that time.

>
>
> behdad
>
> PS.  I'm leaning towards shutting down the harfbuzz-indic list and using the
> main list for all communication.  Any objections?

It may not be required at present, but when we deploy harfbuzz-ng in
various projects, i.e. pango and icu, we might need it then.

Best Regards,
Pravin Satpute