[HarfBuzz] Language Modularization?

Theppitak Karoonboonyanan thep at linux.thai.net
Thu Nov 6 17:03:45 PST 2008


On Thu, Nov 6, 2008 at 11:12 PM, Ed Trager <ed.trager at gmail.com> wrote:

> Have you and your friends ever thought about writing a new
> *extensible* word segmentation system to replace libThai that would
> handle not only Thai, but also Lao, Khmer, Burmese and eventually even
> other orthographies of Southeast Asia such as คำเมือง ?

What I have gathered so far from our neighboring countries can be
summarized as a plain API like pango_break() or pango_get_log_attrs().

It seems only Thai needs a dictionary-based algorithm. The others don't.

- Lao, the script closest to Thai, has simplified its writing system
  to be phonetic. Word breaks can be found solely by syllabic
  rules.

- Other scripts, including Myanmar and Khmer, have adopted an Indic
  encoding scheme, which already carries intrinsic information on
  syllable boundaries. So, word breaks can also be found by a
  rule-based approach. (Confirmed for Myanmar, at least.)

- Lanna (the คำเมือง you mentioned) should use a similar approach to
  Myanmar. I don't know much about its details.
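
To make "syllabic rules" a bit more concrete, here is a minimal sketch
in C of a single hard-coded rule for Lao: a leading vowel (U+0EC0 to
U+0EC4) is written before the initial consonant, so a syllable can be
assumed to start there. The function name and the can_break[] output
array are hypothetical, the input is assumed to be already decoded
into code points, and one rule is of course far from a complete
syllable breaker. A syllable boundary is also not necessarily a word
boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Lao vowels written before the initial consonant (U+0EC0..U+0EC4). */
static int is_lao_prevowel(uint32_t cp)
{
    return cp >= 0x0EC0 && cp <= 0x0EC4;
}

/* Hypothetical helper, not part of Pango or libthai.
 * Sets can_break[i] to 1 when a syllable may start at text[i].
 * Only one rule is applied: a leading vowel opens a new syllable.
 * Position 0 is never marked, since nothing precedes it. */
static void mark_lao_prevowel_breaks(const uint32_t *text, size_t len,
                                     unsigned char *can_break)
{
    assert(text != NULL && can_break != NULL);
    for (size_t i = 0; i < len; i++) {
        can_break[i] = (i > 0 && is_lao_prevowel(text[i])) ? 1 : 0;
    }
}
```

A real implementation would add rules for finals, tone marks and
consonant clusters, and would report its result through whatever
attribute structure the surrounding API expects.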

> Ideally, such a system would itself allow for "pluggable" methods, and
> would be fully based on Unicode.  So if someone invents a
> better/faster/smaller/more accurate algorithm for Thai segmentation,
> they could just wrap their algorithm in a class that would just plug
> in to such a system.

Pango already provides such an excellent framework, I think.
And as Behdad said, HarfBuzz is just for shaping, so this may be
out of scope here.

> Such a system would also provide standard containers for the
> dictionaries needed for segmentation of Thai, Khmer, and others.
>
> What do you and others think?

Perhaps you could rephrase the problem as defining a rule-based
framework, rather than a dictionary-based one. That framework
would sit under a more generic API like pango_break(), which
also allows dictionary-based implementations.
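
The layering described here could be sketched roughly as follows, in
plain C. Nothing in it is Pango's actual interface: generic_break(),
the break_func type and the per-script table are hypothetical names,
the "dictionary" is a toy list of ASCII strings standing in for real
Thai words, and the dictionary breaker uses simple greedy longest
matching, which is only one of several possible strategies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical interface: every breaker fills can_break[0..len-1]. */
typedef void (*break_func)(const uint32_t *text, size_t len,
                           unsigned char *can_break);

/* Toy rule-based breaker: allow a break before a Lao leading vowel. */
static void rule_breaker(const uint32_t *text, size_t len,
                         unsigned char *can_break)
{
    for (size_t i = 0; i < len; i++)
        can_break[i] = (i > 0 && text[i] >= 0x0EC0 && text[i] <= 0x0EC4);
}

/* Toy dictionary: two made-up "words" spelled with ASCII letters. */
typedef struct { const uint32_t *cps; size_t len; } dict_word;
static const uint32_t WORD_A[] = { 'a', 'b' };
static const uint32_t WORD_B[] = { 'a', 'b', 'c' };
static const dict_word TOY_DICT[] = { { WORD_A, 2 }, { WORD_B, 3 } };

/* Toy dictionary-based breaker: greedy longest match. A character
 * that starts no dictionary word is treated as a one-character word. */
static void dict_breaker(const uint32_t *text, size_t len,
                         unsigned char *can_break)
{
    size_t pos = 0;
    memset(can_break, 0, len);
    while (pos < len) {
        size_t best = 1;
        for (size_t w = 0; w < sizeof TOY_DICT / sizeof TOY_DICT[0]; w++) {
            size_t n = TOY_DICT[w].len;
            if (n > best && n <= len - pos &&
                memcmp(text + pos, TOY_DICT[w].cps,
                       n * sizeof(uint32_t)) == 0)
                best = n;
        }
        pos += best;
        if (pos < len)
            can_break[pos] = 1;   /* a new word starts here */
    }
}

/* Per-script registry: both kinds of breaker share one signature. */
typedef struct { const char *script; break_func breaker; } breaker_entry;
static const breaker_entry BREAKERS[] = {
    { "Lao",  rule_breaker },
    { "Thai", dict_breaker },
};

/* Generic entry point, in the role the text gives to pango_break().
 * Returns 0 on success, -1 if no breaker is registered for script. */
static int generic_break(const char *script, const uint32_t *text,
                         size_t len, unsigned char *can_break)
{
    assert(script != NULL && text != NULL && can_break != NULL);
    for (size_t i = 0; i < sizeof BREAKERS / sizeof BREAKERS[0]; i++) {
        if (strcmp(BREAKERS[i].script, script) == 0) {
            BREAKERS[i].breaker(text, len, can_break);
            return 0;
        }
    }
    return -1;
}
```

The caller only sees generic_break(); whether a script is served by
rules or by a dictionary is a detail of the registry entry.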

> Would there be interest in organizing a conference to examine these
> issues and work collaboratively to provide a unified solution?

I don't think much complexity is needed in such a framework, unless
you want to create some syntactic building blocks for describing
the rules, so that new languages can be added easily. Other than
that, the rules can be hard-coded, I suppose.
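
One simple form such building blocks might take is a character-class
table: each code point is mapped to a class, and a small matrix says
whether a break is allowed between two adjacent classes. The sketch
below is in C; the class names, the function names and the table
entries are illustrative placeholders only, not a vetted rule set for
Lao or any other language.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical character classes for illustration. */
enum char_class { CLS_OTHER, CLS_LAO, CLS_LAO_PREVOWEL, CLS_COUNT };

static enum char_class classify(uint32_t cp)
{
    if (cp >= 0x0EC0 && cp <= 0x0EC4) return CLS_LAO_PREVOWEL;
    if (cp >= 0x0E80 && cp <= 0x0EFF) return CLS_LAO;  /* Lao block */
    return CLS_OTHER;
}

/* BREAK_BETWEEN[prev][next] is 1 when a break is allowed between a
 * character of class prev and a following character of class next.
 * The entries are placeholders chosen for the example. */
static const unsigned char BREAK_BETWEEN[CLS_COUNT][CLS_COUNT] = {
    /* next:            OTHER  LAO  PREVOWEL */
    /* OTHER        */ { 0,     1,   1 },
    /* LAO          */ { 1,     0,   1 },
    /* LAO_PREVOWEL */ { 1,     0,   0 },
};

/* Sets can_break[i] to 1 when the table allows a break before text[i].
 * Position 0 is never marked. */
static void apply_rule_table(const uint32_t *text, size_t len,
                             unsigned char *can_break)
{
    assert(text != NULL && can_break != NULL);
    for (size_t i = 0; i < len; i++) {
        can_break[i] = (i == 0)
            ? 0
            : BREAK_BETWEEN[classify(text[i - 1])][classify(text[i])];
    }
}
```

Adding a language then means adding classes and table rows rather
than new control flow, at the cost of only expressing rules that look
at two adjacent characters.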

Regards,
-- 
Theppitak Karoonboonyanan
http://linux.thai.net/~thep/

