[HarfBuzz] Language Modularization?

Jens Herden jens at khmeros.info
Fri Nov 7 00:53:01 PST 2008


On Friday 07 November 2008, Theppitak Karoonboonyanan wrote:
> On Thu, Nov 6, 2008 at 11:12 PM, Ed Trager <ed.trager at gmail.com> wrote:
> > Have you and your friends ever thought about writing a new
> > *extensible* word segmentation system to replace libThai that would
> > handle not only Thai, but also Lao, Khmer, Burmese and eventually even
> > other orthographies of Southeast Asia such as คำเมือง ?
>
> What I have got so far from my neighbor countries may be summarized
> into a plain API like pango_break() or pango_get_log_attrs().
>
> It seems only Thai needs a dictionary-based algorithm. Others don't.

AFAIK this is not correct.

>
> - Lao, the closest implementation to Thai, has simplified its writing
>   system to be phonetic-based. Word break can be achieved solely by
>   syllabic rules.
>
> - Other scripts, including Myanmar and Khmer, have adopted Indic
>   encoding scheme, which already has intrinsic information on syllable
>   boundaries. So, word break can also be achieved by rule-based
>   approach. (Confirmed for Myanmar, at least.)

While it is easy to find the syllable breaks in Khmer, it is not easy to find
the word breaks, because many words consist of more than one syllable.
You need a dictionary-based approach for good word breaking in Khmer.
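
To illustrate the difference, here is a minimal sketch of one common
dictionary-based technique, greedy longest matching over syllables that have
already been split by rules. It is not how libThai or Pango does it. The
function name, the max_word_syllables limit and the romanized toy entries are
all hypothetical placeholders, not real Khmer data.

```python
def segment_words(syllables, dictionary, max_word_syllables=4):
    """Group pre-split syllables into words by greedy longest match.

    `syllables` is a list of syllable strings (assumed to come from a
    rule-based syllable breaker). `dictionary` is a set of tuples, each
    tuple holding the syllables of one known word.
    """
    words = []
    i = 0
    while i < len(syllables):
        longest = min(max_word_syllables, len(syllables) - i)
        # Try the longest candidate first, then shrink it.
        for n in range(longest, 0, -1):
            candidate = tuple(syllables[i:i + n])
            # An unknown single syllable is accepted as a fallback word.
            if n == 1 or candidate in dictionary:
                words.append(candidate)
                i += n
                break
    return words


# Hypothetical toy dictionary with romanized placeholder syllables.
TOY_DICTIONARY = {("phea", "sa"), ("khmae",)}

result = segment_words(["phea", "sa", "khmae", "la"], TOY_DICTIONARY)
print(result)
```

Syllable rules alone would return four separate units here. Only the
dictionary lookup can tell that the first two syllables belong to one word.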

Cheers Jens


