[HarfBuzz] Language Modularization?

Javier SOLA javier at khmeros.info
Fri Nov 7 16:51:30 PST 2008


Hi Tep, Ed,

I am not subscribed to the harfbuzz list, so this message will not make 
it there.

I can confirm that syllable line-breaking is not correct for either 
Khmer or languages written with Myanmar script, including Burmese.

Syllable breaking is done in Burmese newspapers, which have very thin 
columns, but it is not desirable. In Khmer it is not acceptable. I don't 
know about Lao, but I assume that it would always be preferable to break 
between words, and not between syllables. Again, newspaper practices 
reflect the newspapers' own constraints, not good script usage.

We have been testing (with Jens Herden, in cc) dictionary-based line 
breaking for Khmer in ICU, reusing the algorithm that is already there 
for Thai. We will be integrating it in mainstream OpenOffice as soon as 
possible (OpenOffice is now upgrading to ICU 4.0, which makes it much 
easier).

Burmese is more complex, as graphemes and syllables are different 
there (in many cases one syllable spans two graphemes). The Unicode 
encoding for Myanmar is not yet final (character order), so it is still 
difficult to do any work on this front. The final order needs to take 
into account several minority languages (Sgaw Karen, Shan, Mon, etc.), 
and that is not easy. A new proposal is being prepared.

It is important to understand that so far (for Thai), line breaks and 
word boundaries are found together (same places). The result of the 
line-breaking is used by spell-checkers. Using syllable breaks divides 
words into pieces and breaks spell-checking, while dictionary-based 
breaking gives correct spell-checking (we have already tested this in 
OpenOffice).
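
To make the difference concrete, here is a rough sketch in Python. It is 
not the ICU code: it uses plain greedy longest-match against a word 
list, which is only the simplest form of dictionary-based breaking. The 
names (segment, wrap, TOY_DICTIONARY) and the romanized entries are 
made up for illustration. Real input would be Khmer or Thai text with a 
real dictionary.

```python
# Hypothetical toy word list (romanized for readability, not real data).
TOY_DICTIONARY = {"sala", "rien", "salarien"}


def segment(text, dictionary):
    """Greedy longest-match word segmentation.

    At each position, take the longest dictionary word that starts there.
    If nothing matches, emit a single character and move on.
    """
    max_len = max((len(w) for w in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: one character
        # Try candidates from longest to shortest (length >= 2).
        for j in range(min(len(text), i + max_len), i + 1, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        words.append(match)
        i += len(match)
    return words


def wrap(units, width):
    """Fill lines with units, breaking only between units.

    A unit longer than the width is kept whole on its own line.
    """
    lines, current = [], ""
    for unit in units:
        if current and len(current) + len(unit) > width:
            lines.append(current)
            current = unit
        else:
            current += unit
    if current:
        lines.append(current)
    return lines


text = "salarienrien"

# Breaking at syllables can split the word "salarien" across two lines.
by_syllable = wrap(["sa", "la", "rien", "rien"], 6)

# Breaking at dictionary words keeps it whole, and the same word list
# is what a spell-checker would receive as tokens.
by_word = wrap(segment(text, TOY_DICTIONARY), 6)
```

With the toy data above, by_syllable comes out as "sala" / "rien" / 
"rien", so a spell-checker fed those pieces never sees "salarien", while 
by_word comes out as "salarien" / "rien".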

Regards,

Javier



Theppitak Karoonboonyanan wrote:
> On Fri, Nov 7, 2008 at 3:53 PM, Jens Herden <jens at khmeros.info> wrote:
>   
>> On Freitag 07 November 2008, Theppitak Karoonboonyanan wrote:
>>     
>>> It seems only Thai needs dictionary-based algorithm. Others don't.
>>>       
>> AFAIK this is not correct.
>>     
>
> Hmm.. But that's summarized from what I've been told in regional
> conferences, with report papers.
>
> Probably, the advantage of Indic encoding scheme has been
> over-focused when talking to a Thai guy like me.. ;-)
>
>   
>>> - Lao, the closest implementation to Thai, has simplified its writing
>>> system to be phonetic-based. Word break can be achieved solely by syllabic
>>> rules.
>>>
>>> - Other scripts, including Myanmar and Khmer, have adopted Indic
>>>   encoding scheme, which already has intrinsic information on syllable
>>>   boundaries. So, word break can also be achieved by rule-based
>>>   approach. (Confirmed for Myanmar, at least.)
>>>       
>> While it is easy to find the syllable breaks in Khmer it is not easy to find
>> the word breaks, because many words are made by more than one syllable.
>> You need a dictionary based approach for Khmer for good word breaking though.
>>     
>
> Does this include line wrapping? Is wrapping lines at syllable
> boundaries OK for Khmer? (I've been told it's acceptable for other
> languages.)
>
>   



