Adding Extension for Experimental Thai Spelling

Thu Sep 27 10:55:36 PDT 2012

On Thu, 27 Sep 2012 21:08:13 +0700
Nathan Wells <sungkhum at gmail.com> wrote:

> Firstly, you are right, I was mistaken about ICU and the breakiterator
> working for sentences (I just tried it right now and it does work,
> but not with the normal Khmer "khan" or "period"; rather, it works
> only with Latin sentence markers, which is not enough).  I had
> thought when we put in the code for the breakiterator that it also
> covered the sentence, but I guess not (I will work towards getting it
> working for Khmer).

It may be worth modifying the CLDR definition - sentence breaks can be
customised, though it is presently only done for Greek.  However, if
you want Khmer *sentence* rather than *clause* breaking, it will need a
lot of work - papers are still being published on breaking Thai into
sentences (e.g. www.mt-archive.info/Coling-2010-Slayden.pdf ).
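
As a rough illustration of the difference (this is not ICU or CLDR rule
syntax, just a Python sketch), clause breaking on terminators is cheap
once the Khmer khan (U+17D4) is added to the Latin markers.  The name
split_clauses is hypothetical, and the sketch deliberately ignores
abbreviations, quotations and everything else that makes true sentence
breaking hard.

```python
import re

KHAN = "\u17d4"               # KHMER SIGN KHAN
TERMINATORS = ".!?" + KHAN    # Latin markers plus khan (assumed set)

# A run of non-terminators followed by any terminators that close it.
CLAUSE = re.compile("[^%s]+[%s]*" % (TERMINATORS, TERMINATORS))

def split_clauses(text):
    """Split text into clauses at Latin terminators and the Khmer khan."""
    return [m.strip() for m in CLAUSE.findall(text) if m.strip()]

print(split_clauses("One. Two?"))                 # ['One.', 'Two?']
print(split_clauses("aaa" + KHAN + " bbb" + KHAN))
```

Text such as "e.g. this" is split after each full stop, which is one
reason this counts as clause breaking at best.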

> In response to your comments:
> 
> > 1) The user always marks word breaks with ZWSP.
> > In this case, the ideal is to switch off the break iterator for the
> > language.
> 
> 
> There is some truth to this - and that is why I had it as my last
> option (just turning the whole thing off). But the ICU breakiterator
> for Khmer actually works quite well with normal language - it breaks
> down when there are proper names. So turning it off is an option, but
> not the most ideal solution. Some users will continue to always mark
> breaks with a ZWSP (for full control), but I also think having the
> option to turn it off for more complex sentences would be ideal.
> 
> > 2) The user never marks word breaks.
> > In this case, the user is totally dependent on the break iterator,
> > and cannot be helped when it fails.
> 
> As I said above, I think a both/and solution would be ideal for Khmer.
> But if in the end it would work better for Thai to have an "off" and
> "on" option only, that would be fine for Khmer as well for now, until
> we can come up with a more ideal solution.
> 
> 
> > 3) The user only marks word breaks and non-word breaks when the
> > iterator fails.
> 
> The problem with this in Khmer is the user cannot tell when the
> breakiterator fails, unless it is on a line-break.  A word could be
> broken up into three parts and the user would never know it.

I usually notice iterator failures in Thai with unrecognised words,
which prompts red ink over strange extents. Usually the words are not
recognised because they're misspelt, but not always.  The problem I see
in Thai is usually not so much extra word boundaries as misplaced
word boundaries.

> Actually, if users could see where the
> breakiterator is breaking words, that would simplify things a lot.

That is a very significant observation.
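
A minimal sketch of what showing the breaks could mean, assuming the
iterator reports boundaries as character offsets into the text.  The
function name show_breaks and the "|" marker are hypothetical; nothing
here is existing LibreOffice or ICU code.

```python
def show_breaks(text, boundaries, marker="|"):
    """Return text with a visible marker at each interior boundary offset."""
    pieces, prev = [], 0
    for b in sorted(set(boundaries)):
        if 0 < b < len(text):      # skip the trivial start and end boundaries
            pieces.append(text[prev:b])
            prev = b
    pieces.append(text[prev:])
    return marker.join(pieces)

# Offsets as a misbehaving iterator might report them:
print(show_breaks("wordwordword", [0, 2, 3, 4, 8, 12]))  # wo|r|d|word|word
```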

> The only problem with this would be at the beginning of a document or
> the beginning of any new "re-syncing" segment because you might run
> into something like this:

> User input (example in English so others can make sense of it I hope):
> wordwordwordwordword.
> How the sentence is broken up by the breakiterator: wo r d word word
> wo rd word.
> User adds ZWSP to fix broken word on line-break: wo r d word word
> ZWSPwordword.

This example confuses me.  The problem here seems to be extra word
breaks rather than missing word breaks, and I don't see how confirming
a word break helps.

> But user has no idea the first word is broken incorrectly and that it
> is also spelled incorrectly.

> This is why it would be best (I think) as Martin suggested that when
> a ZWSP is detected it also turn off break iteration for the previous
> words up until a re-sync point.  This would practically give the user
> an "off" option for the whole document if they so choose, and without
> the confusion of having to find some option in the Tools menu to turn
> it on or off - it would just be automatic, depending on the user's
> habit.

I clearly did not make myself clear.  In the example above,
'wordwordwordwordword' is what I would call a dictionariless word - a
word-breaker without a dictionary (e.g. a shell's parser) would see it
as just one 'word'.  Therefore, once ZWSP is inserted and
word-breaking disabled, dictionary-based word-breaking is not applied to
wordwordwordZWSPwordword, and, typically, red squiggles appear under
wordwordword and wordword.  The boundary may be revealed by a phase
discontinuity or gap in the squiggle.  Under the proposed scheme, the
user has to introduce another three ZWSPs even if the dictionary
contains all the words.
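
The scheme can be sketched in Python as follows.  This is only an
illustration under stated assumptions: dictionariless words are taken
to be whitespace-delimited, the dictionary-based iterator is replaced
by a toy greedy longest-match breaker, and the names break_words and
dictionary_break are hypothetical.

```python
ZWSP = "\u200b"     # ZERO WIDTH SPACE: user-marked word break
WJ = "\u2060"       # WORD JOINER: user-marked non-break
ZWNBSP = "\ufeff"   # ZERO WIDTH NO-BREAK SPACE: user-marked non-break
MARKS = (ZWSP, WJ, ZWNBSP)

def dictionary_break(chunk, lexicon):
    """Toy greedy longest-match breaker standing in for the real iterator."""
    words, i = [], 0
    while i < len(chunk):
        for j in range(len(chunk), i, -1):
            if chunk[i:j] in lexicon:
                break
        else:
            j = i + 1   # nothing matched: emit a single character
        words.append(chunk[i:j])
        i = j
    return words

def break_words(text, lexicon):
    """Break text into words, deferring to the user's marks where present."""
    result = []
    for chunk in text.split():      # one chunk per dictionariless word
        if any(m in chunk for m in MARKS):
            # Iterator disabled for this chunk: break only at ZWSP.
            for piece in chunk.split(ZWSP):
                piece = piece.replace(WJ, "").replace(ZWNBSP, "")
                if piece:
                    result.append(piece)
        else:
            result.extend(dictionary_break(chunk, lexicon))
    return result

lexicon = {"word"}
print(break_words("wordwordword", lexicon))
# ['word', 'word', 'word']
print(break_words("wordwordword" + ZWSP + "wordword", lexicon))
# ['wordwordword', 'wordword']
```

The second call shows the cost described above: one ZWSP switches off
the dictionary for the whole run, so neither piece is broken further
and both would be flagged as misspelt.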

> I agree with this:
> 
> > Considering these four use cases, it seems simplest to let ZWSP, WJ
> > and ZWNBSP disable the iterator for the extent of the
> > dictionariless word in which it occurs.

> Except, it also should disable the breakiterator up to the previous
> re-sync point...

But that is what I meant!

> But actually, there is a rule in ICU for the MAIYAMOK
> so unless that is not working properly, I am not sure why LibreOffice
> doesn't break correctly...

I'll have to look further into this - and check whether the misbehaviour
is still happening.  Squiggly lines are what I chiefly remember.  There
may also be a Hunspell issue - the entries in the dictionary don't have
spaces before maiyamok.  The difference between finding word boundaries
and finding line boundaries may be significant here.

Richard.