Adding Extension for Experimental Thai Spelling

Richard Wordingham richard.wordingham at ntlworld.com
Thu Sep 27 10:55:36 PDT 2012


On Thu, 27 Sep 2012 21:08:13 +0700
Nathan Wells <sungkhum at gmail.com> wrote:

> Firstly, you are right, I was mistaken about ICU and the breakiterator
> working for sentences (I just tried it right now and it does work,
> but just not with the normal "khan" or "period" of Khmer rather it
> works with Latin sentence markers which is not enough).  I had
> thought when we put in the code for the breakiterator that it also
> covered the sentence, but I guess not (I will work towards getting it
> working for Khmer).

It may be worth modifying the CLDR definition - sentence breaks can be
customised, though it is presently only done for Greek.  However, if
you want Khmer *sentence* rather than *clause* breaking, it will need a
lot of work - papers are still being published on breaking Thai into
sentences (e.g. www.mt-archive.info/Coling-2010-Slayden.pdf ).
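
As a rough illustration of marker-based breaking only, here is a minimal
Python sketch that treats KHMER SIGN KHAN (U+17D4) and KHMER SIGN
BARIYOOSAN (U+17D5) as terminators alongside the Latin ones.  It is not
ICU or CLDR code; the function name and the choice of terminator set are
assumptions made for this example, and it splits only at explicit marks,
with no attempt at the harder segmentation problem the paper above
addresses.

```python
import re

# Assumed terminator set: Latin marks plus KHMER SIGN KHAN (U+17D4)
# and KHMER SIGN BARIYOOSAN (U+17D5).  Hypothetical, for illustration.
TERMINATORS = ".!?\u17d4\u17d5"

# A segment is a run of non-terminators followed by any terminators.
_SEGMENT = re.compile("[^{0}]+[{0}]*".format(re.escape(TERMINATORS)))


def break_at_terminators(text):
    """Split text after each run of terminator marks (hypothetical helper)."""
    pieces = (m.group(0).strip() for m in _SEGMENT.finditer(text))
    return [p for p in pieces if p]
```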

> In response to your comments:
> 
> > 1) The user always marks word breaks with ZWSP.
> > In this case, the ideal is to switch off the break iterator for the
> > language.
> 
> 
> There is some truth to this - and that is why I had it as my last
> option (just turning the whole thing off). But the ICU breakiterator
> for Khmer actually works quite well with normal language - it breaks
> down when there are proper names. So turning it off is an option, but
> not the most ideal solution. Some users will continue to always mark
> breaks with a ZWSP (for full control), but I also think having the
> option to turn it off for more complex sentences would be ideal.
> 
> > 2) The user never marks word breaks.
> > In this case, the user is totally dependent on the break iterator,
> > and cannot be helped when it fails.
> 
> As I said above, I think a both/and solution would be ideal for Khmer.
> But if in the end it would work better for Thai to have an "off" and
> "on" option only, that would be fine for Khmer as well for now, until
> we can come up with a more ideal solution.
> 
> 
> > 3) The user only marks word breaks and non-word breaks when the
> > iterator fails.
> 
> The problem with this in Khmer is the user cannot tell when the
> breakiterator fails, unless it is on a line-break.  A word could be
> broken up into three parts and the user would never know it.

I usually notice iterator failures in Thai with unrecognised words,
which prompts red ink over strange extents. Usually the words are not
recognised because they're misspelt, but not always.  The problem I see
in Thai is usually not so much extra word boundaries as misplaced
word boundaries.

> Actually, if users could see where the
> breakiterator is breaking words, that would simplify things a lot.

That is a very significant observation.

> The only problem with this would be at the beginning of a document or
> the beginning of any new "re-syncing" segment because you might run
> into something like this:

> User input (example in English so others can make sense of it I hope):
> wordwordwordwordword.
> How the sentence is broken up by the breakiterator: wo r d word word
> wo rd word.
> User adds ZWSP to fix broken word on line-break: wo r d word word
> ZWSPwordword.

This example confuses me.  The problem here seems to be extra word
breaks rather than missing word breaks, and I don't see how confirming
a word break helps.

> But user has no idea the first word is broken incorrectly and that it
> is also spelled incorrectly.

> This is why it would be best (I think) as Martin suggested that when
> a ZWSP is detected it also turn off break iteration for the previous
> words up until a re-sync point.  This would practically give the user
> an "off" option for the whole document if they so chose, and without
> the confusion of having to find some option in the Tools menu to turn
> it on or off - it would just be automatic, depending on the user's
> habit.

I clearly did not make myself clear.  In the example above,
'wordwordwordwordword' is what I would call a dictionariless word - a
word-breaker without a dictionary (e.g. a shell's parser) would see it
as just one 'word'.  Therefore, once ZWSP is inserted and
word-breaking disabled, dictionary-based word-breaking is not applied to
wordwordwordZWSPwordword, and, typically, red squiggles appear under
wordwordword and wordword.  The boundary may be revealed by a phase
discontinuity or gap in the squiggle.  Under the proposed scheme, the
user has to introduce another three ZWSPs even if the dictionary
contains all the words.
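
To make the scheme concrete, here is a minimal Python sketch of it.  It
is not LibreOffice or ICU code: the function names are invented for this
example, whitespace is assumed to delimit dictionariless words, and a
greedy longest-match lookup stands in for the real dictionary-based
break iterator.

```python
ZWSP = "\u200b"    # ZERO WIDTH SPACE: user marks a word break
WJ = "\u2060"      # WORD JOINER: user forbids a break
ZWNBSP = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE: user forbids a break
MARKS = (ZWSP, WJ, ZWNBSP)


def dictionary_break(run, dictionary):
    """Greedy longest-match segmentation (stand-in for a real iterator)."""
    words, i = [], 0
    while i < len(run):
        for j in range(len(run), i, -1):
            if run[i:j] in dictionary:
                break
        else:
            j = i + 1  # nothing matched: emit one character on its own
        words.append(run[i:j])
        i = j
    return words


def break_words(text, dictionary):
    """Break text into words, honouring user marks per dictionariless word."""
    words = []
    # Assumption: whitespace delimits dictionariless words.  None of the
    # three marks counts as whitespace for str.split().
    for run in text.split():
        if any(mark in run for mark in MARKS):
            # The user has taken control of this run: the dictionary is
            # not consulted, and only ZWSP creates a boundary.
            pieces = run.replace(WJ, "").replace(ZWNBSP, "").split(ZWSP)
            words.extend(p for p in pieces if p)
        else:
            words.extend(dictionary_break(run, dictionary))
    return words
```

With a dictionary containing only 'word', break_words applied to
'wordwordword' yields three words, whereas wordwordwordZWSPwordword
yields just 'wordwordword' and 'wordword', neither of which a
spell-checker would recognise.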

> I agree with this:
> 
> > Considering these four use cases, it seems simplest to let ZWSP, WJ
> > and ZWNBSP disable the iterator for the extent of the
> > dictionariless word in which it occurs.

> Except, it also should disable the breakiterator up to the previous
> re-sync point...

But that is what I meant!

> But actually, there is a rule in ICU for the MAIYAMOK
> so unless that is not working properly, I am not sure why LibreOffice
> doesn't break correctly...

I'll have to look further into this - and check whether the misbehaviour
is still happening.  Squiggly lines are what I chiefly remember.  There
may also be a Hunspell issue - the entries in the dictionary don't have
spaces before maiyamok.  The difference between finding word boundaries
and finding line boundaries may be significant here.

Richard.

