Adding Extension for Experimental Thai Spelling

Nathan Wells sungkhum at gmail.com
Thu Sep 27 07:08:13 PDT 2012


Thanks for your input Richard,

Firstly, you are right: I was mistaken about the ICU breakiterator working
for sentences. I just tried it, and it does work, but only with Latin
sentence markers and not with the normal Khmer "khan" (period), which is
not enough. I had thought that the breakiterator code we put in also
covered sentences, but I guess not (I will work towards getting it working
for Khmer).
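
To illustrate the gap, here is a minimal Python sketch (not ICU or LibreOffice code) of sentence splitting that treats the Khmer marks as terminators alongside the Latin ones. The terminator set and the function name `split_sentences` are assumptions made for this example only.

```python
import re

# Assumed terminator set for illustration: Latin . ! ? plus
# U+17D4 KHMER SIGN KHAN and U+17D5 KHMER SIGN BARIYOOSAN.
_SENTENCE = re.compile("[^.!?\u17d4\u17d5]+[.!?\u17d4\u17d5]*")


def split_sentences(text):
    """Split text after each run of terminators and trim surrounding spaces."""
    pieces = (m.group().strip() for m in _SENTENCE.finditer(text))
    return [p for p in pieces if p]
```

A real implementation would have to live in the ICU sentence rules; this only shows which characters would need to count as sentence ends.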

In response to your comments:

> 1) The user always marks word breaks with ZWSP.
> In this case, the ideal is to switch off the break iterator for the
> language.


There is some truth to this - and that is why I had it as my last option
(just turning the whole thing off). But the ICU breakiterator for Khmer
actually works quite well with normal language - it breaks down when there
are proper names. So turning it off is an option, but not the most ideal
solution. Some users will continue to always mark breaks with a ZWSP (for
full control), but I also think having the option to turn it off for more
complex sentences would be ideal.

> 2) The user never marks word breaks.
> In this case, the user is totally dependent on the break iterator, and
> cannot be helped when it fails.

As I said above, I think a both/and solution would be ideal for Khmer. But
if in the end it would work better for Thai to have only an "off" and "on"
option, that would be fine for Khmer as well for now, until we can
come up with a better solution.


> 3) The user only marks word breaks and non-word breaks when the iterator
> fails.

The problem with this in Khmer is the user cannot tell when the
breakiterator fails, unless it is on a line-break.  A word could be broken
up into three parts and the user would never know it. This is why the issue
is so complex. Actually, if users could see where the breakiterator is
breaking words, that would simplify things a lot. Though I still think the
option to turn the breakiterator "on" or "off" for certain sentences would
be ideal (especially sentences with a ton of proper nouns where the ICU
breakiterator for Khmer has the most trouble).
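
As a rough illustration of making the breaks visible, the Python sketch below inserts a visible marker at each break offset. The function name `show_breaks` and the idea of receiving the offsets as a plain list are assumptions for this example; it is not how LibreOffice exposes break positions.

```python
def show_breaks(text, breaks, marker="|"):
    """Return text with marker inserted at each interior break offset."""
    pieces = []
    prev = 0
    for pos in sorted(set(breaks)):
        # Ignore offsets at the very start or end: nothing to mark there.
        if 0 < pos < len(text):
            pieces.append(text[prev:pos])
            prev = pos
    pieces.append(text[prev:])
    return marker.join(pieces)
```

With offsets 2, 3, 4 and 8 on "wordwordword", this returns "wo|r|d|word|word", so a word broken into three parts becomes obvious even when it is nowhere near a line-break.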

As far as finding re-syncing points (when to turn the breakiterator back on
after it has been turned off by a ZWSP), I agree with you:

> The obvious re-synching points
> are word external punctuation, such as end-of-line, white space,
> quotation marks, commas and dandas (and as dandas I would include U+0E2F
> THAI CHARACTER PAIYANNOI in its role as angkhandeaw, as well as U+17D5
> KHMER SIGN BARIYOOSAN, though an exception may be worthwhile for Thai
> ฯลฯ and ฯเปฯ).


The only problem with this would be at the beginning of a document or the
beginning of any new "re-syncing" segment because you might run into
something like this:

User input (example in English so others can make sense of it, I hope):
    wordwordwordwordword.
How the sentence is broken up by the breakiterator:
    wo r d word word wo rd word.
User adds ZWSP to fix the broken word on the line-break:
    wo r d word word ZWSPwordword.
But the user has no idea the first word is broken incorrectly and that it
is also spelled incorrectly.

This is why it would be best (I think), as Martin suggested, that when a
ZWSP is detected it also turns off break iteration for the preceding words
back to the previous re-sync point.  This would practically give the user
an "off" option for the whole document if they so chose, without the
confusion of having to find some option in the Tools menu to turn it on or
off; it would just be automatic, depending on the user's habit.
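
The behaviour described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the re-sync set is a simplified subset of the characters Richard lists (no exception for the Thai abbreviations), `auto_break` is a hypothetical stand-in for the dictionary break iterator, and none of the names come from LibreOffice or ICU.

```python
import re

ZWSP, WJ, ZWNBSP = "\u200b", "\u2060", "\ufeff"
OVERRIDES = {ZWSP, WJ, ZWNBSP}

# Assumed re-sync characters: white space, quotes, comma, Latin full stop,
# U+0E2F THAI CHARACTER PAIYANNOI, U+17D4 KHMER SIGN KHAN,
# U+17D5 KHMER SIGN BARIYOOSAN.
RESYNC = re.compile("([\\s\"',.\u0e2f\u17d4\u17d5]+)")


def segment_words(text, auto_break):
    """Break text into words, one re-sync segment at a time.

    A segment containing ZWSP, WJ or ZWNBSP is split only at ZWSP and
    auto_break is not consulted for any part of it, including the part
    before the first override character.
    """
    words = []
    for segment in RESYNC.split(text):
        if not segment or RESYNC.fullmatch(segment):
            continue  # empty piece or a run of re-sync characters
        if OVERRIDES & set(segment):
            # WJ and ZWNBSP mean "no break here", so they are simply dropped.
            manual = segment.replace(WJ, "").replace(ZWNBSP, "")
            words.extend(w for w in manual.split(ZWSP) if w)
        else:
            words.extend(auto_break(segment))
    return words
```

A user who marks every break gets fully manual behaviour, while text with no override characters is still handled entirely by the iterator.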

I agree with this:

> Considering these four use cases, it seems simplest to let ZWSP, WJ and
> ZWNBSP disable the iterator for the extent of the dictionariless word
> in which it occurs.


Except, it also should disable the breakiterator up to the previous re-sync
point to enable users to functionally "turn off" the breakiterator if they
so choose (for Khmer this is necessary because a book editor like myself
will want to put the breaks in manually and not let the breakiterator do
anything automatically - but the feature is nice for the casual user
because it is much faster and more intuitive for Cambodians not to type
spaces between words).

> A related issue that seems not to be handled is repetition mark U+0E46
> THAI CHARACTER MAIYAMOK.  It should be separated from the preceding
> alphabetic characters by a space, but LibreOffice doesn't recognise
> the sequence as a possible continuation of the word.  Sometimes it
> is a necessary part of a word.  I don't know what the situation is in
> Khmer.


In Khmer the repeat character (U+17D7 LEK TOO) is not separated from the
preceding word by a space, but is connected, so this is not an issue for
us.  But actually, there is a rule in ICU for the MAIYAMOK so unless that
is not working properly, I am not sure why LibreOffice doesn't break
correctly...

Here's the code from ICU4C (icu4c-49_1_2) for the Thai MAIYAMOK from
dictbe.cpp if anyone is interested...

    if (uc == THAI_MAIYAMOK) {
        if (utext_previous32(text) != THAI_MAIYAMOK) {
            // Skip over previous end and MAIYAMOK
            utext_next32(text);
            utext_next32(text);
            wordLength += 1; // Add MAIYAMOK to word
        }
        // ... (excerpt ends here)
    }
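
For readers who do not want to trace the UText calls, the same check can be sketched in Python. This is a paraphrase of the excerpt only, not of the rest of dictbe.cpp; the function name `extend_with_maiyamok` and the use of plain string offsets are assumptions for the example.

```python
THAI_MAIYAMOK = "\u0e46"


def extend_with_maiyamok(text, word_end):
    """Return the end offset of a word that currently ends at word_end.

    If the character right after the word is MAIYAMOK and the word does
    not itself end in MAIYAMOK, the mark is added to the word.
    """
    if (0 < word_end < len(text)
            and text[word_end] == THAI_MAIYAMOK
            and text[word_end - 1] != THAI_MAIYAMOK):
        return word_end + 1  # add MAIYAMOK to the word
    return word_end
```

Note that the sketch only looks at the character immediately after the word, so a MAIYAMOK separated from the word by a space would not be picked up by this check.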


Thoughts?

-Nathan


On Thu, Sep 27, 2012 at 6:14 PM, Richard Wordingham <
richard.wordingham at ntlworld.com> wrote:

> On Thu, 27 Sep 2012 11:52:26 +0700
> Nathan Wells <sungkhum at gmail.com> wrote:
>
> >> 1. If you are shutting off the ICU breakiterator for text following,
> >> we should probably also do it for text preceding. Thus if there is a
> >> ZWSP or ZWNBSP (U+2060 WJ) anywhere in a text then ICU break
> >> iteration is disabled for the whole sentence.
>
> > Yes, I think you are right. If a ZWSP or ZWNBSP is detected then ICU
> > break iteration should be disabled for the whole sentence.
>
> What is the logic of this?
>
> The use cases I see are:
>
> 1) The user always marks word breaks with ZWSP.
>
> In this case, the ideal is to switch off the break iterator for the
> language.
>
> 2) The user never marks word breaks.
>
> In this case, the user is totally dependent on the break iterator, and
> cannot be helped when it fails.
>
> 3) The user only marks word breaks and non-word breaks when the iterator
> fails.
>
> In this case, the iterator need only be switched off from the point of
> override until it can clearly re-synch.  The obvious re-synching points
> are word external punctuation, such as end-of-line, white space,
> quotation marks, commas and dandas (and as dandas I would include U+0E2F
> THAI CHARACTER PAIYANNOI in its role as angkhandeaw, as well as U+17D5
> KHMER SIGN BARIYOOSAN, though an exception may be worthwhile for Thai
> ฯลฯ and ฯเปฯ).
>
> Now, it may be easier to explain the rule if it applies to the whole
> 'word' - for what we are looking at is pretty much a 'word' as
> understood by dictionariless editors.
>
> 4) Different parts of the text come from different sources - some mark
> word breaks, others expect the application to correctly identify them.
>
> A ZWSP in a chunk of text would then tag the text as having come from
> a user in case 1 or 3; we have no reliable way of distinguishing the
> two cases.  A WJ (U+2060) or ZWNBSP (U+FEFF) (when not a BOM, so
> paragraph initial is suspect) would strongly suggest use case 3 - but
> might occur in use case 1 if the user has had to fight a break
> iterator.
>
> (end of use cases)
>
> Considering these four use cases, it seems simplest to let ZWSP, WJ and
> ZWNBSP disable the iterator for the extent of the dictionariless word
> in which it occurs.
>
> What is the definition of an ICU sentence boundary?  I see no evidence
> from CLDR 2.9 that it should be even approximately right for Khmer (or
> Thai). Splitting Thai text into sentences is known to be challenging -
> we can therefore expect different applications to split text
> differently.
>
> The one downside I can see to my suggestion is that if all word
> boundaries are marked, switching the iterator off dictionariless word
> by dictionariless word will require slightly greater use of WJ, for a
> ZWSP later in the sentence will not necessarily be in the same
> dictionariless word.
>
> A related issue that seems not to be handled is repetition mark U+0E46
> THAI CHARACTER MAIYAMOK.  It should be separated from the preceding
> alphabetic characters by a space, but LibreOffice doesn't recognise
> the sequence as a possible continuation of the word.  Sometimes it
> is a necessary part of a word.  I don't know what the situation is in
> Khmer.
>
> Richard.
>