Adding Extension for Experimental Thai Spelling

Thu Sep 27 04:14:24 PDT 2012

On Thu, 27 Sep 2012 11:52:26 +0700
Nathan Wells <sungkhum at gmail.com> wrote:

>> 1. If you are shutting off the ICU breakiterator for text following,
>> we should probably also do it for text preceding. Thus if there is a
>> ZWSP or ZWNBSP (U+2060 WJ) anywhere in a text then ICU break
>> iteration is disabled for the whole sentence.

> Yes, I think you are right. If a ZWSP or ZWNBSP is detected then ICU
> break iteration should be disabled for the whole sentence.

What is the logic of this?

The use cases I see are:

1) The user always marks word breaks with ZWSP.

In this case, the ideal is to switch off the break iterator for the
language.

2) The user never marks word breaks.

In this case, the user is totally dependent on the break iterator, and
cannot be helped when it fails.

3) The user only marks word breaks and non-word breaks when the iterator
fails.

In this case, the iterator need only be switched off from the point of
override until it can clearly re-synch.  The obvious re-synching points
are word external punctuation, such as end-of-line, white space,
quotation marks, commas and dandas (and as dandas I would include U+0E2F
THAI CHARACTER PAIYANNOI in its role as angkhandeaw, as well as U+17D5
KHMER SIGN BARIYOOSAN, though an exception may be worthwhile for Thai
ฯลฯ and ฯเปฯ).

Now, it may be easier to explain the rule if it applies to the whole
'word' - for what we are looking at is pretty much a 'word' as
understood by dictionariless editors.

4) Different parts of the text come from different sources - some mark
word breaks, others expect the application to correctly identify them.

A ZWSP in a chunk of text would then tag the text as having come from
a user in case 1 or 3; we have no reliable way of distinguishing the
two cases.  A WJ (U+2060) or ZWNBSP (U+FEFF) (when not a BOM, so
paragraph initial is suspect) would strongly suggest use case 3 - but
might occur in use case 1 if the user has had to fight a break
iterator.
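
As a rough illustration of that inference, here is a minimal Python
sketch.  The function name marking_hint and the strings it returns are
hypothetical, made up for this example; they are not part of ICU or
LibreOffice.  A paragraph-initial U+FEFF is skipped on the assumption
that it may be a BOM.

```python
ZWSP, WJ, ZWNBSP = "\u200b", "\u2060", "\ufeff"

def marking_hint(paragraph):
    """Guess which use case a paragraph came from (hypothetical helper)."""
    # A paragraph-initial U+FEFF may be a BOM, so it is not counted.
    body = paragraph[1:] if paragraph.startswith(ZWNBSP) else paragraph
    if WJ in body or ZWNBSP in body:
        # Non-breaks are only needed when fighting an iterator.
        return "probably case 3"
    if ZWSP in body:
        # Cases 1 and 3 cannot be reliably told apart.
        return "case 1 or 3"
    return "no marks"
```

Text with no marks at all tells us nothing beyond the fact that the
iterator has to be relied on for that chunk.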

(end of use cases)

Considering these four use cases, it seems simplest to let ZWSP, WJ and
ZWNBSP disable the iterator for the extent of the dictionariless word
in which they occur.
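
A minimal Python sketch of this rule follows, under stated
assumptions.  The function name dictionaryless_words and the exact set
of re-synching characters (white space, a few quotation marks, comma,
U+0E2F and U+17D5) are my choices for illustration only; the exception
for ฯลฯ and ฯเปฯ is not handled, and no real break iterator is called.

```python
import re

ZWSP, WJ, ZWNBSP = "\u200b", "\u2060", "\ufeff"
OVERRIDES = {ZWSP, WJ, ZWNBSP}

# Assumed re-synching points: white space, quotation marks, comma,
# U+0E2F THAI CHARACTER PAIYANNOI and U+17D5 KHMER SIGN BARIYOOSAN.
RESYNC = re.compile("[\\s\"'\u2018\u2019\u201c\u201d,\u0e2f\u17d5]+")

def dictionaryless_words(text):
    """Yield (start, end, iterator_enabled) for each run of text
    lying between re-synching points."""
    pos = 0
    for m in RESYNC.finditer(text):
        if m.start() > pos:
            chunk = text[pos:m.start()]
            # Any override character disables the iterator for the run.
            yield pos, m.start(), not (OVERRIDES & set(chunk))
        pos = m.end()
    if pos < len(text):
        yield pos, len(text), not (OVERRIDES & set(text[pos:]))
```

Note that ZWSP, WJ and ZWNBSP are not matched by \s in Python, so they
stay inside the run they belong to rather than splitting it.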

What is the definition of an ICU sentence boundary?  I see no evidence
from CLDR 2.9 that it should be even approximately right for Khmer (or
Thai). Splitting Thai text into sentences is known to be challenging -
we can therefore expect different applications to split text
differently.

The one downside I can see to my suggestion is that if all word
boundaries are marked, switching the iterator off dictionariless word
by dictionariless word will require slightly greater use of WJ, for a
ZWSP later in the sentence will not necessarily be in the same
dictionariless word.

A related issue that seems not to be handled is the repetition mark
U+0E46 THAI CHARACTER MAIYAMOK.  It should be separated from the
preceding alphabetic characters by a space, but LibreOffice doesn't
recognise the sequence as a possible continuation of the word.
Sometimes it is a necessary part of a word.  I don't know what the
situation is in Khmer.
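
To make the intended behaviour concrete, here is a small Python sketch
of one way a tokeniser could treat the sequence as a continuation of
the word.  The function attach_maiyamok and its list-of-tokens input
are hypothetical; this is not how LibreOffice or ICU represent words.

```python
MAIYAMOK = "\u0e46"  # THAI CHARACTER MAIYAMOK

def attach_maiyamok(tokens):
    """Merge a maiyamok token, and one preceding space if present,
    into the token before it (hypothetical helper)."""
    out = []
    for tok in tokens:
        if tok == MAIYAMOK and out:
            if len(out) >= 2 and out[-1] == " ":
                # word, space, maiyamok -> one token "word ๆ"
                out.pop()
                out[-1] += " " + MAIYAMOK
            else:
                # maiyamok written without the separating space
                out[-1] += MAIYAMOK
        else:
            out.append(tok)
    return out
```

With this, the tokens for a word, a space and a maiyamok come back as
a single token, so a spelling checker would see the repetition mark as
part of the word.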

Richard.