Adding Extension for Experimental Thai Spelling

Thu Jul 26 15:53:57 PDT 2012

On Thu, 26 Jul 2012 16:33:00 +0700
Martin Hosken <martin_hosken at sil.org> wrote:

> 1. use of U+2060 makes string searching and spell checking harder
> (unless WJ chars are stripped for searching and spell checking). They
> are not part of the spelling of a word, so their introduction in the
> underlying text stream is problematic for other text processing
> processes (like searching as mentioned). This is less of an issue for
> U+200B ZWSP because that occurs between words and searching across
> word boundaries is a rarer activity. Likewise spell checking across
> word boundaries isn't really needed.

U+2060 WJ should definitely be skipped for searching and, once it has
done its gluing job, for spell-checking look-up, just like U+00AD SOFT
HYPHEN.  Both are indubitably completely ignorable for collation, and
therefore for UCA (Unicode Collation Algorithm) search.
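
To illustrate the skipping described above, here is a minimal sketch
that drops WJ and SOFT HYPHEN before a plain substring search or a
dictionary look-up.  The function names are hypothetical, and a real
UCA search would ignore many more characters than these two; this only
shows the idea.

```python
# Sketch only: strip the two characters named above before matching.
# strip_ignorables and contains_ignoring are illustrative names, not
# part of any existing library.
WORD_JOINER = "\u2060"
SOFT_HYPHEN = "\u00ad"

# str.translate deletes any code point that is mapped to None.
_IGNORABLE = {ord(WORD_JOINER): None, ord(SOFT_HYPHEN): None}


def strip_ignorables(text: str) -> str:
    """Return text with every WJ and SOFT HYPHEN removed."""
    return text.translate(_IGNORABLE)


def contains_ignoring(haystack: str, needle: str) -> bool:
    """Substring search that skips WJ and SOFT HYPHEN on both sides."""
    return strip_ignorables(needle) in strip_ignorables(haystack)
```

A spell checker could apply strip_ignorables to each word after the
break iterator has run, so that WJ has already done its gluing job.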

> Now what happens if I want to put zw around a word that occurs < 20
> chars after my last zw? The on off nature of the zw has now been
> inverted. One option is to say that zw must always occur in pairs and
> you would have to bracket your first or second word there. But then
> management of which zw is on and which is off will get confusing for
> users.

I think that is the wrong way of looking at it.  Various characters,
some ZWSP, others more natural, such as SP, tell the break iterators
where some word boundaries are.  The rule we would have is that the
break iterator should not try to break a run of fewer than, say, 20
characters if at least one of its boundaries is provided by ZWSP.  I am not
proposing that we limit how many breaks it makes in a run - 21
characters could be broken into seven words.  The short runs the break
iterator is prohibited from breaking can still be checked for spelling.
If they are not words, then the user can respond to the red wiggly line
appropriately, e.g. by putting extra word breaks in.
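
The rule above can be sketched roughly as follows.  This is only an
illustration under stated assumptions: it treats SP and ZWSP as the
only explicit boundaries, counts code points rather than anything
subtler, and takes the dictionary-based breaker as a parameter.  The
names segment and toy_breaker are hypothetical, and toy_breaker is a
stand-in for a real Thai break iterator.

```python
import re

ZWSP = "\u200b"
MIN_RUN = 20  # the "say, 20 characters" threshold; an assumption


def toy_breaker(run):
    """Stand-in for a dictionary breaker: cuts a run into 3-char pieces."""
    return [run[i:i + 3] for i in range(0, len(run), 3)]


def segment(text, dictionary_break=toy_breaker, min_run=MIN_RUN):
    """Split text into words, leaving short ZWSP-bounded runs unbroken."""
    # With a capturing group, re.split returns runs at even indices and
    # the delimiter that followed each run at odd indices.
    parts = re.split("([ \u200b])", text)
    words = []
    for i in range(0, len(parts), 2):
        run = parts[i]
        if not run:
            continue  # empty run between two adjacent delimiters
        left = parts[i - 1] if i > 0 else None
        right = parts[i + 1] if i + 1 < len(parts) else None
        # Protected: shorter than the threshold and at least one
        # boundary supplied by ZWSP.
        if len(run) < min_run and ZWSP in (left, right):
            words.append(run)
        else:
            words.extend(dictionary_break(run))
    return words
```

In this sketch a protected run is returned whole, so it can still be
handed to the spell checker as a single candidate word.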

In the example you gave, one would have to split by hand the words
lying between the delimited words.  I think the users must accept that -
the rule we would be working with is that the break iterator does not
break short runs created by inserted ZWSP, and that is a simple rule to
understand.  I suppose there may be some question of what to count -
base consonants perhaps? (In Unicode jargon, the nearest unit would be
default extended grapheme clusters.)  That might be a luxury feature we
never need to add.

Richard.