Adding Extension for Experimental Thai Spelling

Wed Sep 26 21:45:04 PDT 2012

Dear Nathan,

> Here are some new ideas, ordered by desirability, with number one being the
> most desired, to number three being the least.
> 
> 1) When a zero-width space is detected (U+200B), shut off ICU breakiterator
> for Khmer spell checking for characters following the zero-width space
> until it encounters a real space (U+0020) or end of sentence (detect end of
> sentence using ICU Sentence Boundary).

I think this is a good direction to head in. I have two follow-on comments:

1. If you are shutting off the ICU breakiterator for the text following, we should probably also do it for the text preceding. Thus if there is a ZWSP (U+200B) or WJ (U+2060, which supersedes ZWNBSP as the non-breaking mark) anywhere in a sentence, ICU break iteration is disabled for the whole sentence.

2. Why limit this to Khmer? I suspect as a model it should work for any non-space broken text.
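
A minimal sketch of comment 1 in Python, under stated assumptions: the caller has already split the text into sentences, and `auto_break` is a hypothetical callable standing in for the ICU word breaker (it is not a real ICU or LibreOffice API). Stripping WJ before spell checking is also an assumption, on the grounds that it is not part of a word's spelling.

```python
import re

ZWSP = "\u200b"  # zero-width space: a manual word break
WJ = "\u2060"    # word joiner: a manual "do not break here"


def words_for_spelling(sentence, auto_break):
    """Return the words of one sentence to pass to the spell checker.

    auto_break is a hypothetical stand-in for the automatic (ICU) breaker.
    """
    # No manual marks anywhere in the sentence: trust the automatic breaker.
    if ZWSP not in sentence and WJ not in sentence:
        return auto_break(sentence)
    # Manual mode for the whole sentence: only ZWSP and real spaces
    # separate words. WJ is dropped so it never reaches the dictionary.
    pieces = re.split("[\u200b ]+", sentence.replace(WJ, ""))
    return [p for p in pieces if p]
```

Nothing here is specific to Khmer, which is the point of comment 2: the same rule applies to any text that is not broken by spaces.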

Yours,
Martin

> 
> 2) Disable use of ICU breakiterator for Khmer spell checking by default,
> but allow users to enable it by adding a check-box to enable ICU
> breakiterator in the Tools > Options > Language Settings > Writing Aids >
> Options dialogue when a Khmer Hunspell dictionary is present (
> http://extensions.libreoffice.org/extension-center/khmer-spelling-checker-sbbic-version
>  ).
> 
> 3) Disable use of ICU breakiterator for Khmer spell checking until the ICU
> breakiterator for Khmer is more accurate.
> 
> Currently, with the ICU breakiterator for Khmer enabled in LibreOffice 3.6
> it causes a lot of spelling errors to go unnoticed since the ICU
> breakiterator breaks words up incorrectly. So hopefully we can find a
> solution that will work with the current ICU breakiterator - though with
> ICU 50.1 the breakiterator for Khmer will have some improvements. But I do
> feel if solution 1 or 2 (or if someone else has better ideas) cannot
> be implemented the breakiterator for spelling with Khmer should be turned
> off in LibreOffice until the ICU breakiterator for Khmer is more accurate.
> 
> 
> Thanks again for your help and time, your input is greatly appreciated!
> 
> Sincerely,
> 
> Nathan
> 
> 
> 
> On Thu, Jul 26, 2012 at 4:33 PM, Martin Hosken <martin_hosken at sil.org> wrote:
> 
> > Dear All,
> >
> > > > An automatic word and line breaker is very necessary for Khmer and
> > > > Thai because traditionally they have no spaces between words, and so
> > > > line-breaking and spell checking require the use of a zero-width space
> > > > between words which is counterintuitive for most native speakers, and
> > > > so spell checking goes widely unused.
> >
> > I agree that automatic word breaking is a good thing and I am relieved to
> > see that libreoffice does it based on language selection and not on
> > automatic language guessing based on scripts. There are more languages that
> > use Thai script and Khmer script than just Thai and Khmer. So one of my
> > fears is already alleviated :)
> >
> > > > But now with the ICU code you implemented, Thai and Khmer can be
> > > > automatically broken, and the results are quite good. But with its
> > > > implementation in the real world, I have found some issues that I
> > > > wanted to raise and also suggest possible solutions. I write this as
> > > > an end-user, not so much as a programmer, nor do I claim to fully
> > > > understand the inner-workings of ICU and LibreOffice (because I don't!
> > > > ).
> > > >
> > > > First, I will do my best to explain the current results of the ICU
> > > > break iterator with Khmer:
> > > >
> > > > Input sentence: មានប្រាជ្ញាឈ្លាសវៃឈ្មោះសិវកឥវលិយៈ
> > > >
> > > > Current ICU line-breaking: មាន|ប្រាជ្ញាឈ្លាស|វៃ|ឈ្មោះ|សិវ|កឥ|វលិ|យៈ
> > > >
> > > > Compared with the sentence manually broken: មាន|ប្រាជ្ញា|ឈ្លាសវៃ|
> > > > ឈ្មោះ|សិវកឥវលិយៈ
> > > >
> > > > The differences should be clear – the ICU break iterator does not
> > > > break the words with 100% accuracy.
> > > >
> > > > One possible solution to this issue is by how the ICU Break Iterator
> > > > interacts with zero-width spaces (U+200B) in LibreOffice. Before ICU
> > > > code was enabled to automatically break Khmer, if an end-user wanted
> > > > to spell check Khmer, they had to manually place U+200B characters to
> > > > separate words. This solution worked quite well, but was
> > > > counterintuitive to most native speakers, because Khmer has no spaces
> > > > (as stated before). But with this solution, an end-user could be sure
> > > > that their document was broken with 100% accuracy, if there was no
> > > > human error (something automatic solutions cannot do – it is more
> > > > along the lines of 80% accurate). What I propose, is that the break
> > > > iterator code in LibreOffice looks for U+200B characters in a given
> > > > string and considers them as a sign to NOT automatically break, but to
> > > > allow the end-user full control to manually break words. Let me
> > > > explain:
> > > >
> > > >      1. The code starts processing the text and automatically breaking
> > > >         it until it comes across a U+200B character. If one is found,
> > > >         it searches to see if there are any additional U+200B or
> > > >         U+0020 characters in the following 20 characters (or so), and
> > > >         if there are, the break iterator skips over those characters
> > > >         and starts again from the second U+200B character (or U+0020,
> > > >         but a U+0020 character would only signify the “close” of the
> > > >         manual break because sometimes a phrase will end and there
> > > >         will be an actual space – so if the word that the user wants
> > > >         to manually break has a “real” U+0020 space at the end of it,
> > > >         then the user does not need to put an additional U+200B
> > > >         character to close it) which then repeats, looking for U+200B
> > > >         characters etc.
> > > >
> > > >      2. This would allow end-users to choose to manually break their
> > > >         whole document so they can have precise control, as well as
> > > >         allow end-users to place U+200B characters around names of
> > > >         people, places or transliterations in order to tell the break
> > > >         iterator to not try to break those words.
> >
> > In principle I like this approach. I like the idea of being able to force
> > breaks and non-breaks. But I don't think we are quite there with this
> > solution yet. Here are my difficulties with it:
> >
> > 1. Use of U+2060 makes string searching and spell checking harder (unless
> > WJ chars are stripped for searching and spell checking). They are not part
> > of the spelling of a word, so their introduction in the underlying text
> > stream is problematic for other text processing processes (like searching
> > as mentioned). This is less of an issue for U+200B ZWSP because that occurs
> > between words and searching across word boundaries is a rarer activity.
> > Likewise spell checking across word boundaries isn't really needed.
> >
> > 2. How do we come up with the range of what is considered a word between
> > two zwsp chars as opposed to two words? How close to the end of a string
> > must a zwsp occur to disable all breaking before the end of the string?
> > Does "abcdef<zwsp>uvwxyz" block all breaks in the string? I think we need
> > to think harder (deeper) about the use of zwsp in this way and see if we
> > can come up with something with a little less ambiguity. Having said that,
> > I think we are going to have to think really hard, because I don't think
> > this is an easy problem.
> >
> > > >      4. I then notice that "ម្នាក់ទៀត" line breaks together (since the
> > > >         automatic line-breaking breaks them as one word). And I decide
> > > >         I would rather line-break after “ម្នាក់” rather than have both
> > > >         words break connected to each other, so I place a zero-width
> > > >         space between the words:
> > > >         មានប្រាជ្ញាឈ្លាសវៃឈ្មោះ<zw>សិវកឥវលិយៈ<sp>អ្នកប្រាជ្ញ
> > > >         ម្នាក់<zw>ទៀតដែលល្បីល្បាញជាងគេ
> > > >         the automatic break iterator comes to the zero width space and
> > > >         then stops automatically breaking and looks ahead to see if
> > > >         there is a zero-width space or a “real” space within 20
> > > >         characters (this number might need refining, but I think 20
> > > >         characters would be enough). As there are no zero-width or
> > > >         “real” spaces within 20 characters, the break iterator then
> > > >         goes back to the previous zero-width and starts breaking
> > > >         starting from the zero-width character.
> >
> > Now what happens if I want to put zw around a word that occurs < 20 chars
> > after my last zw? The on/off nature of the zw has now been inverted. One
> > option is to say that zw must always occur in pairs and you would have to
> > bracket your first or second word there. But then management of which zw is
> > on and which is off will get confusing for users.
> >
> > An alternative model is to weight breakpoints. An explicit breakpoint
> > weighs more highly than an automatically generated one. Then when it comes
> > to line breaking the weight of a breakpoint counts towards its choice as
> > the actual break. For example, say an explicit break weighs 2 and an
> > automatic one weighs 1. Then we might use a square rule for distance and
> > say: an explicit break is preferred if it occurs closer to the end of a
> > line than 4x the distance to the last automatic break on the line. Or
> > somesuch.
> >
> > Yours,
> > Martin
> >
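
The weighted-breakpoint model in the quoted message above can be sketched roughly as follows. This is only one reading of the "square rule", not an agreed design: distance from the end of the line is divided by the square of the weight, so with weights 2 and 1 an explicit break wins when it lies closer to the line end than 4x the distance of the last automatic break. All names (`BreakPoint`, `choose_break`, the weight constants) are hypothetical.

```python
from dataclasses import dataclass

AUTOMATIC = 1  # weight of a break proposed by the automatic breaker
EXPLICIT = 2   # weight of a break the user marked (e.g. with ZWSP)


@dataclass(frozen=True)
class BreakPoint:
    offset: int  # character offset at which the line may be broken
    weight: int  # AUTOMATIC or EXPLICIT


def choose_break(points, line_end):
    """Pick the preferred break at or before line_end, or None."""
    usable = [p for p in points if p.offset <= line_end]
    if not usable:
        return None
    # Square rule: score = distance to line end / weight squared.
    # Lowest score wins; on an exact tie the lower weight (automatic)
    # wins, because the explicit break must be strictly closer than 4x.
    return min(
        usable,
        key=lambda p: ((line_end - p.offset) / p.weight ** 2, p.weight),
    )
```

For example, with a line limit of 40 and an automatic break at offset 35 (distance 5), an explicit break at offset 25 (distance 15, under 20) is chosen, while an explicit break at offset 10 (distance 30) loses to the automatic one.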