Adding Extension for Experimental Thai Spelling

Wed Sep 26 21:52:26 PDT 2012

Thanks Martin,

> 1. If you are shutting off the ICU breakiterator for text following, we
> should probably also do it for text preceding. Thus if there is a ZWSP or
> ZWNBSP (U+2060 WJ) anywhere in a text then ICU break iteration is disabled
> for the whole sentence.

Yes, I think you are right. If a ZWSP or ZWNBSP is detected then ICU break
iteration should be disabled for the whole sentence.

> 2. Why limit this to Khmer? I suspect as a model it should work for any
> non-space broken text.

I am only limiting it to Khmer because that is my expertise and I didn't
want to cause problems for other languages - but it is possible these
changes would be beneficial for other languages that are not broken by
spaces (like Thai).
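
To make the rule above concrete, here is a minimal sketch in Python (not
the actual LibreOffice code). It assumes the caller has already split the
text into sentences, and that auto_break is a hypothetical stand-in for
the ICU word break iterator. The names words_for_spelling and
NO_SPACE_LANGS are made up for illustration.

```python
ZWSP = "\u200b"          # ZERO WIDTH SPACE: manual word break
WJ = "\u2060"            # WORD JOINER: manual "do not break here"
NO_SPACE_LANGS = {"km"}  # assumption: could later grow to include "th" etc.


def words_for_spelling(sentence, lang, auto_break):
    """Return the words of one sentence to hand to the spell checker.

    auto_break is a hypothetical callable standing in for the ICU word
    break iterator: it takes a string and returns a list of words.
    """
    has_manual_marks = ZWSP in sentence or WJ in sentence
    if lang in NO_SPACE_LANGS and has_manual_marks:
        # A manual mark anywhere disables automatic breaking for the
        # whole sentence: only ZWSP and real spaces separate words.
        text = sentence.replace(WJ, "")
        return text.replace(ZWSP, " ").split()
    # No manual marks: split on real spaces, then break each run
    # automatically.
    words = []
    for chunk in sentence.split():
        words.extend(auto_break(chunk))
    return words
```

Extending this to Thai would then only mean adding its language code to
NO_SPACE_LANGS, which is why I think the model could work more widely.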

Thanks,
Nathan

On Thu, Sep 27, 2012 at 11:45 AM, Martin Hosken <martin_hosken at sil.org> wrote:

> Dear Nathan,
>
> > Here are some new ideas, ordered by desirability, from number one (the
> > most desired) to number three (the least).
> >
> > 1) When a zero-width space (U+200B) is detected, shut off the ICU
> > breakiterator for Khmer spell checking for the characters following the
> > zero-width space until it encounters a real space (U+0020) or the end of
> > the sentence (detected using the ICU Sentence Boundary).
>
> I think this is a good direction to head. I have two follow-on comments:
>
> * 1. If you are shutting off the ICU breakiterator for text following, we
> should probably also do it for text preceding. Thus if there is a ZWSP or
> ZWNBSP (U+2060 WJ) anywhere in a text then ICU break iteration is disabled
> for the whole sentence.
>
> 2. Why limit this to Khmer? I suspect as a model it should work for any
> non-space broken text.*
>
> Yours,
> Martin
>
>
>
> >
> > 2) Disable use of ICU breakiterator for Khmer spell checking by default,
> > but allow users to enable it by adding a check-box to enable ICU
> > breakiterator in the Tools > Options > Language Settings > Writing Aids >
> > Options dialogue when a Khmer Hunspell dictionary is present
> > (http://extensions.libreoffice.org/extension-center/khmer-spelling-checker-sbbic-version).
> >
> > 3) Disable use of ICU breakiterator for Khmer spell checking until the
> > ICU breakiterator for Khmer is more accurate.
> >
> > Currently, with the ICU breakiterator for Khmer enabled in LibreOffice
> > 3.6, a lot of spelling errors go unnoticed because the ICU breakiterator
> > breaks words up incorrectly. So hopefully we can find a solution that
> > will work with the current ICU breakiterator, though with ICU 50.1 the
> > breakiterator for Khmer will have some improvements. But I do feel that
> > if solution 1 or 2 (or a better idea from someone else) cannot be
> > implemented, the breakiterator for spell checking Khmer should be turned
> > off in LibreOffice until the ICU breakiterator for Khmer is more
> > accurate.
> >
> >
> > Thanks again for your help and time, your input is greatly appreciated!
> >
> > Sincerely,
> >
> > Nathan
> >
> >
> >
> > On Thu, Jul 26, 2012 at 4:33 PM, Martin Hosken <martin_hosken at sil.org> wrote:
> >
> > > Dear All,
> > >
> > > > > An automatic word and line breaker is very necessary for Khmer and
> > > > > Thai because traditionally they have no spaces between words, and
> > > > > so line-breaking and spell checking require the use of a zero-width
> > > > > space between words, which is counterintuitive for most native
> > > > > speakers, and so spell checking goes widely unused.
> > >
> > > I agree that automatic word breaking is a good thing and I am relieved
> > > to see that libreoffice does it based on language selection and not on
> > > automatic language guessing based on scripts. There are more languages
> > > that use Thai script and Khmer script than just Thai and Khmer. So one
> > > of my fears is already alleviated :)
> > >
> > > > > But now with the ICU code you implemented, Thai and Khmer can be
> > > > > automatically broken, and the results are quite good. But with its
> > > > > implementation in the real world, I have found some issues that I
> > > > > wanted to raise and also suggest possible solutions. I write this
> > > > > as an end-user, not so much as a programmer, nor do I claim to
> > > > > fully understand the inner workings of ICU and LibreOffice
> > > > > (because I don't!).
> > > > >
> > > > > First, I will do my best to explain the current results of the ICU
> > > > > break iterator with Khmer:
> > > > >
> > > > > Input sentence: មានប្រាជ្ញាឈ្លាសវៃឈ្មោះសិវកឥវលិយៈ
> > > > >
> > > > > Current ICU line-breaking: មាន|ប្រាជ្ញាឈ្លាស|វៃ|ឈ្មោះ|សិវ|កឥ|វលិ|យៈ
> > > > >
> > > > > Compared with the sentence manually broken: មាន|ប្រាជ្ញា|ឈ្លាសវៃ|
> > > > > ឈ្មោះ|សិវកឥវលិយៈ
> > > > >
> > > > > The differences should be clear – the ICU break iterator does not
> > > > > break the words with 100% accuracy.
> > > > >
> > > > > One possible solution to this issue lies in how the ICU Break
> > > > > Iterator interacts with zero-width spaces (U+200B) in LibreOffice.
> > > > > Before ICU code was enabled to automatically break Khmer, an
> > > > > end-user who wanted to spell check Khmer had to manually place
> > > > > U+200B characters to separate words. This solution worked quite
> > > > > well, but was counterintuitive to most native speakers, because
> > > > > Khmer has no spaces (as stated before). But with this solution, an
> > > > > end-user could be sure that their document was broken with 100%
> > > > > accuracy, barring human error (something automatic solutions
> > > > > cannot do; they are more along the lines of 80% accurate). What I
> > > > > propose is that the break iterator code in LibreOffice looks for
> > > > > U+200B characters in a given string and considers them as a sign
> > > > > to NOT automatically break, but to allow the end-user full control
> > > > > to manually break words. Let me explain:
> > > > >
> > > > >      1. The code starts processing the text and automatically
> > > > >         breaking it until it comes across a U+200B character. If
> > > > >         one is found, it searches for any additional U+200B or
> > > > >         U+0020 character in the following 20 characters (or so).
> > > > >         If there is one, the break iterator skips over those
> > > > >         characters and starts again from the second U+200B (or
> > > > >         U+0020) character, then repeats, looking for the next
> > > > >         U+200B character, and so on. A U+0020 character only
> > > > >         signifies the “close” of a manual break: sometimes a
> > > > >         phrase ends with an actual space, so if the word that the
> > > > >         user wants to manually break has a “real” U+0020 space at
> > > > >         the end of it, the user does not need to add another
> > > > >         U+200B character to close it.
> > > > >
> > > > >      2. This would allow end-users to choose to manually break
> > > > >         their whole document so they can have precise control, as
> > > > >         well as allow end-users to place U+200B characters around
> > > > >         names of people, places or transliterations in order to
> > > > >         tell the break iterator not to try to break those words.
> > >
> > > In principle I like this approach. I like the idea of being able to
> > > force breaks and non-breaks. But I don't think we are quite there with
> > > this solution yet. Here are my difficulties with it:
> > >
> > > 1. Use of U+2060 makes string searching and spell checking harder
> > > (unless WJ chars are stripped for searching and spell checking). They
> > > are not part of the spelling of a word, so their introduction in the
> > > underlying text stream is problematic for other text processing (like
> > > searching, as mentioned). This is less of an issue for U+200B ZWSP
> > > because that occurs between words, and searching across word boundaries
> > > is a rarer activity. Likewise, spell checking across word boundaries
> > > isn't really needed.
> > >
> > > 2. How do we come up with the range of what is considered a word
> > > between two zwsp chars as opposed to two words? How close to the end of
> > > a string must a zwsp occur to disable all breaking before the end of the
> > > string? Does "abcdef<zwsp>uvwxyz" block all breaks in the string? I
> > > think we need to think harder (deeper) about the use of zwsp in this way
> > > and see if we can come up with something with a little less ambiguity.
> > > Having said that, I think we are going to have to think really hard,
> > > because I don't think this is an easy problem.
> > >
> > > > >      4. I then notice that "ម្នាក់ទៀត" line breaks together (since
> > > > >         the automatic line-breaking breaks them as one word). I
> > > > >         decide I would rather line-break after “ម្នាក់” than have
> > > > >         both words break connected to each other, so I place a
> > > > >         zero-width space between the words:
> > > > >         មានប្រាជ្ញាឈ្លាសវៃឈ្មោះ<zw>សិវកឥវលិយៈ<sp>អ្នកប្រាជ្ញ
> > > > >         ម្នាក់<zw>ទៀតដែលល្បីល្បាញជាងគេ
> > > > >         The automatic break iterator comes to the zero-width space,
> > > > >         stops automatically breaking, and looks ahead to see if
> > > > >         there is a zero-width space or a “real” space within 20
> > > > >         characters (this number might need refining, but I think 20
> > > > >         characters would be enough). As there are no zero-width or
> > > > >         “real” spaces within 20 characters, the break iterator then
> > > > >         goes back to the previous zero-width space and resumes
> > > > >         breaking from that character.
> > >
> > > Now what happens if I want to put zw around a word that occurs < 20
> > > chars after my last zw? The on/off nature of the zw has now been
> > > inverted. One option is to say that zw must always occur in pairs and
> > > you would have to bracket your first or second word there. But then
> > > management of which zw is on and which is off will get confusing for
> > > users.
> > >
> > > An alternative model is to weight breakpoints. An explicit breakpoint
> > > weighs more highly than an automatically generated one. Then when it
> > > comes to line breaking, the weight of a breakpoint counts towards its
> > > choice as the actual break. For example, say an explicit break is 2 and
> > > an automatic one is 1. Then we might use a square rule for distance and
> > > say: an explicit break is preferred if it occurs closer to the end of a
> > > line than 4x the distance to the last automatic break on the line. Or
> > > somesuch.
> > >
> > > Yours,
> > > Martin
> > >
>
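
The weighted-breakpoint idea at the end of the quoted message could be
sketched roughly as follows. This is only one reading of the "square
rule": each candidate break is scored as weight squared divided by its
distance from the end of the line, so with weights 2 and 1 an explicit
break wins while it is less than about 4x as far from the line end as
the last automatic break. The function name, the (position, weight)
representation and the +1 in the distance are assumptions made for the
illustration, not anything LibreOffice or ICU provides.

```python
EXPLICIT_WEIGHT = 2   # break the user placed by hand (e.g. a ZWSP)
AUTOMATIC_WEIGHT = 1  # break proposed by the automatic break iterator


def choose_line_break(candidates, line_end):
    """Pick where to break a line that must end at or before line_end.

    candidates is a list of (position, weight) pairs. Returns the chosen
    position, or None if no candidate fits on the line.
    """
    best_pos, best_score = None, -1.0
    for pos, weight in candidates:
        if pos > line_end:
            continue  # this break would overflow the line
        distance = line_end - pos + 1  # +1 avoids dividing by zero
        score = weight ** 2 / distance
        if score > best_score:
            best_pos, best_score = pos, score
    return best_pos
```

For example, with a line ending at column 80, an explicit break at
column 70 is preferred over an automatic break at column 78, but an
explicit break at column 60 is not.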