Thanks Martin,<div> </div><div><div> </div><div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">1. If you are shutting off the ICU breakiterator for text following, we should probably also do it for text preceding. Thus if there is a ZWSP or ZWNBSP (U+2060 WJ) anywhere in a text then ICU break iteration is disabled for the whole sentence.</blockquote> <div> </div><div>Yes, I think you are right. If a ZWSP or ZWNBSP is detected then ICU break iteration should be disabled for the whole sentence.</div><div> </div> <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"> 2. Why limit this to Khmer? I suspect as a model it should work for any non-space broken text.</blockquote></div><div> </div><div>I am only limiting it to Khmer because that is my expertise and I didn't want to cause problems for other languages, but it is possible these changes would also benefit other languages whose words are not separated by spaces (like Thai).</div> <div> </div><div> </div><div>Thanks,</div><div>Nathan <div class="gmail_quote">On Thu, Sep 27, 2012 at 11:45 AM, Martin Hosken <<a href="mailto:martin_hosken@sil.org" target="_blank">martin_hosken@sil.org</a>> wrote: <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Dear Nathan, <div class="im"> > Here are some new ideas, ordered by desirability, with number one being the > most desired, to number three being the least. > > 1) When a zero-width space is detected (U+200B), shut off ICU breakiterator > for Khmer spell checking for characters following the zero-width space > until it encounters a real space (U+0020) or end of sentence (detect end of > sentence using ICU Sentence Boundary). </div>I think this is a good direction to head. I have two follow-on comments: 1. 
If you are shutting off the ICU breakiterator for text following, we should probably also do it for text preceding. Thus if there is a ZWSP or ZWNBSP (U+2060 WJ) anywhere in a text then ICU break iteration is disabled for the whole sentence. 2. Why limit this to Khmer? I suspect as a model it should work for any non-space broken text. Yours, Martin <div class="HOEnZb"><div class="h5"> > > 2) Disable use of ICU breakiterator for Khmer spell checking by default, > but allow users to enable it by adding a check-box to enable ICU > breakiterator in the Tools > Options > Language Settings > Writing Aids > > Options dialogue when a Khmer Hunspell dictionary is present ( > <a href="http://extensions.libreoffice.org/extension-center/khmer-spelling-checker-sbbic-version" target="_blank">http://extensions.libreoffice.org/extension-center/khmer-spelling-checker-sbbic-version</a> > ). > > 3) Disable use of ICU breakiterator for Khmer spell checking until the ICU > breakiterator for Khmer is more accurate. > > Currently, with the ICU breakiterator for Khmer enabled in LibreOffice 3.6 > it causes a lot of spelling errors to go unnoticed since the ICU > breakiterator breaks words up incorrectly. So hopefully we can find a > solution that will work with the current ICU breakiterator - though with > ICU 50.1 the breakiterator for Khmer will have some improvements. But I do > feel if solution 1 or 2 (or if someone else has better ideas) cannot > be implemented the breakiterator for spelling with Khmer should be turned > off in LibreOffice until the ICU breakiterator for Khmer is more accurate. > > > Thanks again for your help and time, your input is greatly appreciated! 
> > Sincerely, > > Nathan > > > > On Thu, Jul 26, 2012 at 4:33 PM, Martin Hosken <<a href="mailto:martin_hosken@sil.org">martin_hosken@sil.org</a>>wrote: > > > Dear All, > > > > > > An automatic word and line breaker is very necessary for Khmer and > > > > Thai because traditionally they have no spaces between words, and so > > > > line-breaking and spell checking require the use of a zero-width space > > > > between words which is counterintuitive for most native speakers, and > > > > so spell checking goes widely unused. > > > > I agree that automatic word breaking is a good thing and I am relieved to > > see that libreoffice does it based on language selection and not on > > automatic language guessing based on scripts. There are more languages that > > use Thai script and Khmer script than just Thai and Khmer. So one of my > > fears is already alleviated :) > > > > > > But now with the ICU code you implemented, Thai and Khmer can be > > > > automatically broken, and the results are quite good. But with its > > > > implementation in the real world, I have found some issues that I > > > > wanted to raise and also suggest possible solutions. I write this as > > > > an end-user, not so much as a programmer, nor do I claim to fully > > > > understand the inner-workings of ICU and LibreOffice (because I don't! > > > > ). > > > > > > > > First, I will do my best to explain the current results of the ICU > > > > break iterator with Khmer: > > > > > > > > Input sentence: មានប្រាជ្ញាឈ្លាសវៃឈ្មោះសិវកឥវលិយៈ > > > > > > > > Current ICU line-breaking: មាន|ប្រាជ្ញាឈ្លាស|វៃ|ឈ្មោះ|សិវ|កឥ|វលិ|យៈ > > > > > > > > Compared with the sentence manually broken: មាន|ប្រាជ្ញា|ឈ្លាសវៃ| > > > > ឈ្មោះ|សិវកឥវលិយៈ > > > > > > > > The differences should be clear – the ICU break iterator does not > > > > break the words with 100% accuracy. > > > > > > > > One possible solution to this issue is by how the ICU Break Iterator > > > > interacts with zero-width spaces (U+200B) in LibreOffice. 
Before ICU > > > > code was enabled to automatically break Khmer, if an end-user wanted > > > > to spell check Khmer, they had to manually place U+200B characters to > > > > separate words. This solution worked quite well, but was > > > > counterintuitive to most native speakers, because Khmer has no spaces > > > > (as stated before). But with this solution, an end-user could be sure > > > > that their document was broken with 100% accuracy, if there was no > > > > human error (something automatic solutions cannot do – it is more > > > > along the lines of 80% accurate). What I propose, is that the break > > > > iterator code in LibreOffice looks for U+200B characters in a given > > > > string and considers them as a sign to NOT automatically break, but to > > > > allow the end-user full control to manually break words. Let me > > > > explain: > > > > > > > > 1. The code starts processing the text and automatically breaking > > > > it until it comes across a U+200B character. If one is found, > > > > it searches to see if there are any additional U+200B or U > > > > +0020 characters in the following 20 characters (or so), and > > > > if there are, the break iterator skips over those characters > > > > and starts again from the second U+200B character (or U+0020, > > > > but a U+0020 character would only signify the “close” of the > > > > manual break because sometimes a phrase will end and there > > > > will be an actual space – so if the word that the user wants > > > > to manually break has a “real” U+0020 space at the end of it, > > > > then the user does not need to put an additional U+200B > > > > character to close it) which then repeats, looking for U+200B > > > > characters etc. > > > > > > > > 2. 
This would allow end-users to choose to manually break their > > > > whole document so they can have precise control, as well as > > > > allow end-users to place U+200B characters around names of > > > > people, places or transliterations in order to tell the break > > > > iterator to not try to break those words. > > > > In principle I like this approach. I like the idea of being able to force > > breaks and non-breaks. But I don't think we are quite there with this > > solution yet. Here are my difficulties with it: > > > > 1. use of U+2060 makes string searching and spell checking harder (unless > > WJ chars are stripped for searching and spell checking). They are not part > > of the spelling of a word, so their introduction in the underlying text > > stream is problematic for other text processing processes (like searching > > as mentioned). This is less of an issue for U+200B ZWSP because that occurs > > between words and searching across word boundaries is a rarer activity. > > Likewise spell checking across word boundaries isn't really needed. > > > > 2. How do we come up with the range of what is considered a word between > > two zwsp chars as opposed to two words? How close to the end of a string > > must a zwsp occur to disable all breaking before the end of the string? > > does "abcdef<zwsp>uvwxyz" block all breaks in the string? I think we need > > to think harder (deeper) about the use of zwsp in this way and see if we > > can come up with something with a little less ambiguity. Having said that, > > I think we are going to have to think really hard, because I don't think > > this is an easy problem. > > > > > > 4. I then notice that "ម្នាក់ទៀត" line breaks together (since the > > > > automatic line-breaking breaks them as one word. 
And I decide > > > > I would rather line-break after “ម្នាក់” rather than have both > > > > words break connected to each other, so I place a zero-width > > > > space between the words: > > > > មានប្រាជ្ញាឈ្លាសវៃឈ្មោះ<zw>សិវកឥវលិយៈ<sp>អ្នកប្រាជ្ញ > > > > ម្នាក់<zw>ទៀតដែលល្បីល្បាញជាងគេ > > > > the automatic break iterator comes to the zero width space and > > > > then stops automatically breaking and looks ahead to see if > > > > there is a zero-width space or a “real” space within 20 > > > > characters (this number might need refining, but I think 20 > > > > characters would be enough). As there are no zero-width or > > > > “real” spaces within 20 characters, the break iterator then > > > > goes back to the previous zero-width and starts breaking > > > > starting from the zero-width character. > > > > Now what happens if I want to put zw around a word that occurs < 20 chars > > after my last zw? The on/off nature of the zw has now been inverted. One > > option is to say that zw must always occur in pairs and you would have to > > bracket your first or second word there. But then management of which zw is > > on and which is off will get confusing for users. > > > > An alternative model is to weight breakpoints. An explicit breakpoint > > weighs more highly than an automatically generated one. Then when it comes > > to line breaking the weight of a breakpoint counts towards its choice as to > > the actual break. For example if we say an explicit break is 2 and an > > automatic is 1. Then we might use a square rule for distance and say: an > > explicit break is preferred if it occurs closer to the end of a line than > > 4x the distance to the last automatic break on the line. Or somesuch. > > > > Yours, > > Martin > > </div></div></blockquote></div> </div></div>
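To make the rule agreed at the top of this thread concrete, here is a minimal sketch in Python. It is not LibreOffice or ICU code. It assumes the caller has already split the text into sentences (for example with ICU's sentence boundary analysis), and `auto_break` is a hypothetical stand-in for whatever automatic word breaker is in use. Stripping U+2060 before handing words to the spell checker is also an assumption, taken from Martin's remark that WJ is not part of a word's spelling.

```python
import re

ZWSP = "\u200b"  # ZERO WIDTH SPACE: explicit, user-placed word break
WJ = "\u2060"    # WORD JOINER: explicit "do not break here" marker
SPACE = " "      # real space (U+0020)


def split_words(sentence, auto_break):
    """Split one sentence into words for spell checking.

    `auto_break` is a hypothetical callable (str -> list of str) standing in
    for an automatic, dictionary-based breaker such as ICU's break iterator.
    """
    if ZWSP in sentence or WJ in sentence:
        # Manual mode: the user has taken control of this sentence, so
        # break only at explicit ZWSP characters and real spaces.
        words = []
        for chunk in re.split("[\u200b ]", sentence):
            chunk = chunk.replace(WJ, "")  # assumption: WJ is not part of the spelling
            if chunk:
                words.append(chunk)
        return words

    # Automatic mode: no explicit markers anywhere in the sentence, so
    # every space-delimited run goes through the automatic breaker.
    words = []
    for chunk in sentence.split(SPACE):
        if chunk:
            words.extend(auto_break(chunk))
    return words


def toy_breaker(run):
    """Toy breaker for demonstration only: cuts a run into 3-character pieces."""
    return [run[i:i + 3] for i in range(0, len(run), 3)]
```

With this rule, Martin's example `"abcdef<zwsp>uvwxyz"` comes out as the two words `abcdef` and `uvwxyz` with no automatic breaks inside either, which removes the ambiguity of the 20-character look-ahead at the cost of requiring users to mark every break in any sentence they touch.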