Thanks Martin,<div><br></div><div><div><br></div><div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">1. If you are shutting off the ICU breakiterator for text following, we should probably also do it for text preceding. Thus if there is a ZWSP or ZWNBSP (U+2060 WJ) anywhere in a text then ICU break iteration is disabled for the whole sentence.</blockquote>
<div><br></div><div>Yes, I think you are right. If a ZWSP or ZWNBSP is detected, then ICU break iteration should be disabled for the whole sentence.</div><div><br></div><br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
2. Why limit this to Khmer? I suspect as a model it should work for any non-space broken text.</blockquote></div><div><br></div><div>I am only limiting it to Khmer because that is my expertise and I didn't want to cause problems for other languages - but it is possible these changes would be beneficial for other languages that are not broken by spaces (like Thai).</div>
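<div><br></div><div>To make the whole-sentence rule concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not LibreOffice or ICU code: <code>segment_sentence</code> and <code>auto_break</code> are hypothetical names, <code>auto_break</code> stands in for whatever dictionary-based breaker (such as the ICU breakiterator) is in use, and the text is assumed to have been split into sentences already.</div>

```python
import re
from typing import Callable, List

ZWSP = "\u200b"  # ZERO WIDTH SPACE: explicit word break typed by the user
WJ = "\u2060"    # WORD JOINER: explicit "do not break here" mark


def segment_sentence(sentence: str,
                     auto_break: Callable[[str], List[str]]) -> List[str]:
    """Return the words of one sentence for spell checking.

    Assumption: if the sentence contains a ZWSP or a WJ anywhere, the user
    is breaking it by hand, so auto_break is not called for any part of it.
    """
    if ZWSP in sentence or WJ in sentence:
        # Manual mode: only ZWSP and real spaces separate words.
        pieces = re.split("[\u200b ]+", sentence)
        # WJ is not part of a word's spelling, so strip it before checking.
        stripped = [p.replace(WJ, "") for p in pieces]
        return [w for w in stripped if w]
    # No explicit marks: hand each space-delimited run to the automatic breaker.
    words: List[str] = []
    for run in sentence.split(" "):
        if run:
            words.extend(auto_break(run))
    return words
```

<div>With this rule, a single ZWSP placed anywhere in a sentence hands the user full control of that sentence, and the next sentence goes back to automatic breaking.</div>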
<div><br></div><div><br></div><div>Thanks,</div><div>Nathan<br><br><div class="gmail_quote">On Thu, Sep 27, 2012 at 11:45 AM, Martin Hosken <span dir="ltr"><<a href="mailto:martin_hosken@sil.org" target="_blank">martin_hosken@sil.org</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Dear Nathan,<br>
<div class="im"><br>
> Here are some new ideas, ordered by desirability, with number one being the<br>
> most desired, to number three being the least.<br>
><br>
> 1) When a zero-width space is detected (U+200B), shut off ICU breakiterator<br>
> for Khmer spell checking for characters following the zero-width space<br>
> until it encounters a real space (U+0020) or the end of the sentence (detect end of<br>
> sentence using ICU Sentence Boundary).<br>
<br>
</div>I think this is a good direction to head. I have two follow-on comments:<br>
<br><u>
1. If you are shutting off the ICU breakiterator for text following, we should probably also do it for text preceding. Thus if there is a ZWSP or ZWNBSP (U+2060 WJ) anywhere in a text then ICU break iteration is disabled for the whole sentence.<br>
<br>
2. Why limit this to Khmer? I suspect as a model it should work for any non-space broken text.</u><br>
<br>
Yours,<br>
Martin<br>
<div class="HOEnZb"><div class="h5"><br>
<br>
<br>
><br>
> 2) Disable use of ICU breakiterator for Khmer spell checking by default,<br>
> but allow users to enable it by adding a check-box to enable ICU<br>
> breakiterator in the Tools > Options > Language Settings > Writing Aids ><br>
> Options dialogue when a Khmer Hunspell dictionary is present (<br>
> <a href="http://extensions.libreoffice.org/extension-center/khmer-spelling-checker-sbbic-version" target="_blank">http://extensions.libreoffice.org/extension-center/khmer-spelling-checker-sbbic-version</a><br>
> ).<br>
><br>
> 3) Disable use of ICU breakiterator for Khmer spell checking until the ICU<br>
> breakiterator for Khmer is more accurate.<br>
><br>
> Currently, with the ICU breakiterator for Khmer enabled in LibreOffice 3.6<br>
> it causes a lot of spelling errors to go unnoticed since the ICU<br>
> breakiterator breaks words up incorrectly. So hopefully we can find a<br>
> solution that will work with the current ICU breakiterator - though with<br>
> ICU 50.1 the breakiterator for Khmer will have some improvements. But I do<br>
> feel if solution 1 or 2 (or if someone else has better ideas) cannot<br>
> be implemented the breakiterator for spelling with Khmer should be turned<br>
> off in LibreOffice until the ICU breakiterator for Khmer is more accurate.<br>
><br>
><br>
> Thanks again for your help and time, your input is greatly appreciated!<br>
><br>
> Sincerely,<br>
><br>
> Nathan<br>
><br>
><br>
><br>
> On Thu, Jul 26, 2012 at 4:33 PM, Martin Hosken <<a href="mailto:martin_hosken@sil.org">martin_hosken@sil.org</a>>wrote:<br>
><br>
> > Dear All,<br>
> ><br>
> > > > An automatic word and line breaker is very necessary for Khmer and<br>
> > > > Thai because traditionally they have no spaces between words, and so<br>
> > > > line-breaking and spell checking require the use of a zero-width space<br>
> > > > between words which is counterintuitive for most native speakers, and<br>
> > > > so spell checking goes widely unused.<br>
> ><br>
> > I agree that automatic word breaking is a good thing and I am relieved to<br>
> > see that libreoffice does it based on language selection and not on<br>
> > automatic language guessing based on scripts. There are more languages that<br>
> > use Thai script and Khmer script than just Thai and Khmer. So one of my<br>
> > fears is already alleviated :)<br>
> ><br>
> > > > But now with the ICU code you implemented, Thai and Khmer can be<br>
> > > > automatically broken, and the results are quite good. But with its<br>
> > > > implementation in the real world, I have found some issues that I<br>
> > > > wanted to raise and also suggest possible solutions. I write this as<br>
> > > > an end-user, not so much as a programmer, nor do I claim to fully<br>
> > > > understand the inner-workings of ICU and LibreOffice (because I don't!<br>
> > > > ).<br>
> > > ><br>
> > > > First, I will do my best to explain the current results of the ICU<br>
> > > > break iterator with Khmer:<br>
> > > ><br>
> > > > Input sentence: មានប្រាជ្ញាឈ្លាសវៃឈ្មោះសិវកឥវលិយៈ<br>
> > > ><br>
> > > > Current ICU line-breaking: មាន|ប្រាជ្ញាឈ្លាស|វៃ|ឈ្មោះ|សិវ|កឥ|វលិ|យៈ<br>
> > > ><br>
> > > > Compared with the sentence manually broken: មាន|ប្រាជ្ញា|ឈ្លាសវៃ|<br>
> > > > ឈ្មោះ|សិវកឥវលិយៈ<br>
> > > ><br>
> > > > The differences should be clear – the ICU break iterator does not<br>
> > > > break the words with 100% accuracy.<br>
> > > ><br>
> > > > One possible solution to this issue is by how the ICU Break Iterator<br>
> > > > interacts with zero-width spaces (U+200B) in LibreOffice. Before ICU<br>
> > > > code was enabled to automatically break Khmer, if an end-user wanted<br>
> > > > to spell check Khmer, they had to manually place U+200B characters to<br>
> > > > separate words. This solution worked quite well, but was<br>
> > > > counterintuitive to most native speakers, because Khmer has no spaces<br>
> > > > (as stated before). But with this solution, an end-user could be sure<br>
> > > > that their document was broken with 100% accuracy, if there was no<br>
> > > > human error (something automatic solutions cannot do – it is more<br>
> > > > along the lines of 80% accurate). What I propose, is that the break<br>
> > > > iterator code in LibreOffice looks for U+200B characters in a given<br>
> > > > string and considers them as a sign to NOT automatically break, but to<br>
> > > > allow the end-user full control to manually break words. Let me<br>
> > > > explain:<br>
> > > ><br>
> > > > 1. The code starts processing the text and automatically breaking<br>
> > > > it until it comes across a U+200B character. If one is found,<br>
> > > > it searches to see if there are any additional U+200B or U<br>
> > > > +0020 characters in the following 20 characters (or so), and<br>
> > > > if there are, the break iterator skips over those characters<br>
> > > > and starts again from the second U+200B character (or U+0020,<br>
> > > > but a U+0020 character would only signify the “close” of the<br>
> > > > manual break because sometimes a phrase will end and there<br>
> > > > will be an actual space – so if the word that the user wants<br>
> > > > to manually break has a “real” U+0020 space at the end of it,<br>
> > > > then the user does not need to put an additional U+200B<br>
> > > > character to close it) which then repeats, looking for U+200B<br>
> > > > characters etc.<br>
> > > ><br>
> > > > 2. This would allow end-users to choose to manually break their<br>
> > > > whole document so they can have precise control, as well as<br>
> > > > allow end-users to place U+200B characters around names of<br>
> > > > people, places or transliterations in order to tell the break<br>
> > > > iterator to not try to break those words.<br>
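<br>
The look-ahead rule in steps 1 and 2 above can be sketched in Python as follows. This is a rough illustration only, not LibreOffice or ICU code: segment, auto_words and auto_break are hypothetical names, auto_break stands in for the automatic breaker, and the fall-back to treating a lone U+200B as an ordinary break when nothing closes it within the window is an assumption about the intended behaviour.<br>

```python
from typing import Callable, List

ZWSP = "\u200b"  # zero-width space (U+200B)
SPACE = " "      # real space (U+0020)


def auto_words(text: str, auto_break: Callable[[str], List[str]]) -> List[str]:
    """Automatically break every space-delimited run of text."""
    words: List[str] = []
    for run in text.split(SPACE):
        if run:
            words.extend(auto_break(run))
    return words


def segment(text: str, auto_break: Callable[[str], List[str]],
            window: int = 20) -> List[str]:
    """Break text automatically, except for spans the user marked by hand."""
    words: List[str] = []
    i = 0
    while True:
        z = text.find(ZWSP, i)
        if z == -1:
            # No more manual marks: break the rest automatically.
            words.extend(auto_words(text[i:], auto_break))
            return words
        # Break everything before the ZWSP automatically.
        words.extend(auto_words(text[i:z], auto_break))
        # Look ahead for a closing ZWSP or real space within the window.
        look = text[z + 1:z + 1 + window]
        closers = [k for k in (look.find(ZWSP), look.find(SPACE)) if k != -1]
        if closers:
            k = min(closers)
            if look[:k]:
                words.append(look[:k])  # protected span, kept as one word
            # A closing ZWSP is re-examined as a possible opener;
            # a closing real space is simply consumed.
            i = z + 1 + k if look[k] == ZWSP else z + 2 + k
        else:
            # Nothing closes the span: treat the ZWSP as a plain break.
            i = z + 1
```

Because a closing U+200B is immediately treated as a possible opener for the next span, the choice of window size decides whether two nearby marks bracket one word or separate two words.<br>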
> ><br>
> > In principle I like this approach. I like the idea of being able to force<br>
> > breaks and non-breaks. But I don't think we are quite there with this<br>
> > solution yet. Here are my difficulties with it:<br>
> ><br>
> > 1. use of U+2060 makes string searching and spell checking harder (unless<br>
> > WJ chars are stripped for searching and spell checking). They are not part<br>
> > of the spelling of a word, so their introduction in the underlying text<br>
> > stream is problematic for other text processing processes (like searching<br>
> > as mentioned). This is less of an issue for U+200B ZWSP because that occurs<br>
> > between words and searching across word boundaries is a rarer activity.<br>
> > Likewise spell checking across word boundaries isn't really needed.<br>
> ><br>
> > 2. How do we come up with the range of what is considered a word between<br>
> > two zwsp chars as opposed to two words? How close to the end of a string<br>
> > must a zwsp occur to disable all breaking before the end of the string?<br>
> > Does "abcdef<zwsp>uvwxyz" block all breaks in the string? I think we need<br>
> > to think harder (deeper) about the use of zwsp in this way and see if we<br>
> > can come up with something with a little less ambiguity. Having said that,<br>
> > I think we are going to have to think really hard, because I don't think<br>
> > this is an easy problem.<br>
> ><br>
> > > > 4. I then notice that "ម្នាក់ទៀត" line breaks together (since the<br>
> > > > automatic line-breaking breaks them as one word). And I decide<br>
> > > > I would rather line-break after “ម្នាក់” rather than have both<br>
> > > > words break connected to each other, so I place a zero-width<br>
> > > > space between the words:<br>
> > > > មានប្រាជ្ញាឈ្លាសវៃឈ្មោះ<zw>សិវកឥវលិយៈ<sp>អ្នកប្រាជ្ញ<br>
> > > > ម្នាក់<zw>ទៀតដែលល្បីល្បាញជាងគេ<br>
> > > > the automatic break iterator comes to the zero width space and<br>
> > > > then stops automatically breaking and looks ahead to see if<br>
> > > > there is a zero-width space or a “real” space within 20<br>
> > > > characters (this number might need refining, but I think 20<br>
> > > > characters would be enough). As there are no zero-width or<br>
> > > > “real” spaces within 20 characters, the break iterator then<br>
> > > > goes back to the previous zero-width and starts breaking<br>
> > > > starting from the zero-width character.<br>
> ><br>
> > Now what happens if I want to put zw around a word that occurs < 20 chars<br>
> > after my last zw? The on off nature of the zw has now been inverted. One<br>
> > option is to say that zw must always occur in pairs and you would have to<br>
> > bracket your first or second word there. But then management of which zw is<br>
> > on and which is off will get confusing for users.<br>
> ><br>
> > An alternative model is to weight breakpoints. An explicit breakpoint<br>
> > weighs more highly than an automatically generated one. Then when it comes<br>
> > to line breaking the weight of a breakpoint counts towards its choice as to<br>
> > the actual break. For example if we say an explicit break is 2 and an<br>
> > automatic is 1. Then we might use a square rule for distance and say: an<br>
> > explicit break is preferred if it occurs closer to the end of a line than<br>
> > 4x the distance to the last automatic break on the line. Or somesuch.<br>
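<br>
One possible reading of the weighting idea above, as a short Python sketch. The names choose_break, EXPLICIT and AUTOMATIC are hypothetical, the cost function (distance from the end of the line divided by the square of the weight) is just one way to express the "square rule", and resolving exact ties in favour of the explicit break is an added assumption.<br>

```python
from typing import List, Tuple

EXPLICIT = 2   # weight of a break the user typed (e.g. U+200B)
AUTOMATIC = 1  # weight of a break proposed by the automatic breaker


def choose_break(breaks: List[Tuple[int, int]], line_end: int) -> int:
    """Pick where to break a line that must end at or before line_end.

    breaks holds (position, weight) pairs. Each candidate costs
    distance_to_line_end / weight**2, so with weights 2 and 1 an explicit
    break wins while it is closer to the line end than 4x the distance
    of the nearest automatic break.
    """
    candidates = [(pos, w) for pos, w in breaks if pos <= line_end]
    if not candidates:
        raise ValueError("no break opportunity fits on the line")
    # Lowest cost wins; on an exact tie the heavier break is preferred.
    best = min(candidates,
               key=lambda b: ((line_end - b[0]) / b[1] ** 2, -b[1]))
    return best[0]
```

For a line ending at column 40 with an automatic break at 38, an explicit break at 33 is chosen (distance 7 is under 4 x 2), while an explicit break at 31 loses to the automatic one (distance 9 is over 4 x 2).<br>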
> ><br>
> > Yours,<br>
> > Martin<br>
> ><br>
</div></div></blockquote></div><br></div></div>