Thanks for your reply.<div> </div><div>Yes, a "view->word boundaries" mode would be very helpful (or even extending the current "view->field shading" to show gray marks at the automatic ICU break positions, so that users can see what is being done). Would this be hard to implement?</div> <div> </div><div>Also, we are making some changes to the ICU break iterator dictionary for Khmer - and I've heard there will be some changes in ICU 50 which should improve results for Khmer.</div><div> </div><div> If anyone has any ideas, they would be appreciated.</div><div> </div><div>Thanks!</div><div>Nathan</div><div> <div class="gmail_quote">On Wed, Jul 25, 2012 at 8:41 PM, Caolán McNamara <<a href="mailto:caolanm@redhat.com" target="_blank">caolanm@redhat.com</a>> wrote: <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">I'll cc this to the list if you don't mind, in order to archive it. I have no immediate great ideas. But I wonder if a "view->word boundaries" mode would be helpful, i.e. something that indicates the boundaries of the words that the software thinks exist. <div><div class="h5"> On Sun, 2012-07-15 at 21:40 +0700, Nathan Wells wrote: > > I hope you don't mind if I write and ask some more questions and ask > for additional help in making the break iterator more functional in > LibreOffice. Thank you again for your help implementing ICU for Khmer > in LibreOffice. I downloaded a recent beta build with your code > implemented and did some testing – it is great! But it also brought to > my attention some issues that hamper the usability of the automatic > breaking for Khmer (and I also believe for Thai – see this discussion > - > <a href="http://www.thaivisa.com/forum/topic/444360-thai-in-openoffice-on-ubuntu-lucid-lynx/#entry5160455" target="_blank">http://www.thaivisa.com/forum/topic/444360-thai-in-openoffice-on-ubuntu-lucid-lynx/#entry5160455</a>). 
> > > An automatic word and line breaker is very necessary for Khmer and > Thai because traditionally they have no spaces between words, and so > line-breaking and spell checking require the use of a zero-width space > between words which is counterintuitive for most native speakers, and > so spell checking goes widely unused. > But now with the ICU code you implemented, Thai and Khmer can be > automatically broken, and the results are quite good. But with its > implementation in the real world, I have found some issues that I > wanted to raise and also suggest possible solutions. I write this as > an end-user, not so much as a programmer, nor do I claim to fully > understand the inner workings of ICU and LibreOffice (because I don't! > ). > > First, I will do my best to explain the current results of the ICU > break iterator with Khmer: > > Input sentence: មានប្រាជ្ញាឈ្លាសវៃឈ្មោះសិវកឥវលិយៈ > > Current ICU line-breaking: មាន|ប្រាជ្ញាឈ្លាស|វៃ|ឈ្មោះ|សិវ|កឥ|វលិ|យៈ > > Compared with the sentence manually broken: មាន|ប្រាជ្ញា|ឈ្លាសវៃ| > ឈ្មោះ|សិវកឥវលិយៈ > > The differences should be clear – the ICU break iterator does not > break the words with 100% accuracy. > > But, obviously with a dictionary approach, no automatic word breaker > will ever break correctly 100% of the time. There is no solution that > will currently automatically break Thai or Khmer 100% correctly (I > have used Hidden Markov Model breakers, dictionary probability > breakers, and plain dictionary breakers – none work 100% of the time) > because, especially for names and places, words in Khmer can just defy > all rules and patterns. Perhaps in the future, a solution will arise > that can break Khmer words with 100% accuracy, but at this time, we > are far from any such solution. 
> > And this is an important reality to remember, because it > differentiates Thai and Khmer (and possibly other languages that do > not use spaces between words) from Western languages such as English, > where a line-breaker and word-breaker can be correct 100% of the time. > > As an end user, this inability of the ICU break iterator to break > Khmer words with 100% accuracy causes usability issues when it comes to > correcting the automatic breaks that are broken in error. > > Here are some reasons why: > </div></div>> 1. In LibreOffice a user cannot see where the words have been > broken; the breaks are invisible. > > 2. Therefore, trying to use a U+2060 (Word Joiner) to <div class="im">> correct an error in order to correctly spell check is very > difficult, because the user cannot see where to place the > joiner in order to join the word (as in the example case above > the word សិវ|កឥ|វលិ|យៈ actually needs three U+2060 characters > to join it to be treated as one word, but the end user does > not know this because the breaks are invisible). </div>FWIW with view->field shading on you should see a little gray mark where the word joiner exists. At least I do anyway. > 1. Even if LibreOffice were able to change their code so that the <div class="im">> end user could see the word-breaks, adding three U+2060 > characters is quite laborious just to fix one word so that it > can be spell checked correctly (as one word, rather than spell > checked as four individual words). > > > > One possible solution to this issue is by how the ICU Break Iterator > interacts with zero-width spaces (U+200B) in LibreOffice. Before ICU > code was enabled to automatically break Khmer, if an end-user wanted > to spell check Khmer, they had to manually place U+200B characters to > separate words. This solution worked quite well, but was > counterintuitive to most native speakers, because Khmer has no spaces > (as stated before). 
But with this solution, an end-user could be sure > that their document was broken with 100% accuracy, if there was no > human error (something automatic solutions cannot do – it is more > along the lines of 80% accurate). What I propose is that the break > iterator code in LibreOffice looks for U+200B characters in a given > string and considers them as a sign to NOT automatically break, but to > allow the end-user full control to manually break words. Let me > explain: > </div>> 1. The code starts processing the text and automatically breaking <div class="im">> it until it comes across a U+200B character. If one is found, > it searches to see if there are any additional U+200B or > U+0020 characters in the following 20 characters (or so), and > if there are, the break iterator skips over those characters > and starts again from the second U+200B character (or U+0020, > but a U+0020 character would only signify the “close” of the > manual break because sometimes a phrase will end and there > will be an actual space – so if the word that the user wants > to manually break has a “real” U+0020 space at the end of it, > then the user does not need to put an additional U+200B > character to close it) which then repeats, looking for U+200B > characters etc. > </div>> 2. This would allow end-users to choose to manually break their <div class="im">> whole document so they can have precise control, as well as > allow end-users to place U+200B characters around names of > people, places or transliterations in order to tell the break > iterator to not try to break those words. > > > > An example of what it would look like (since I am not a programmer, I am > not sure if this would be the best way to add this feature, you > probably can come up with a better solution, but I thought by giving > this example it would explain what I mean better); <zw> is the > zero-width space U+200B character and <sp> is a normal space U+0020: > </div>> 1. 
I type: <div class="im">> មានប្រាជ្ញាឈ្លាសវៃឈ្មោះសិវកឥវលិយៈ<sp>អ្នកប្រាជ្ញម្នាក់ទៀតដែលល្បីល្បាញជាងគេ and since there are no zero-width spaces, the break iterator breaks the words automatically. > </div>> 2. After typing I know that the last word is the name of a <div class="im">> person, so I manually add a zero-width space before the word > (but not after since there is already a “real” space there): > មានប្រាជ្ញាឈ្លាសវៃឈ្មោះ<zw>សិវកឥវលិយៈ<sp>អ្នកប្រាជ្ញម្នាក់ទៀតដែលល្បីល្បាញជាងគេ > </div>> 3. As the break iterator processes this string, it breaks the <div class="im">> words automatically until it comes to the zero-width space, > and then stops automatically breaking, looking for a “closing” > zero-width space or “real” space. Finding a “closing” real space, it > then does not break the characters between the zero width > space and the “real” space. > </div>> 4. I then notice that "ម្នាក់ទៀត" line breaks together (since the <div class="im">> automatic line-breaking breaks them as one word). And I decide > I would rather line-break after “ម្នាក់” rather than have both > words break connected to each other, so I place a zero-width > space between the words: > មានប្រាជ្ញាឈ្លាសវៃឈ្មោះ<zw>សិវកឥវលិយៈ<sp>អ្នកប្រាជ្ញ > ម្នាក់<zw>ទៀតដែលល្បីល្បាញជាងគេ > the automatic break iterator comes to the zero width space and > then stops automatically breaking and looks ahead to see if > there is a zero-width space or a “real” space within 20 > characters (this number might need refining, but I think 20 > characters would be enough). As there are no zero-width or > “real” spaces within 20 characters, the break iterator then > goes back to the previous zero-width and starts breaking > starting from the zero-width character. > </div>> 5. Then I notice that "ជាងគេ" line-break together, and I want > them to break separately. 
So I add a zero-width space on either > side of the word "ជាង": មានប្រាជ្ញាឈ្លាសវៃឈ្មោះ<zw>សិវកឥ > វលិយៈ<sp>អ្នកប្រាជ្ញម្នាក់<zw>ទៀតដែល > ល្បីល្បាញ<zw>ជាង<zw>គេ > The automatic breaker then comes to the zero-width space > before "ជាង" and looks for a “closing” zero-width space or > normal space. Finding one, it does not break anything between > the two zero-width spaces, and begins breaking again after the > “closing” zero-width space. > > > I hope that isn't too much information, but I wanted to try and cover > every possibility to make sure the concept was clear. > > I noticed some code in the LibreOffice source that gave me the idea > that it might be possible to do what I just described – I found it > online here: > <a href="http://c-cpp.r3dcode.com/files/LibreOffice/3/4.5.2/libs-gui/i18npool/source/breakiterator/breakiterator_unicode.cxx" target="_blank">http://c-cpp.r3dcode.com/files/LibreOffice/3/4.5.2/libs-gui/i18npool/source/breakiterator/breakiterator_unicode.cxx</a> > >
> #define WJ 0x2060 // Word Joiner
>         GlueSpace=sal_False;
>         if (lbr.breakType == BreakType::WORDBOUNDARY) {
>             nStartPos = lbr.breakIndex;
>             if (Text[nStartPos--] == WJ)
>                 GlueSpace=sal_True;
>             while (nStartPos >= 0 &&
>                 (u_isWhitespace(Text.iterateCodePoints(&nStartPos, 0)) || Text[nStartPos] == WJ)) {
>                 if (Text[nStartPos--] == WJ)
>                     GlueSpace=sal_True;
>             }
>             if (GlueSpace && nStartPos < 0) {
>                 lbr.breakIndex = 0;
>                 break;
>             }
>         }
>     }
>
>     return lbr;
> }
> > Again, I am very grateful for your time and help in making LibreOffice > work better with Khmer, and I know your time is valuable, but I > thought I would try to see if you could provide additional help and > solutions for Khmer. If you don't have time to consider this, no > worries, we are already grateful for what you have already done. 
> > > Thanks, > > Nathan > > > On Fri, Jul 13, 2012 at 3:23 PM, Caolán McNamara <<a href="mailto:caolanm@redhat.com">caolanm@redhat.com</a>> > wrote: > On Thu, 2012-07-12 at 23:49 +0700, Nathan Wells wrote: > > > > > There was something similar done in the past IIRC to > > > pass around soft-page-break information so that export > filters > >> could know where the layout last put the page breaks. I > forget > >> the details of that though. > > > > This would be a very useful feature for Cambodians (and I > would assume > > Thai as well, although Thai tends to have more programs that > currently > > support wordbreaking already) - would it be best to seek to > do this > > with an extension rather than LibreOffice core? > > > Here's the spec for soft-page-breaks, which seems similar to > my mind > <a href="http://www.openoffice.org/specs/writer/SoftPageBreak/SoftPageBreak.odt" target="_blank">http://www.openoffice.org/specs/writer/SoftPageBreak/SoftPageBreak.odt</a> > This sort of thing would have to be basically implemented in > the core I > reckon. > > C. > > > > </blockquote></div> </div>
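<div>For reference, the look-ahead rule proposed in the quoted message (a zero-width space opens a manually controlled span if another zero-width space or a real space follows within about 20 characters) could be sketched roughly as below. This is only an illustration under stated assumptions, not LibreOffice or ICU code: the names findProtectedSpans and filterBreaks are hypothetical, the text is assumed to be already decoded to UTF-32 code points, and the automatic break positions are assumed to come from elsewhere (for example from the ICU break iterator) as indices into that text.</div>

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

constexpr char32_t ZWSP  = 0x200B;  // zero-width space: opens or closes a manual span
constexpr char32_t SPACE = 0x0020;  // ordinary space: may only close a manual span

// (open, close): index of the opening ZWSP and index of the closing ZWSP or space.
using Span = std::pair<std::size_t, std::size_t>;

// Hypothetical helper: find the spans the automatic breaker should leave alone.
// `window` is the look-ahead distance in code points (20 in the proposal).
std::vector<Span> findProtectedSpans(const std::u32string& text,
                                     std::size_t window = 20)
{
    std::vector<Span> spans;
    std::size_t i = 0;
    while (i < text.size()) {
        if (text[i] != ZWSP) {
            ++i;
            continue;
        }
        // Search the next `window` code points for a closing ZWSP or space.
        const std::size_t limit = std::min(text.size(), i + 1 + window);
        std::size_t close = i + 1;
        while (close < limit && text[close] != ZWSP && text[close] != SPACE)
            ++close;
        if (close < limit) {
            spans.emplace_back(i, close);
            // Resume at the closer: a closing ZWSP may open the next span,
            // while a closing space is simply stepped over by the loop.
            i = close;
        } else {
            // No closer in range: this ZWSP is just an ordinary manual break.
            ++i;
        }
    }
    return spans;
}

// Hypothetical helper: drop automatic break positions that fall strictly
// inside a protected span. A break index b means "break before text[b]".
std::vector<std::size_t> filterBreaks(const std::vector<std::size_t>& autoBreaks,
                                      const std::vector<Span>& spans)
{
    std::vector<std::size_t> kept;
    for (std::size_t b : autoBreaks) {
        bool inside = false;
        for (const Span& s : spans) {
            if (b > s.first + 1 && b < s.second) {
                inside = true;
                break;
            }
        }
        if (!inside)
            kept.push_back(b);
    }
    return kept;
}
```

<div>With a text such as ab<zw>xyz<zw>pq, the sketch marks xyz as protected, so automatic breaks proposed inside it are dropped while breaks at its edges are kept. One consequence of the rule as written: any short run that happens to lie between two manual zero-width spaces within the window is also treated as protected, so the window size would probably need tuning.</div>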