Adding Extension for Experimental Thai Spelling

Wed Jul 25 06:41:54 PDT 2012

I'll cc this to the list if you don't mind, in order to archive it. I
have no immediate great ideas. But I wonder if a "view->word boundaries"
mode would be helpful, i.e. something that indicates the boundaries of
the words that the software thinks exist.

On Sun, 2012-07-15 at 21:40 +0700, Nathan Wells wrote:
> 
> I hope you don't mind if I write and ask some more questions and ask
> for additional help in making the break iterator more functional in
> LibreOffice. Thank you again for your help implementing ICU for Khmer
> in LibreOffice. I downloaded a recent beta build with your code
> implemented and did some testing – it is great! But it also brought to
> my attention some issues that hamper the usability of the automatic
> breaking for Khmer (and I also believe for Thai – see this discussion:
> http://www.thaivisa.com/forum/topic/444360-thai-in-openoffice-on-ubuntu-lucid-lynx/#entry5160455).
> 
> 
> An automatic word and line breaker is very necessary for Khmer and
> Thai because traditionally they have no spaces between words, and so
> line-breaking and spell checking require the use of a zero-width space
> between words which is counterintuitive for most native speakers, and
> so spell checking goes widely unused.
> But now with the ICU code you implemented, Thai and Khmer can be
> automatically broken, and the results are quite good. But with its
> implementation in the real world, I have found some issues that I
> wanted to raise and also suggest possible solutions. I write this as
> an end-user, not so much as a programmer, nor do I claim to fully
> understand the inner workings of ICU and LibreOffice (because I don't!).
> 
> First, I will do my best to explain the current results of the ICU
> break iterator with Khmer:
> 
> Input sentence: មានប្រាជ្ញាឈ្លាសវៃឈ្មោះសិវកឥវលិយៈ
> 
> Current ICU line-breaking: មាន|ប្រាជ្ញាឈ្លាស|វៃ|ឈ្មោះ|សិវ|កឥ|វលិ|យៈ
> 
> Compared with the sentence manually broken: មាន|ប្រាជ្ញា|ឈ្លាសវៃ|
> ឈ្មោះ|សិវកឥវលិយៈ
> 
> The differences should be clear – the ICU break iterator does not
> break the words with 100% accuracy.
> 
> But, obviously with a dictionary approach, no automatic word breaker
> will ever break correctly 100% of the time. There is no solution that
> will currently automatically break Thai or Khmer 100% correctly (I
> have used Hidden Markov Model breakers, dictionary probability
> breakers, and plain dictionary breakers – none work 100% of the time)
> because, especially for names and places, words in Khmer can just defy
> all rules and patterns. Perhaps in the future, a solution will arise
> that can break Khmer words with 100% accuracy, but at this time, we
> are far from any such solution.
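
To make the point about names concrete, here is a toy sketch of a plain
dictionary breaker (greedy longest match). It is not ICU's algorithm,
which is more sophisticated; greedySegment and the Latin stand-in
dictionary are hypothetical and only illustrate why a word missing from
the dictionary ends up in fragments.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy greedy longest-match segmenter (hypothetical helper, not ICU code).
// Text is held as UTF-32 so that one element is one code point.
std::vector<std::u32string> greedySegment(
    const std::u32string& text,
    const std::vector<std::u32string>& dictionary)
{
    std::vector<std::u32string> words;
    std::size_t pos = 0;
    while (pos < text.size()) {
        // Find the longest dictionary entry that starts at pos.
        std::size_t best = 0;
        for (const std::u32string& entry : dictionary) {
            if (entry.size() > best &&
                text.compare(pos, entry.size(), entry) == 0) {
                best = entry.size();
            }
        }
        // Nothing matched: emit a single code point and move on.
        // This is where unknown names fall apart into pieces.
        if (best == 0) {
            best = 1;
        }
        words.push_back(text.substr(pos, best));
        pos += best;
    }
    return words;
}
```

With the dictionary {the, cat, sat}, the input "thecatzogsat" comes back
as the, cat, z, o, g, sat: the unknown "zog" is split because nothing in
the dictionary covers it.
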
> 
> And this is an important reality to remember, because it
> differentiates Thai and Khmer (and possibly other languages that do
> not use spaces between words) from Western languages such as English,
> where a line-breaker and word-breaker can be correct 100% of the time.
> 
> As an end user, this inability of the ICU break iterator to break
> Khmer words with 100% accuracy causes usability issues when it comes
> to correcting the automatic breaks that are made in error.
> 
> Here are some reasons why:
> 
>      1. In LibreOffice a user cannot see where the words have been
>         broken; the breaks are invisible.
>         
>      2. Therefore, trying to use a U+2060 (Word Joiner) to
>         correct an error in order to correctly spell check is very
>         difficult, because the user cannot see where to place the
>         joiner in order to join the word. In the example case above,
>         the word សិវ|កឥ|វលិ|យៈ actually needs three U+2060 characters
>         to join it to be treated as one word, but the end user does
>         not know this because the breaks are invisible.

FWIW with view->field shading on you should see a little gray mark where
the word joiner exists. At least I do anyway.

>      3. Even if LibreOffice were able to change their code so that the
>         end user could see the word-breaks, adding three U+2060
>         characters is quite laborious just to fix one word so that it
>         can be spell checked correctly (as one word, rather than spell
>         checked as four individual words).
>         
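
The arithmetic behind "three U+2060 characters" can be sketched as
follows. joinWithWordJoiner is a hypothetical helper, not LibreOffice
code: a word that was wrongly split into N pieces needs N-1 joiners, one
at each unwanted break.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// U+2060 WORD JOINER, the same value as WJ in the LibreOffice snippet.
constexpr char32_t WORD_JOINER = 0x2060;

// Hypothetical helper: glue wrongly split pieces back into one word by
// placing a word joiner at every internal break.
std::u32string joinWithWordJoiner(const std::vector<std::u32string>& pieces)
{
    std::u32string joined;
    for (std::size_t i = 0; i < pieces.size(); ++i) {
        if (i > 0) {
            joined.push_back(WORD_JOINER);  // one joiner per internal break
        }
        joined += pieces[i];
    }
    return joined;
}
```

Four pieces, as in the example above, yield three joiners, and the user
has to find each of those positions by hand.
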
> 
> 
> One possible solution to this issue is by how the ICU Break Iterator
> interacts with zero-width spaces (U+200B) in LibreOffice. Before ICU
> code was enabled to automatically break Khmer, if an end-user wanted
> to spell check Khmer, they had to manually place U+200B characters to
> separate words. This solution worked quite well, but was
> counterintuitive to most native speakers, because Khmer has no spaces
> (as stated before). But with this solution, an end-user could be sure
> that their document was broken with 100% accuracy, if there was no
> human error (something automatic solutions cannot do – it is more
> along the lines of 80% accurate). What I propose is that the break
> iterator code in LibreOffice looks for U+200B characters in a given
> string and considers them as a sign to NOT automatically break, but to
> allow the end-user full control to manually break words. Let me
> explain:
> 
>      1. The code starts processing the text and automatically breaking
>         it until it comes across a U+200B character. If one is found,
>         it searches to see if there are any additional U+200B or
>         U+0020 characters in the following 20 characters (or so), and
>         if there are, the break iterator skips over those characters
>         and starts again from the second U+200B character (or U+0020,
>         but a U+0020 character would only signify the “close” of the
>         manual break because sometimes a phrase will end and there
>         will be an actual space – so if the word that the user wants
>         to manually break has a “real” U+0020 space at the end of it,
>         then the user does not need to put an additional U+200B
>         character to close it) which then repeats, looking for U+200B
>         characters etc.
>         
>      2. This would allow end-users to choose to manually break their
>         whole document so they can have precise control, as well as
>         allow end-users to place U+200B characters around names of
>         people, places or transliterations in order to tell the break
>         iterator to not try to break those words.
>         
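
A minimal sketch of the lookahead rule in item 1, under stated
assumptions: the text is already available as UTF-32, the window is
counted in code points, and findProtectedSpans is a hypothetical helper
rather than anything in LibreOffice or ICU. It only reports the ranges
that automatic breaking would skip; wiring that into the break iterator
is a separate question.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

constexpr char32_t ZWSP = 0x200B;   // zero-width space
constexpr char32_t SPACE = 0x0020;  // ordinary space

// Hypothetical helper: returns half-open ranges [first, second) of code
// points that the automatic breaker should leave unbroken.
std::vector<std::pair<std::size_t, std::size_t>> findProtectedSpans(
    const std::u32string& text, std::size_t window = 20)
{
    std::vector<std::pair<std::size_t, std::size_t>> spans;
    std::size_t i = 0;
    while (i < text.size()) {
        if (text[i] != ZWSP) {
            ++i;
            continue;
        }
        // Look ahead at most `window` code points for a closing
        // zero-width space or ordinary space.
        const std::size_t limit = std::min(text.size(), i + 1 + window);
        std::size_t close = std::u32string::npos;
        for (std::size_t j = i + 1; j < limit; ++j) {
            if (text[j] == ZWSP || text[j] == SPACE) {
                close = j;
                break;
            }
        }
        if (close == std::u32string::npos) {
            // No closer in range: treat this ZWSP as an ordinary break.
            ++i;
            continue;
        }
        if (close > i + 1) {
            spans.emplace_back(i + 1, close);
        }
        // A closing ZWSP may open the next span; a closing space may not.
        i = (text[close] == ZWSP) ? close : close + 1;
    }
    return spans;
}
```

One consequence of the rule as written: two zero-width spaces that were
each meant as an ordinary break are also paired if they sit within the
window of each other, so the window size matters more than it may first
appear.
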
> 
> 
> An example of what it would look like (since I am not a programmer, I am
> not sure if this would be the best way to add this feature, you
> probably can come up with a better solution, but I thought by giving
> this example it would explain what I mean better); <zw> is the
> zero-width space U+200B character and <sp> is a normal space U+0020:
> 
>      1. I type:
>         មានប្រាជ្ញាឈ្លាសវៃឈ្មោះសិវកឥវលិយៈ<sp>អ្នកប្រាជ្ញម្នាក់ទៀតដែលល្បីល្បាញជាងគេ
>         and since there are no zero-width spaces, the break iterator
>         breaks the words automatically.
>         
>      2. After typing I know that the last word is the name of a
>         person, so I manually add a zero-width space before the word
>         (but not after since there is already a “real” space there):
>         មានប្រាជ្ញាឈ្លាសវៃឈ្មោះ<zw>សិវកឥវលិយៈ<sp>អ្នកប្រាជ្ញម្នាក់ទៀតដែលល្បីល្បាញជាងគេ
>         
>      3. As the break iterator processes this string, it breaks the
>         words automatically until it comes to the zero-width space,
>         and then stops automatically breaking, looking for a “closing”
>         zero-width space or “real” space. Finding a “closing” real space, it
>         then does not break the characters between the zero width
>         space and the “real” space.
>         
>      4. I then notice that "ម្នាក់ទៀត" line breaks together (since the
>         automatic line-breaking breaks them as one word), and I decide
>         I would rather line-break after “ម្នាក់” rather than have both
>         words break connected to each other, so I place a zero-width
>         space between the words:
>         មានប្រាជ្ញាឈ្លាសវៃឈ្មោះ<zw>សិវកឥវលិយៈ<sp>អ្នកប្រាជ្ញ
>         ម្នាក់<zw>ទៀតដែលល្បីល្បាញជាងគេ 
>         The automatic break iterator comes to the zero-width space,
>         stops automatically breaking, and looks ahead to see if
>         there is a zero-width space or a “real” space within 20
>         characters (this number might need refining, but I think 20
>         characters would be enough). As there are no zero-width or
>         “real” spaces within 20 characters, the break iterator then
>         goes back to the previous zero-width and starts breaking
>         starting from the zero-width character.
>         
>      5. Then I notice that "ជាងគេ" line-breaks together, and I want
>         them to break separately. So I add a zero-width space on either
>         side of the word "ជាង": មានប្រាជ្ញាឈ្លាសវៃឈ្មោះ<zw>សិវកឥ
>         វលិយៈ<sp>អ្នកប្រាជ្ញម្នាក់<zw>ទៀតដែល
>         ល្បីល្បាញ<zw>ជាង<zw>គេ
>         The automatic breaker then comes to the zero-width space
>         before "ជាង" and looks for a “closing” zero-width space or
>         normal space, finding one, it does not break anything between
>         the two zero-width spaces, and begins breaking again after the
>         “closing” zero-width space.
>         
> 
> I hope that isn't too much information, but I wanted to try and cover
> every possibility to make sure the concept was clear.
> 
> I noticed some code in the LibreOffice source that gave me the idea
> that it might be possible to do what I just described – I found it
> online here:
> http://c-cpp.r3dcode.com/files/LibreOffice/3/4.5.2/libs-gui/i18npool/source/breakiterator/breakiterator_unicode.cxx
> 
> 
> #define WJ 0x2060   // Word Joiner
>             GlueSpace=sal_False;
>             if (lbr.breakType == BreakType::WORDBOUNDARY) {
>                 nStartPos = lbr.breakIndex;
>                 if (Text[nStartPos--] == WJ)
>                     GlueSpace=sal_True;
>                 while (nStartPos >= 0 &&
>                     (u_isWhitespace(Text.iterateCodePoints(&nStartPos, 0)) || Text[nStartPos] == WJ)) {
>                     if (Text[nStartPos--] == WJ)
>                         GlueSpace=sal_True;
>                 }
>                 if (GlueSpace && nStartPos < 0)  {
>                     lbr.breakIndex = 0;
>                     break;
>                 }
>             }
>         }
>
>         return lbr;
> }
> 
> Again, I am very grateful for your time and help in making LibreOffice
> work better with Khmer, and I know your time is valuable, but I
> thought I would try to see if you could provide additional help and
> solutions for Khmer. If you don't have time to consider this, no
> worries, we are already grateful for what you have already done.
> 
> 
> Thanks,
> 
> Nathan
> 
> 
> On Fri, Jul 13, 2012 at 3:23 PM, Caolán McNamara <caolanm at redhat.com>
> wrote:
>         On Thu, 2012-07-12 at 23:49 +0700, Nathan Wells wrote:
>         >
>         > > There was something similar done in the past IIRC to
>         > > pass around soft-page-break information so that export
>         > > filters could know where the layout last put the page
>         > > breaks. I forget the details of that though.
>         >
>         > This would be a very useful feature for Cambodians (and I
>         > would assume Thai as well, although Thai tends to have more
>         > programs that currently support wordbreaking already) - would
>         > it be best to seek to do this with an extension rather than
>         > LibreOffice core?
>         
>         
>         Here's the spec for soft-page-breaks, which seems similar to my
>         mind:
>         http://www.openoffice.org/specs/writer/SoftPageBreak/SoftPageBreak.odt
>         This sort of thing would have to be basically implemented in the
>         core, I reckon.
>         
>         C.
>         
>         
> 
>