Adding Extension for Experimental Thai Spelling

Nathan Wells sungkhum at gmail.com
Wed Jul 25 07:08:23 PDT 2012


Thanks for your reply.

Yes, a "view->word boundaries" mode would be very helpful (or even
extending the current "view->field shading" to show 'gray marks' at the
automatic ICU break positions, so that users can see what is being done).
Would this be hard to implement?
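
To make the idea concrete, here is a rough sketch of what such a view could show. It is plain Python, not LibreOffice code: the function name `show_boundaries` and the break offsets are made up for illustration, and in a real implementation the offsets would come from the break iterator.

```python
def show_boundaries(text, break_offsets, mark="|"):
    """Return text with a visible mark inserted at each break offset.

    break_offsets are character indexes into text (hypothetical input;
    a real implementation would obtain them from the break iterator).
    """
    pieces = []
    start = 0
    for offset in sorted(set(break_offsets)):
        # Skip offsets at the very start or end, or outside the text.
        if 0 < offset < len(text):
            pieces.append(text[start:offset])
            start = offset
    pieces.append(text[start:])
    return mark.join(pieces)


# Toy Latin example with breaks after "the" and "cat".
print(show_boundaries("thecatsat", [3, 6]))
```

The mark here is a visible "|", but the same approach would work with any glyph or shading that does not change the stored text.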

Also, we are making some changes to the ICU break iterator dictionary for
Khmer - and I've heard there will be some changes in ICU 50 which should
improve results for Khmer.

If anyone has any ideas - it would be appreciated.

Thanks!
Nathan


On Wed, Jul 25, 2012 at 8:41 PM, Caolán McNamara <caolanm at redhat.com> wrote:

> I'll cc this to the list if you don't mind, in order to archive it. I
> have no immediate great ideas. But I wonder if a "view->word boundaries"
> mode would be helpful, i.e. something that indicates the boundaries of
> the words that the software thinks exist.
>
> On Sun, 2012-07-15 at 21:40 +0700, Nathan Wells wrote:
> >
> > I hope you don't mind if I write and ask some more questions and ask
> > for additional help in making the break iterator more functional in
> > LibreOffice. Thank you again for your help implementing ICU for Khmer
> > in LibreOffice. I downloaded a recent beta build with your code
> > implemented and did some testing – it is great! But it also brought to
> > my attention some issues that hamper the usability of the automatic
> > breaking for Khmer (and, I believe, for Thai as well; see this
> > discussion:
> > http://www.thaivisa.com/forum/topic/444360-thai-in-openoffice-on-ubuntu-lucid-lynx/#entry5160455
> > ).
> >
> >
> > An automatic word and line breaker is very necessary for Khmer and
> > Thai because traditionally they have no spaces between words, and so
> > line-breaking and spell checking require the use of a zero-width space
> > between words which is counterintuitive for most native speakers, and
> > so spell checking goes widely unused.
> > But now with the ICU code you implemented, Thai and Khmer can be
> > automatically broken, and the results are quite good. But with its
> > implementation in the real world, I have found some issues that I
> > wanted to raise and also suggest possible solutions. I write this as
> > an end-user, not so much as a programmer, nor do I claim to fully
> > understand the inner workings of ICU and LibreOffice (because I don't!).
> >
> > First, I will do my best to explain the current results of the ICU
> > break iterator with Khmer:
> >
> > Input sentence: មានប្រាជ្ញាឈ្លាសវៃឈ្មោះសិវកឥវលិយៈ
> >
> > Current ICU line-breaking: មាន|ប្រាជ្ញាឈ្លាស|វៃ|ឈ្មោះ|សិវ|កឥ|វលិ|យៈ
> >
> > Compared with the sentence manually broken: មាន|ប្រាជ្ញា|ឈ្លាសវៃ|
> > ឈ្មោះ|សិវកឥវលិយៈ
> >
> > The differences should be clear – the ICU break iterator does not
> > break the words with 100% accuracy.
> >
> > But, obviously, with a dictionary approach no automatic word breaker
> > will ever break correctly 100% of the time. There is currently no
> > solution that will automatically break Thai or Khmer 100% correctly (I
> > have used Hidden Markov Model breakers, dictionary probability
> > breakers, and plain dictionary breakers; none work 100% of the time)
> > because, especially for names and places, words in Khmer can just defy
> > all rules and patterns. Perhaps in the future, a solution will arise
> > that can break Khmer words with 100% accuracy, but at this time, we
> > are far from any such solution.
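
A toy illustration of why names defeat a dictionary breaker. This is a deliberately simple greedy longest-match segmenter written in Python, not the algorithm ICU actually uses; the function name, the tiny dictionary, and the Latin sample text are all made up for the example.

```python
def longest_match_break(text, dictionary, max_len=10):
    """Greedy longest-match segmentation (a toy dictionary breaker)."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character
        # when nothing in the dictionary matches at this position.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary or length == 1:
                words.append(candidate)
                i += length
                break
    return words


# "nathan" is a name and is not in the dictionary, but "than" is.
toy_dictionary = {"my", "name", "is", "than"}
print(longest_match_break("mynameisnathan", toy_dictionary))
```

The known words come out whole, while the name is cut into fragments, one of which happens to be a real dictionary word. That is the same kind of error as in the Khmer name above.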
> >
> > And this is an important reality to remember, because it
> > differentiates Thai and Khmer (and possibly other languages that do
> > not use spaces between words) from Western languages such as English,
> > where a line-breaker and word-breaker can be correct 100% of the time.
> >
> > As an end user, this inability of the ICU break iterator to break
> > Khmer words with 100% accuracy causes usability issues when it comes
> > to correcting the automatic breaks that are made in error.
> >
> > Here are some reasons why:
> >
> >      1. In LibreOffice a user cannot see where the words have been
> >         broken; the breaks are invisible.
> >
> >      2. Therefore, trying to use a U+2060 (Word Joiner, a zero-width
> >         character) to correct an error in order to correctly spell
> >         check is very difficult, because the user cannot see where to
> >         place the joiner in order to join the word (in the example
> >         above, the word សិវ|កឥ|វលិ|យៈ actually needs three U+2060
> >         characters to join it so that it is treated as one word, but
> >         the end user does not know this because the breaks are
> >         invisible).
>
> FWIW with view->field shading on you should see a little gray mark where
> the word joiner exists. At least I do anyway.
>
> >      3. Even if LibreOffice were able to change their code so that the
> >         end user could see the word-breaks, adding three U+2060
> >         characters is quite laborious just to fix one word so that it
> >         can be spell checked correctly (as one word, rather than spell
> >         checked as four individual words).
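
A small sketch of the arithmetic behind that complaint: a word that was wrongly split into n fragments needs n - 1 joiners. The helper name `join_fragments` and the placeholder fragments are assumptions for illustration only.

```python
WORD_JOINER = "\u2060"  # U+2060, invisible and zero-width


def join_fragments(fragments):
    """Glue wrongly split fragments back into one word with U+2060."""
    return WORD_JOINER.join(fragments)


# Four placeholder fragments standing in for the four Khmer pieces.
pieces = ["A", "B", "C", "D"]
joined = join_fragments(pieces)
# Number of joiners the user would have to type by hand.
print(joined.count(WORD_JOINER))
```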
> >
> >
> >
> > One possible solution to this issue lies in how the ICU Break Iterator
> > interacts with zero-width spaces (U+200B) in LibreOffice. Before ICU
> > code was enabled to automatically break Khmer, if an end-user wanted
> > to spell check Khmer, they had to manually place U+200B characters to
> > separate words. This solution worked quite well, but was
> > counterintuitive to most native speakers, because Khmer has no spaces
> > (as stated before). But with this solution, an end-user could be sure
> > that their document was broken with 100% accuracy, barring human
> > error (something automatic solutions cannot do; they are more
> > along the lines of 80% accurate). What I propose is that the break
> > iterator code in LibreOffice look for U+200B characters in a given
> > string and treat them as a sign NOT to break automatically, but to
> > allow the end-user full control to manually break words. Let me
> > explain:
> >
> >      1. The code starts processing the text and automatically breaking
> >         it until it comes across a U+200B character. If one is found,
> >         it searches to see if there are any additional U+200B or
> >         U+0020 characters in the following 20 characters (or so). If
> >         there are, the break iterator skips over those characters
> >         and starts again from the second U+200B character (or U+0020),
> >         and the process then repeats, looking for U+200B characters
> >         etc. A U+0020 character would only signify the “close” of the
> >         manual break, because sometimes a phrase will end and there
> >         will be an actual space; so if the word that the user wants
> >         to manually break has a “real” U+0020 space at the end of it,
> >         the user does not need to add an additional U+200B
> >         character to close it.
> >
> >      2. This would allow end-users to choose to manually break their
> >         whole document so they can have precise control, as well as
> >         allow end-users to place U+200B characters around names of
> >         people, places or transliterations in order to tell the break
> >         iterator to not try to break those words.
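
The two points above can be sketched roughly as follows. This is only an illustration in Python of the proposed look-ahead rule, not LibreOffice or ICU code; the function name `find_manual_spans` and the default window of 20 characters are assumptions taken from the description, and the sample text is a Latin toy.

```python
ZWSP = "\u200b"   # zero-width space, U+200B
SPACE = " "       # ordinary space, U+0020


def find_manual_spans(text, lookahead=20):
    """Return (start, end) index pairs that should NOT be auto-broken.

    A span opens at a U+200B and closes at the next U+200B or U+0020
    found within `lookahead` characters. If no closer is found, the
    U+200B is treated as an ordinary break and no span is opened.
    """
    spans = []
    i = 0
    while i < len(text):
        if text[i] == ZWSP:
            window_end = min(len(text), i + 1 + lookahead)
            closer = None
            for j in range(i + 1, window_end):
                if text[j] in (ZWSP, SPACE):
                    closer = j
                    break
            if closer is not None:
                spans.append((i + 1, closer))
                # A closing U+200B may itself open the next span,
                # so resume on it; a closing space is simply skipped.
                i = closer if text[closer] == ZWSP else closer + 1
                continue
        i += 1
    return spans


sample = "abc" + ZWSP + "Nathan" + SPACE + "def"
for start, end in find_manual_spans(sample):
    print(sample[start:end])
```

One consequence of this rule worth noting: two unrelated U+200B characters that happen to sit closer together than the look-ahead window would be read as a pair, so the window size is a trade-off.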
> >
> >
> >
> > An example of what it would look like (since I am not a programmer, I
> > am not sure if this would be the best way to add this feature, and you
> > can probably come up with a better solution, but I thought giving
> > this example would explain what I mean better); <zw> is the
> > zero-width space U+200B character and <sp> is a normal space U+0020:
> >
> >      1. I type:
> >
> >         មានប្រាជ្ញាឈ្លាសវៃឈ្មោះសិវកឥវលិយៈ<sp>អ្នកប្រាជ្ញម្នាក់ទៀតដែលល្បីល្បាញជាងគេ
> >         and since there are no zero-width spaces, the break iterator
> >         breaks the words automatically.
> >
> >      2. After typing I know that the last word is the name of a
> >         person, so I manually add a zero-width space before the word
> >         (but not after since there is already a “real” space there):
> >
> >         មានប្រាជ្ញាឈ្លាសវៃឈ្មោះ<zw>សិវកឥវលិយៈ<sp>អ្នកប្រាជ្ញម្នាក់ទៀតដែលល្បីល្បាញជាងគេ
> >
> >      3. As the break iterator processes this string, it breaks the
> >         words automatically until it comes to the zero-width space,
> >         and then stops automatically breaking, looking for a “closing”
> >         zero-width space or “real” space. Finding a “closing” real
> >         space, it then does not break the characters between the
> >         zero-width space and the “real” space.
> >
> >      4. I then notice that "ម្នាក់ទៀត" line-breaks together (since the
> >         automatic line-breaking treats it as one word), and I decide
> >         I would rather line-break after “ម្នាក់” than have both
> >         words stay connected to each other, so I place a zero-width
> >         space between the words:
> >         មាន​ប្រាជ្ញាឈ្លាស​វៃ​ឈ្មោះ<zw>សិវ​កឥ​វលិ​យៈ<sp>​អ្នកប្រាជ្ញ
> >         ម្នាក់<zw>ទៀត​ដែល​ល្បីល្បាញ​ជាងគេ
> >         The automatic break iterator comes to the zero-width space,
> >         stops automatically breaking, and looks ahead to see if
> >         there is a zero-width space or a “real” space within 20
> >         characters (this number might need refining, but I think 20
> >         characters would be enough). As there are no zero-width or
> >         “real” spaces within 20 characters, the break iterator then
> >         goes back to the previous zero-width space and resumes
> >         breaking from that character.
> >
> >      5. Then I notice that "ជាងគេ" line-breaks together, and I want
> >         them to break separately. So I add a zero-width space on either
> >         side of the word "ជាង": មាន​ប្រាជ្ញាឈ្លាស​វៃ​ឈ្មោះ<zw>សិវ​កឥ
> >         វលិ​យៈ<sp>​អ្នកប្រាជ្ញ​ម្នាក់<zw>ទៀត​ដែល
> >         ល្បីល្បាញ​<zw>ជាង<zw>គេ
> >         The automatic breaker then comes to the zero-width space
> >         before "ជាង" and looks for a “closing” zero-width space or
> >         normal space. Finding one, it does not break anything between
> >         the two zero-width spaces, and begins breaking again after the
> >         “closing” zero-width space.
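
Put another way, steps 3 to 5 amount to discarding any automatic break that falls inside a manually delimited span. A minimal Python sketch of that filtering step, assuming the automatic break offsets and the manual spans are already known (both are hypothetical inputs here, and `filter_breaks` is a made-up name):

```python
def filter_breaks(auto_breaks, manual_spans):
    """Drop automatic break offsets that fall strictly inside a manual span.

    Offsets equal to a span's start or end are kept, since those are the
    boundaries the user asked for.
    """
    kept = []
    for offset in auto_breaks:
        inside = any(start < offset < end for start, end in manual_spans)
        if not inside:
            kept.append(offset)
    return kept


# Toy text: "mynameis" + U+200B + "nathan" + " " + "ok".
# The name occupies indexes 9..14, so the manual span is (9, 15).
auto_breaks = [2, 6, 8, 9, 10, 11, 15, 16]   # pretend breaker output
manual_spans = [(9, 15)]
print(filter_breaks(auto_breaks, manual_spans))
```

The breaks at 10 and 11, which cut the name into pieces, are removed, and everything outside the span is left to the automatic breaker.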
> >
> >
> > I hope that isn't too much information, but I wanted to try and cover
> > every possibility to make sure the concept was clear.
> >
> > I noticed some code in the LibreOffice source that gave me the idea
> > that it might be possible to do what I just described – I found it
> > online here:
> >
> > http://c-cpp.r3dcode.com/files/LibreOffice/3/4.5.2/libs-gui/i18npool/source/breakiterator/breakiterator_unicode.cxx
> >
> >
> > #define WJ 0x2060   // Word Joiner
> >             GlueSpace=sal_False;
> >             if (lbr.breakType == BreakType::WORDBOUNDARY) {
> >                 nStartPos = lbr.breakIndex;
> >                 if (Text[nStartPos--] == WJ)
> >                     GlueSpace=sal_True;
> >                 while (nStartPos >= 0 &&
> >                     (u_isWhitespace(Text.iterateCodePoints(&nStartPos, 0)) || Text[nStartPos] == WJ)) {
> >                     if (Text[nStartPos--] == WJ)
> >                         GlueSpace=sal_True;
> >                 }
> >                 if (GlueSpace && nStartPos < 0)  {
> >                     lbr.breakIndex = 0;
> >                     break;
> >                 }
> >             }
> >         }
> >
> >         return lbr;
> > }
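
For readers who do not follow C++, here is my reading of the fragment above restated as a small Python sketch. It only mirrors the backward scan over whitespace and U+2060; the function name is made up, the rest of the LibreOffice function is not shown in the excerpt and is not reproduced, and Python's `str.isspace()` is only an approximation of ICU's `u_isWhitespace`.

```python
WORD_JOINER = "\u2060"


def scan_back_for_joiner(text, break_index):
    """Walk left from break_index over whitespace and U+2060.

    Returns (saw_joiner, position), where position is the index of the
    first character that is neither whitespace nor U+2060, or -1 if the
    start of the text was reached.
    """
    pos = break_index
    saw_joiner = False
    # Check the character at the break itself, then step left.
    if pos < len(text) and text[pos] == WORD_JOINER:
        saw_joiner = True
    pos -= 1
    # str.isspace() stands in for u_isWhitespace (an approximation).
    while pos >= 0 and (text[pos].isspace() or text[pos] == WORD_JOINER):
        if text[pos] == WORD_JOINER:
            saw_joiner = True
        pos -= 1
    return saw_joiner, pos


print(scan_back_for_joiner("ab " + WORD_JOINER + "cd", 4))
```

In the original fragment, the case where a joiner was seen and the scan ran off the start of the text is the one that resets `lbr.breakIndex` to 0.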
> >
> > Again, I am very grateful for your time and help in making LibreOffice
> > work better with Khmer, and I know your time is valuable, but I
> > thought I would try to see if you could provide additional help and
> > solutions for Khmer. If you don't have time to consider this, no
> > worries, we are already grateful for what you have already done.
> >
> >
> > Thanks,
> >
> > Nathan
> >
> >
> > On Fri, Jul 13, 2012 at 3:23 PM, Caolán McNamara <caolanm at redhat.com>
> > wrote:
> >         On Thu, 2012-07-12 at 23:49 +0700, Nathan Wells wrote:
> >         >
> >         > > There was something similar done in the past IIRC to
> >         > > pass around soft-page-break information so that export filters
> >         > > could know where the layout last put the page breaks. I forget
> >         > > the details of that though.
> >         >
> >         > This would be a very useful feature for Cambodians (and I would assume
> >         > Thai as well, although Thai tends to have more programs that currently
> >         > support wordbreaking already) - would it be best to seek to do this
> >         > with an extension rather than LibreOffice core?
> >
> >
> >         Here's the spec for soft-page-breaks, which seems similar to my mind
> >         http://www.openoffice.org/specs/writer/SoftPageBreak/SoftPageBreak.odt
> >         This sort of thing would have to be basically implemented in the core I
> >         reckon.
> >
> >         C.
> >
> >
> >
> >
>
>
>