Adding Extension for Experimental Thai Spelling
Caolán McNamara
caolanm at redhat.com
Wed Jul 25 06:41:54 PDT 2012
I'll cc this to the list if you don't mind, in order to archive it. I
have no immediate great ideas. But I wonder if a "view->word boundaries"
mode would be helpful, i.e. something that indicates the boundaries of
the words that the software thinks exist.
On Sun, 2012-07-15 at 21:40 +0700, Nathan Wells wrote:
>
> I hope you don't mind if I write and ask some more questions and ask
> for additional help in making the break iterator more functional in
> LibreOffice. Thank you again for your help implementing ICU for Khmer
> in LibreOffice. I downloaded a recent beta build with your code
> implemented and did some testing – it is great! But it also brought to
> my attention some issues that hamper the usability of the automatic
> breaking for Khmer (and, I believe, for Thai as well; see this
> discussion:
> http://www.thaivisa.com/forum/topic/444360-thai-in-openoffice-on-ubuntu-lucid-lynx/#entry5160455).
>
>
> An automatic word and line breaker is very necessary for Khmer and
> Thai because traditionally they have no spaces between words. Without
> one, line-breaking and spell checking require a zero-width space
> between words, which is counterintuitive for most native speakers, so
> spell checking goes widely unused.
> But now with the ICU code you implemented, Thai and Khmer can be
> automatically broken, and the results are quite good. But with its
> implementation in the real world, I have found some issues that I
> wanted to raise and also suggest possible solutions. I write this as
> an end-user, not so much as a programmer, nor do I claim to fully
> understand the inner workings of ICU and LibreOffice (because I
> don't!).
>
> First, I will do my best to explain the current results of the ICU
> break iterator with Khmer:
>
> Input sentence: មានប្រាជ្ញាឈ្លាសវៃឈ្មោះសិវកឥវលិយៈ
>
> Current ICU line-breaking: មាន|ប្រាជ្ញាឈ្លាស|វៃ|ឈ្មោះ|សិវ|កឥ|វលិ|យៈ
>
> Compared with the sentence manually broken: មាន|ប្រាជ្ញា|ឈ្លាសវៃ|
> ឈ្មោះ|សិវកឥវលិយៈ
>
> The differences should be clear – the ICU break iterator does not
> break the words with 100% accuracy.
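
To make that comparison concrete, here is a rough C++ sketch (plain standard library, not ICU or LibreOffice code) that turns two '|'-delimited segmentations of the same text into break offsets and reports which automatic breaks are spurious and which manual breaks were missed. The names breakOffsets, BreakDiff and diffBreaks are made up for illustration, and offsets are counted in code points of a std::u32string, which is an assumption of the sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: returns the offsets (in code points, with the '|'
// delimiters excluded) at which a segmentation places its breaks.
std::set<std::size_t> breakOffsets(const std::u32string& segmented)
{
    std::set<std::size_t> offsets;
    std::size_t pos = 0;
    for (char32_t c : segmented) {
        if (c == U'|')
            offsets.insert(pos);   // a break sits before code point `pos`
        else
            ++pos;
    }
    return offsets;
}

struct BreakDiff {
    std::vector<std::size_t> spurious; // automatic breaks the human did not make
    std::vector<std::size_t> missed;   // human breaks the automatic pass did not make
};

// Compares an automatic segmentation against a manual one of the same text.
BreakDiff diffBreaks(const std::u32string& automatic, const std::u32string& manual)
{
    const std::set<std::size_t> a = breakOffsets(automatic);
    const std::set<std::size_t> m = breakOffsets(manual);
    BreakDiff d;
    std::set_difference(a.begin(), a.end(), m.begin(), m.end(),
                        std::back_inserter(d.spurious));
    std::set_difference(m.begin(), m.end(), a.begin(), a.end(),
                        std::back_inserter(d.missed));
    return d;
}
```

Reading the two Khmer lines above by eye, the automatic result has four spurious breaks (one inside ឈ្លាសវៃ and three inside the name) and one missed break (after ប្រាជ្ញា).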
>
> But obviously, with a dictionary approach, no automatic word breaker
> will ever break correctly 100% of the time. There is currently no
> solution that will automatically break Thai or Khmer 100% correctly (I
> have used Hidden Markov Model breakers, dictionary probability
> breakers, and plain dictionary breakers; none work 100% of the time)
> because, especially for names and places, words in Khmer can just defy
> all rules and patterns. Perhaps in the future, a solution will arise
> that can break Khmer words with 100% accuracy, but at this time, we
> are far from any such solution.
>
> And this is an important reality to remember, because it
> differentiates Thai and Khmer (and possibly other languages that do
> not use spaces between words) from Western languages such as English,
> where a line-breaker and word-breaker can be correct 100% of the time.
>
> As an end user, this inability of the ICU break iterator to break
> Khmer words with 100% accuracy causes usability issues when it comes
> to correcting the automatic breaks that are made in error.
>
> Here are some reasons why:
>
> 1. In LibreOffice a user cannot see where the words have been
> broken; the breaks are invisible.
>
> 2. Therefore, trying to use a U+2060 (Word Joiner) to correct an
> error so that the text spell checks correctly is very
> difficult, because the user cannot see where to place the
> joiner in order to join the word. In the example case above,
> the word សិវ|កឥ|វលិ|យៈ actually needs three U+2060 characters
> to join it so that it is treated as one word, but the end user
> does not know this because the breaks are invisible.
FWIW with view->field shading on you should see a little gray mark where
the word joiner exists. At least I do anyway.
> 3. Even if LibreOffice were able to change their code so that the
> end user could see the word-breaks, adding three U+2060
> characters is quite laborious just to fix one word so that it
> can be spell checked correctly (as one word, rather than spell
> checked as four individual words).
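
The bookkeeping behind that complaint is easy to state in code: gluing a word that was wrongly split into n pieces takes n - 1 word joiners, one per false break. A minimal sketch, with a hypothetical helper joinPieces that is not part of LibreOffice or ICU:

```cpp
#include <cstddef>
#include <string>
#include <vector>

constexpr char32_t WORD_JOINER = 0x2060; // U+2060 WORD JOINER

// Hypothetical helper: rebuilds one word from the pieces the automatic
// breaker produced, placing a U+2060 at every false break.
std::u32string joinPieces(const std::vector<std::u32string>& pieces)
{
    std::u32string word;
    for (std::size_t i = 0; i < pieces.size(); ++i) {
        if (i != 0)
            word += WORD_JOINER;   // one joiner per false break
        word += pieces[i];
    }
    return word;
}
```

For the four pieces of the name in the example, that means three joiners, each of which the user currently has to place by hand without seeing where the breaks are.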
>
>
>
> One possible solution lies in how the ICU Break Iterator
> interacts with zero-width spaces (U+200B) in LibreOffice. Before ICU
> code was enabled to automatically break Khmer, if an end-user wanted
> to spell check Khmer, they had to manually place U+200B characters to
> separate words. This solution worked quite well, but was
> counterintuitive to most native speakers, because Khmer has no spaces
> (as stated before). But with this solution, an end-user could be sure
> that their document was broken with 100% accuracy, barring human
> error (something automatic solutions cannot do; they are more along
> the lines of 80% accurate). What I propose is that the break
> iterator code in LibreOffice looks for U+200B characters in a given
> string and treats them as a sign NOT to break automatically, but to
> allow the end-user full control to manually break words. Let me
> explain:
>
> 1. The code starts processing the text and automatically breaking
> it until it comes across a U+200B character. If one is found,
> it checks whether there are any additional U+200B or U+0020
> characters in the following 20 characters (or so). If there
> are, the break iterator skips over the characters in between
> and starts again from the second U+200B character (or
> U+0020), and the process repeats, looking for U+200B
> characters and so on. A U+0020 character only ever signifies
> the “close” of a manual break: sometimes a phrase will end
> with an actual space, so if the word that the user wants to
> manually break has a “real” U+0020 space at the end of it,
> the user does not need to add another U+200B character to
> close it.
>
> 2. This would allow end-users to choose to manually break their
> whole document so they can have precise control, as well as
> allow end-users to place U+200B characters around names of
> people, places or transliterations in order to tell the break
> iterator to not try to break those words.
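
The two points above can be sketched roughly as follows. This is only one reading of the proposal, not ICU or LibreOffice code: the function protectedSpans and the Span type are hypothetical, offsets are counted in code points of a std::u32string, and the sketch assumes that a U+200B with no closing mark inside the window is simply left as an ordinary manual break.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

constexpr char32_t ZWSP  = 0x200B; // zero-width space
constexpr char32_t SPACE = 0x0020; // ordinary space

// Half-open range [first, second) of code points that the automatic
// breaker should leave unbroken.
using Span = std::pair<std::size_t, std::size_t>;

// Hypothetical sketch of the proposed rule. `lookahead` is the
// "20 characters (or so)" window from the proposal.
std::vector<Span> protectedSpans(const std::u32string& text,
                                 std::size_t lookahead = 20)
{
    assert(lookahead > 0);
    std::vector<Span> spans;
    std::size_t i = 0;
    while (i < text.size()) {
        if (text[i] != ZWSP) { ++i; continue; }

        // Look for a closing U+200B or U+0020 within the window.
        std::size_t close = std::u32string::npos;
        const std::size_t limit = std::min(text.size(), i + 1 + lookahead);
        for (std::size_t j = i + 1; j < limit; ++j) {
            if (text[j] == ZWSP || text[j] == SPACE) { close = j; break; }
        }

        // Assumption: with no closing mark in range, automatic breaking
        // simply carries on after this zero-width space.
        if (close == std::u32string::npos) { ++i; continue; }

        if (close > i + 1)
            spans.emplace_back(i + 1, close);

        // A closing U+200B may open the next span; a space only closes.
        i = (text[close] == ZWSP) ? close : close + 1;
    }
    return spans;
}
```

A caller would then discard any automatic break that falls strictly inside one of the returned spans. Under this reading, a closing mark found inside the window always pairs with the opening one, so the window size decides whether two nearby manual breaks end up enclosing a protected word.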
>
>
>
> An example of what it would look like (since I am not a programmer, I
> am not sure if this would be the best way to add this feature, and you
> can probably come up with a better solution, but I thought an
> example would explain what I mean better); <zw> is the
> zero-width space U+200B character and <sp> is a normal space U+0020:
>
> 1. I type:
> មានប្រាជ្ញាឈ្លាសវៃឈ្មោះសិវកឥវលិយៈ<sp>អ្នកប្រាជ្ញម្នាក់ទៀតដែលល្បីល្បាញជាងគេ
> and since there are no zero-width spaces, the break iterator
> breaks the words automatically.
>
> 2. After typing I know that the last word before the space is the
> name of a person, so I manually add a zero-width space before
> the word (but not after, since there is already a “real” space
> there):
> មានប្រាជ្ញាឈ្លាសវៃឈ្មោះ<zw>សិវកឥវលិយៈ<sp>អ្នកប្រាជ្ញម្នាក់ទៀតដែលល្បីល្បាញជាងគេ
>
> 3. As the break iterator processes this string, it breaks the
> words automatically until it comes to the zero-width space,
> and then stops automatically breaking, looking for a “closing”
> zero-width space or “real” space. Finding a “closing” real
> space, it does not break the characters between the zero-width
> space and the “real” space.
>
> 4. I then notice that "ម្នាក់ទៀត" line-breaks together (since the
> automatic line-breaking treats it as one word). I decide I
> would rather line-break after “ម្នាក់” than have both words
> stay connected to each other, so I place a zero-width space
> between the words:
> មានប្រាជ្ញាឈ្លាសវៃឈ្មោះ<zw>សិវកឥវលិយៈ<sp>អ្នកប្រាជ្ញ
> ម្នាក់<zw>ទៀតដែលល្បីល្បាញជាងគេ
> The automatic break iterator comes to the zero-width space,
> stops automatically breaking, and looks ahead to see if
> there is a zero-width space or a “real” space within 20
> characters (this number might need refining, but I think 20
> characters would be enough). As there are no zero-width or
> “real” spaces within 20 characters, the break iterator then
> goes back to the previous zero-width space and resumes
> breaking from that character.
>
> 5. Then I notice that "ជាងគេ" line-breaks together, and I want
> them to break separately. So I add a zero-width space on either
> side of the word "ជាង": មានប្រាជ្ញាឈ្លាសវៃឈ្មោះ<zw>សិវកឥ
> វលិយៈ<sp>អ្នកប្រាជ្ញម្នាក់<zw>ទៀតដែល
> ល្បីល្បាញ<zw>ជាង<zw>គេ
> The automatic breaker then comes to the zero-width space
> before "ជាង" and looks for a “closing” zero-width space or
> normal space. Finding one, it does not break anything between
> the two zero-width spaces, and begins breaking again after the
> “closing” zero-width space.
>
>
> I hope that isn't too much information, but I wanted to try and cover
> every possibility to make sure the concept was clear.
>
> I noticed some code in the LibreOffice source that gave me the idea
> that it might be possible to do what I just described – I found it
> online here:
> http://c-cpp.r3dcode.com/files/LibreOffice/3/4.5.2/libs-gui/i18npool/source/breakiterator/breakiterator_unicode.cxx
>
>
> #define WJ 0x2060 // Word Joiner
>         GlueSpace=sal_False;
>         if (lbr.breakType == BreakType::WORDBOUNDARY) {
>             nStartPos = lbr.breakIndex;
>             if (Text[nStartPos--] == WJ)
>                 GlueSpace=sal_True;
>             while (nStartPos >= 0 &&
>                    (u_isWhitespace(Text.iterateCodePoints(&nStartPos, 0)) || Text[nStartPos] == WJ)) {
>                 if (Text[nStartPos--] == WJ)
>                     GlueSpace=sal_True;
>             }
>             if (GlueSpace && nStartPos < 0) {
>                 lbr.breakIndex = 0;
>                 break;
>             }
>         }
>     }
>
>     return lbr;
> }
>
> Again, I am very grateful for your time and help in making LibreOffice
> work better with Khmer, and I know your time is valuable, but I
> thought I would try to see if you could provide additional help and
> solutions for Khmer. If you don't have time to consider this, no
> worries, we are already grateful for what you have already done.
>
>
> Thanks,
>
> Nathan
>
>
> On Fri, Jul 13, 2012 at 3:23 PM, Caolán McNamara <caolanm at redhat.com>
> wrote:
> On Thu, 2012-07-12 at 23:49 +0700, Nathan Wells wrote:
> >
> > > There was something similar done in the past IIRC to
> > > pass around soft-page-break information so that export filters
> > > could know where the layout last put the page breaks. I forget
> > > the details of that though.
> >
> > This would be a very useful feature for Cambodians (and I would assume
> > Thai as well, although Thai tends to have more programs that currently
> > support wordbreaking already) - would it be best to seek to do this
> > with an extension rather than LibreOffice core?
>
>
> Here's the spec for soft-page-breaks, which seems similar to my mind
> http://www.openoffice.org/specs/writer/SoftPageBreak/SoftPageBreak.odt
> This sort of thing would have to be basically implemented in the core I
> reckon.
>
> C.
>
>
>
>