Adding Extension for Experimental Thai Spelling

Thu Jul 26 02:33:00 PDT 2012

Dear All,

> > An automatic word and line breaker is very necessary for Khmer and
> > Thai because traditionally they have no spaces between words, and so
> > line-breaking and spell checking require the use of a zero-width space
> > between words which is counterintuitive for most native speakers, and
> > so spell checking goes widely unused.

I agree that automatic word breaking is a good thing, and I am relieved to see that LibreOffice does it based on language selection rather than on automatic language guessing from the script. More languages use the Thai and Khmer scripts than just Thai and Khmer. So one of my fears is already alleviated :)

> > But now with the ICU code you implemented, Thai and Khmer can be
> > automatically broken, and the results are quite good. But with its
> > implementation in the real world, I have found some issues that I
> > wanted to raise and also suggest possible solutions. I write this as
> > an end-user, not so much as a programmer, nor do I claim to fully
> > understand the inner-workings of ICU and LibreOffice (because I
> > don't!).
> > 
> > First, I will do my best to explain the current results of the ICU
> > break iterator with Khmer:
> > 
> > Input sentence: មានប្រាជ្ញាឈ្លាសវៃឈ្មោះសិវកឥវលិយៈ
> > 
> > Current ICU line-breaking: មាន|ប្រាជ្ញាឈ្លាស|វៃ|ឈ្មោះ|សិវ|កឥ|វលិ|យៈ
> > 
> > Compared with the sentence manually broken: មាន|ប្រាជ្ញា|ឈ្លាសវៃ|
> > ឈ្មោះ|សិវកឥវលិយៈ
> > 
> > The differences should be clear – the ICU break iterator does not
> > break the words with 100% accuracy.
> > 
> > One possible solution to this issue is by how the ICU Break Iterator
> > interacts with zero-width spaces (U+200B) in LibreOffice. Before ICU
> > code was enabled to automatically break Khmer, if an end-user wanted
> > to spell check Khmer, they had to manually place U+200B characters to
> > separate words. This solution worked quite well, but was
> > counterintuitive to most native speakers, because Khmer has no spaces
> > (as stated before). But with this solution, an end-user could be sure
> > that their document was broken with 100% accuracy, if there was no
> > human error (something automatic solutions cannot do – it is more
> > along the lines of 80% accurate). What I propose, is that the break
> > iterator code in LibreOffice looks for U+200B characters in a given
> > string and considers them as a sign to NOT automatically break, but to
> > allow the end-user full control to manually break words. Let me
> > explain:
> > 
> >      1. The code starts processing the text and automatically breaking
> >         it until it comes across a U+200B character. If one is found,
> >         it searches to see if there are any additional U+200B or
> >         U+0020 characters in the following 20 characters (or so), and
> >         if there are, the break iterator skips over those characters
> >         and starts again from the second U+200B character (or U+0020,
> >         but a U+0020 character would only signify the “close” of the
> >         manual break because sometimes a phrase will end and there
> >         will be an actual space – so if the word that the user wants
> >         to manually break has a “real” U+0020 space at the end of it,
> >         then the user does not need to put an additional U+200B
> >         character to close it) which then repeats, looking for U+200B
> >         characters etc.
> >         
> >      2. This would allow end-users to choose to manually break their
> >         whole document so they can have precise control, as well as
> >         allow end-users to place U+200B characters around names of
> >         people, places or transliterations in order to tell the break
> >         iterator to not try to break those words.

In principle I like this approach. I like the idea of being able to force breaks and non-breaks. But I don't think we are quite there with this solution yet. Here are my difficulties with it:

1. Use of U+2060 (WORD JOINER, WJ) makes string searching and spell checking harder, unless WJ chars are stripped first. They are not part of the spelling of a word, so introducing them into the underlying text stream is problematic for other text processing (such as searching, as mentioned). This is less of an issue for U+200B ZWSP, because that occurs between words and searching across word boundaries is a rarer activity. Likewise, spell checking across word boundaries isn't really needed.

2. How do we decide when the text between two ZWSP chars counts as one word rather than two? How close to the end of a string must a ZWSP occur to disable all breaking before the end of the string? Does "abcdef<zwsp>uvwxyz" block all breaks in the string? I think we need to think harder (deeper) about using ZWSP in this way and see if we can come up with something a little less ambiguous. Having said that, I think we are going to have to think really hard, because I don't think this is an easy problem.
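To make the stripping mentioned in point 1 concrete, here is a minimal Python sketch. The function names are hypothetical and are not taken from LibreOffice or ICU; the only assumption is that U+2060 is removed from both the text and the search term before comparing.

```python
WJ = "\u2060"  # WORD JOINER

def strip_wj(text: str) -> str:
    """Remove WJ characters, which are not part of a word's spelling."""
    return text.replace(WJ, "")

def contains_ignoring_wj(haystack: str, needle: str) -> bool:
    """Substring search that ignores WJ on both sides (hypothetical helper)."""
    return strip_wj(needle) in strip_wj(haystack)
```

A plain substring search for "abcd" fails on "ab" + WJ + "cd", whereas `contains_ignoring_wj` finds it. The cost is that every search and spell-check path has to remember to do this.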

> >      4. I then notice that "ម្នាក់ទៀត" line breaks together (since the
> >         automatic line-breaking breaks them as one word). And I decide
> >         I would rather line-break after “ម្នាក់” rather than have both
> >         words break connected to each other, so I place a zero-width
> >         space between the words:
> >         មានប្រាជ្ញាឈ្លាសវៃឈ្មោះ<zw>សិវកឥវលិយៈ<sp>អ្នកប្រាជ្ញ
> >         ម្នាក់<zw>ទៀតដែលល្បីល្បាញជាងគេ 
> >         the automatic break iterator comes to the zero width space and
> >         then stops automatically breaking and looks ahead to see if
> >         there is a zero-width space or a “real” space within 20
> >         characters (this number might need refining, but I think 20
> >         characters would be enough). As there are no zero-width or
> >         “real” spaces within 20 characters, the break iterator then
> >         goes back to the previous zero-width and starts breaking
> >         starting from the zero-width character.

Now what happens if I want to put zw around a word that occurs < 20 chars after my last zw? The on/off nature of the zw has now been inverted. One option is to say that zw must always occur in pairs, so you would have to bracket your first or second word there. But then keeping track of which zw is on and which is off will get confusing for users.
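The inversion can be seen in a small Python sketch of the look-ahead rule. This is only one reading of the proposal, not LibreOffice or ICU code: it assumes that the closing zw (or space) is consumed once it closes a protected span, and `protected_spans` is a hypothetical name.

```python
ZWSP = "\u200b"
SPACE = " "
LOOKAHEAD = 20  # the "20 characters (or so)" from the proposal

def protected_spans(text, lookahead=LOOKAHEAD):
    """Return (start, end) index pairs that automatic breaking would skip."""
    spans = []
    i = 0
    while i < len(text):
        if text[i] == ZWSP:
            # Look for a closing ZWSP or space within the window.
            window_end = min(len(text), i + 1 + lookahead)
            closer = next(
                (j for j in range(i + 1, window_end)
                 if text[j] in (ZWSP, SPACE)),
                None,
            )
            if closer is not None:
                spans.append((i + 1, closer))
                i = closer + 1  # assumption: the closer is consumed
                continue
        i += 1
    return spans

# Intent: the first zw is just a break; only NAME should be protected.
sample = "xxxx" + ZWSP + "yyyy" + ZWSP + "NAME" + ZWSP + "zzz"
spans = protected_spans(sample)
```

Under this reading `spans` covers "yyyy" and not "NAME", which is the opposite of what the user meant.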

An alternative model is to weight breakpoints. An explicit breakpoint weighs more than an automatically generated one. Then, when it comes to line breaking, the weight of a breakpoint counts towards whether it is chosen as the actual break. For example, say an explicit break weighs 2 and an automatic one weighs 1. Then we might use a square rule for distance and say: an explicit break is preferred if it occurs closer to the end of the line than 4x the distance from the end of the line to the last automatic break on that line. Or some such.
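A rough Python sketch of the weighting idea follows. It makes several assumptions that are not in the text above: offsets are character positions on one line, distances are measured back from the end of the line, the factor 4 is read as the square of the 2:1 weight ratio, and `choose_break` is a hypothetical name.

```python
EXPLICIT_WEIGHT = 2   # example weights from the text
AUTOMATIC_WEIGHT = 1

def choose_break(line_end, explicit_breaks, automatic_breaks):
    """Pick a break offset at or before line_end, or None if there is none."""
    explicit = max((b for b in explicit_breaks if b <= line_end), default=None)
    automatic = max((b for b in automatic_breaks if b <= line_end), default=None)
    if explicit is None or automatic is None:
        # Only one kind of break (or neither) is available.
        return explicit if automatic is None else automatic
    # Assumed square rule: (2 / 1) ** 2 == 4.
    factor = (EXPLICIT_WEIGHT / AUTOMATIC_WEIGHT) ** 2
    if (line_end - explicit) < factor * (line_end - automatic):
        return explicit
    return automatic
```

With a line ending at offset 40, an explicit break at 30 beats an automatic break at 37 (10 < 4 x 3), while an explicit break at 20 loses to it (20 is not < 12).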

Yours,
Martin