Adding Extension for Experimental Thai Spelling

Thu Jul 12 09:09:56 PDT 2012

On Sun, 2012-07-08 at 08:08 -0700, sungkhum wrote:
> I have two questions. First, is there a way to have the LibreOffice
> spelling checker (Hunspell) also recognize word breaks using the ICU break
> iterator for Khmer, as it seems to do for Thai now, so that Cambodians no
> longer have to add zero-width spaces manually? Currently, lines without
> zero-width spaces are seen as one long word by the spelling checker in
> LibreOffice 3.6. But since the line-breaking is working, it would seem
> that breaking words for the spelling checker should also be able to work.
> Should I submit a bug? How should I proceed?

Sounds like a bug really. I mean, hunspell itself generally doesn't do
the parsing of text into words; the app gives each word to hunspell. And
we're *supposed* to be using the ICU break iterator to split words. I
suspect it's a similar bug to this original one.
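
To make that division of labour concrete, here is a minimal Python
sketch. It is not LibreOffice or ICU code: segment_words() is a
hypothetical stand-in for the break iterator (a greedy longest-match
over a toy lexicon, which is not ICU's actual algorithm), and
KNOWN_WORDS stands in for the hunspell dictionary. Latin text without
spaces plays the role of unspaced Khmer.

```python
# Sketch only: segment_words() is a hypothetical stand-in for an ICU word
# break iterator, and KNOWN_WORDS is a toy lexicon standing in for hunspell.
ZWSP = "\u200b"  # zero-width space

KNOWN_WORDS = {"hello", "world"}


def segment_words(text, lexicon):
    """Split text into words by greedy longest match (not ICU's algorithm)."""
    words = []
    # Explicit zero-width spaces and ordinary whitespace still act as breaks.
    for chunk in text.replace(ZWSP, " ").split():
        i = 0
        while i < len(chunk):
            # Look for the longest lexicon entry that starts at position i.
            for j in range(len(chunk), i, -1):
                if chunk[i:j] in lexicon:
                    break
            else:
                # Nothing matched: treat the rest of the chunk as one word.
                j = len(chunk)
            words.append(chunk[i:j])
            i = j
    return words


def misspelled(text, lexicon=KNOWN_WORDS):
    """The 'app' side: segment first, then check each word separately."""
    return [w for w in segment_words(text, lexicon) if w not in lexicon]
```

With this sketch, misspelled("helloworld") finds nothing to complain
about, because the checker only ever sees "hello" and "world". Without
the segmentation step it would see one long unknown word, which is the
behaviour described in the report.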

So... sure, file a bug, assign it to me (caolanm at redhat.com) and paste a
short two-word example text into the bug and indicate where the word
break should be, and I'll add a regression test for it and see if it's a
trivial fix for Khmer too now that we're using the latest-and-greatest
ICU.

> Also, since many other programs do not incorporate ICU's code, is there a
> way to make the line breaks "real" when a document is saved in another
> format (such as .doc)? By "real" I mean that a zero-width space is
> actually added to the text where a line break should be.

That should at least be theoretically possible, albeit a bit tricky
seeing as the layout code is the bit that knows the width of the page
and does the line breaking, while the export filters don't get to know
that information. There was something similar done in the past IIRC to
pass around soft-page-break information so that export filters could
know where the layout last put the page breaks. I forget the details of
that though.
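
The text-side half of that is simple once the break positions are
known. A minimal Python sketch, assuming (hypothetically) that the
export filter is handed a list of character offsets where breaks are
allowed; insert_zwsp() and break_offsets are illustrative names, not
anything that exists in the filters today:

```python
ZWSP = "\u200b"  # zero-width space


def insert_zwsp(text, break_offsets):
    """Return text with U+200B inserted at each offset.

    Offsets index the original text. Offsets at the very start or end,
    or next to an existing zero-width space, are skipped.
    """
    pieces = []
    prev = 0
    for off in sorted(set(break_offsets)):
        if not 0 < off < len(text):
            continue
        if text[off - 1] == ZWSP or text[off] == ZWSP:
            continue
        pieces.append(text[prev:off] + ZWSP)
        prev = off
    pieces.append(text[prev:])
    return "".join(pieces)
```

For example, insert_zwsp("helloworld", [5]) gives "hello", a
zero-width space, then "world". The hard part remains getting those
offsets out of the layout code and into the filter.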

C.