[HarfBuzz] script segmentation

Wed Feb 14 04:01:55 UTC 2018

Dear All,

One problem I keep running into as we add characters to Unicode is that adding a character to a block does not guarantee that an existing application will keep it in the same run as the other characters of that block's script. The application stays broken until the character is published in a future version of the Unicode Standard, a library is updated, and the application is updated to use the new version of the library. It also makes it impossible to test proposed changes to Unicode. It would be great if we could come up with a standard script segmentation algorithm for runs of text that is also somewhat future-proof, even if it is not perfect and has to change later. A best guess at the script of an unknown character is far more likely to be correct than assigning it a special script category of Unknown, which is always going to be wrong.
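
To make the "best guess" idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how HarfBuzz or any existing library works. The tables OLD_CHAR_SCRIPT and BLOCK_SCRIPT and the functions strict_script, guess_script and segment are all hypothetical names, and the data pretends that the library's per-character table predates U+1001. A character missing from the per-character data falls back to the dominant script of its block instead of Unknown.

```python
from itertools import groupby

# Hypothetical per-character script data, as an out-of-date library might
# ship it. For illustration we pretend it predates U+1001.
OLD_CHAR_SCRIPT = {0x1000: "Myanmar", 0x1002: "Myanmar"}

# Hypothetical block table: (first, last, dominant script of the block).
# A real table would have to be derived from the Unicode data files.
BLOCK_SCRIPT = [
    (0x0400, 0x04FF, "Cyrillic"),
    (0x1000, 0x109F, "Myanmar"),
]


def strict_script(ch):
    """Script from per-character data only; anything else is Unknown."""
    return OLD_CHAR_SCRIPT.get(ord(ch), "Unknown")


def guess_script(ch):
    """Per-character data first, then the block's dominant script."""
    cp = ord(ch)
    if cp in OLD_CHAR_SCRIPT:
        return OLD_CHAR_SCRIPT[cp]
    for first, last, script in BLOCK_SCRIPT:
        if first <= cp <= last:
            return script  # best guess for a character we do not know
    return "Unknown"  # outside every block we know about


def segment(text, script_of=guess_script):
    """Split text into (script, substring) runs of equal script."""
    return [(script, "".join(group))
            for script, group in groupby(text, key=script_of)]


text = "\u1000\u1001\u1002"
print(segment(text, strict_script))  # the unknown character splits the run
print(segment(text))                 # the block fallback keeps one run
```

With strict lookup the unknown character breaks the text into three runs; with the block fallback it stays in a single Myanmar run. The sketch deliberately ignores Common and Inherited characters and blocks that mix several scripts, both of which a real algorithm would have to handle.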

So.

1. Do we have a standard algorithm for this?
2. Do we want one?
3. How can we make it more resilient to future additions?

TIA,
Yours,
Martin