[Libreoffice-bugs] [Bug 97198] Calc: incorrect formula text function result from Unicode' s non-BMP characters. Functions LEN(), LEFT(), RIGHT(), MID(), SEARCH(), FIND(), REPLACE()
bugzilla-daemon at bugs.documentfoundation.org
Thu Nov 9 12:41:31 UTC 2017
https://bugs.documentfoundation.org/show_bug.cgi?id=97198
Eike Rathke <erack at redhat.com> changed:
              What|Removed                     |Added
----------------------------------------------------------------------------
            Status|UNCONFIRMED                 |NEW
    Ever confirmed|0                           |1
--- Comment #11 from Eike Rathke <erack at redhat.com> ---
(In reply to Winfried Donkers from comment #10)
> As already said in comment #4, LO uses 16bits for unicode characters, which
> is not enough for the UniCode 5.2.0 (referenced to in ODFF1.2).
That's not true. Strings are UTF-16, which is fully capable of representing all
Unicode planes: a non-BMP character is stored as a surrogate pair of two 16-bit
code units.
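
As an illustration of why 16-bit units are enough, here is a minimal Python
sketch of the standard UTF-16 encoding rule. The function name to_utf16_units
is made up for this example and is not part of LibreOffice.

```python
def to_utf16_units(code_point: int) -> list[int]:
    """Encode one Unicode code point as a list of 16-bit UTF-16 code units."""
    # Reject values outside the Unicode range and the surrogate range itself.
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    # BMP characters fit into a single code unit.
    if code_point < 0x10000:
        return [code_point]
    # Non-BMP characters are split into a high and a low surrogate,
    # each carrying 10 bits of the 20-bit offset.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]


# U+1F600 (a non-BMP character) becomes the pair D83D DE00.
print([f"{u:04X}" for u in to_utf16_units(0x1F600)])
```

Every code point up to U+10FFFF takes either one or two code units this way,
so the storage format itself is not the limitation.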
However, the second example is not even about non-BMP characters: the "üë" in
B1 consists of the sequence
U+0075 LATIN SMALL LETTER U
U+0308 COMBINING DIAERESIS
U+0065 LATIN SMALL LETTER E
U+0308 COMBINING DIAERESIS
To treat such a combining sequence as one character, some Unicode normalization
would have to take place, which, if done at all, should be implemented in a
separate function.
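
To make the point about normalization concrete, here is a small Python sketch
using the standard unicodedata module. It only illustrates the effect on the
four code points listed above; it says nothing about how Calc would or should
implement such a function.

```python
import unicodedata

# The decomposed sequence from cell B1: u, combining diaeresis,
# e, combining diaeresis.
decomposed = "u\u0308e\u0308"

# NFC composes each base letter with its combining mark where a
# precomposed code point exists.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))                    # number of code points before
print(len(composed))                      # number of code points after
print([f"U+{ord(c):04X}" for c in composed])
```

Note that NFC only helps where a precomposed character exists. Combining
sequences without a precomposed form stay longer than one code point, so
normalization alone is not a general "count what the user sees" solution.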
As for the first attachment, these are real non-BMP characters. For this to
work correctly, the mentioned spreadsheet functions should operate on the actual
Unicode code points instead of the 16-bit code units.
OUString::iterateCodePoints() exists to do that properly; a change in LO's
string handling is not necessary.
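
The idea of walking a UTF-16 string by code points can be sketched in Python
as follows. This is an analogue for illustration only, not the
OUString::iterateCodePoints() API; the helper names utf16_units,
iterate_code_points and len_code_points are assumptions of this example.

```python
def utf16_units(text: str) -> list[int]:
    """Return the 16-bit UTF-16 code units of a Python string."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def iterate_code_points(units):
    """Yield code points from a sequence of UTF-16 code units."""
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        # A high surrogate followed by a low surrogate forms one code point.
        if (0xD800 <= unit <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # BMP character (or an unpaired surrogate, passed through as is).
            yield unit
            i += 1


def len_code_points(units) -> int:
    """Length in code points, the way LEN() is proposed to count."""
    return sum(1 for _ in iterate_code_points(units))


units = utf16_units("a\U0001F600b")
print(len(units))              # length in 16-bit code units
print(len_code_points(units))  # length in code points
```

The underlying storage stays UTF-16; only the iteration step changes.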
> As Excel 2016 has the same results as LO, interoperability would be affected
> when LO is changed.
Is the result also the same with the first attachment and real non-BMP
characters?
Anyway, I couldn't find a definition of how Excel treats non-BMP characters,
but as LEN() is supposed to return the length in *characters*, not in 16-bit
code units, I propose to actually handle it that way. I consider treating these
string functions in code units wrong, specifically for LEFT(), RIGHT(), MID()
and REPLACE(), which should operate on characters; operating on code units
could cut a non-BMP character in the middle.
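
The difference for a LEFT()-style operation can be shown with a short Python
sketch. Both functions are hypothetical illustrations working on a list of
UTF-16 code units; neither is Calc's actual implementation.

```python
def left_units(units, count):
    """LEFT() counted in 16-bit code units: may split a surrogate pair."""
    return units[:count]


def left_code_points(units, count):
    """LEFT() counted in code points: a surrogate pair is taken as a whole."""
    end = 0
    taken = 0
    while taken < count and end < len(units):
        is_pair = (0xD800 <= units[end] <= 0xDBFF
                   and end + 1 < len(units)
                   and 0xDC00 <= units[end + 1] <= 0xDFFF)
        end += 2 if is_pair else 1
        taken += 1
    return units[:end]


# U+1F600 (stored as the pair D83D DE00) followed by "a".
sample = [0xD83D, 0xDE00, 0x0061]

print([f"{u:04X}" for u in left_units(sample, 1)])        # lone high surrogate
print([f"{u:04X}" for u in left_code_points(sample, 1)])  # complete pair
```

The code-unit variant returns an unpaired surrogate, which is not a valid
character on its own; the code-point variant returns the whole character.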