<html> <head> <base href="https://bugs.documentfoundation.org/"> </head> <body><span class="vcard"><a class="email" href="mailto:erack@redhat.com" title="Eike Rathke <erack@redhat.com>"> <span class="fn">Eike Rathke</span></a> </span> changed <a class="bz_bug_link bz_status_NEW " title="NEW - Calc: incorrect formula text function result from Unicode's non-BMP characters. Functions LEN(), LEFT(), RIGHT(), MID(), SEARCH(), FIND(), REPLACE()" href="https://bugs.documentfoundation.org/show_bug.cgi?id=97198">bug 97198</a> <br> <table border="1" cellspacing="0" cellpadding="8"> <tr> <th>What</th> <th>Removed</th> <th>Added</th> </tr> <tr> <td style="text-align:right;">Status</td> <td>UNCONFIRMED </td> <td>NEW </td> </tr> <tr> <td style="text-align:right;">Ever confirmed</td> <td> </td> <td>1 </td> </tr></table> <p> <div> <b><a class="bz_bug_link bz_status_NEW " title="NEW - Calc: incorrect formula text function result from Unicode's non-BMP characters. Functions LEN(), LEFT(), RIGHT(), MID(), SEARCH(), FIND(), REPLACE()" href="https://bugs.documentfoundation.org/show_bug.cgi?id=97198#c11">Comment # 11</a> on <a class="bz_bug_link bz_status_NEW " title="NEW - Calc: incorrect formula text function result from Unicode's non-BMP characters. Functions LEN(), LEFT(), RIGHT(), MID(), SEARCH(), FIND(), REPLACE()" href="https://bugs.documentfoundation.org/show_bug.cgi?id=97198">bug 97198</a> from <span class="vcard"><a class="email" href="mailto:erack@redhat.com" title="Eike Rathke <erack@redhat.com>"> <span class="fn">Eike Rathke</span></a> </span></b> <pre>(In reply to Winfried Donkers from <a href="show_bug.cgi?id=97198#c10">comment #10</a>) <span class="quote">> As already said in <a href="show_bug.cgi?id=97198#c4">comment #4</a>, LO uses 16bits for unicode characters, which > is not enough for the UniCode 5.2.0 (referenced to in ODFF1.2).</span > That's not true. Strings are UTF-16, which is perfectly capable of representing all Unicode planes. 
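
To illustrate the UTF-16 point, here is a minimal sketch in Python (chosen only for brevity; LibreOffice's own code is C++, and the helper name utf16_units is made up for this example). It shows one non-BMP character being stored as a surrogate pair, that is, two 16-bit code units for a single code point:

```python
# Illustrative only: Python stands in for LibreOffice's C++ string code.
# utf16_units is a hypothetical helper, not part of any LibreOffice API.
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    raw = text.encode("utf-16-le")  # little-endian, no byte order mark
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


# 'a' followed by U+1D11E MUSICAL SYMBOL G CLEF, which lies outside the BMP.
sample = "a\U0001D11E"
units = utf16_units(sample)

print(len(sample))             # 2 code points
print(len(units))              # 3 UTF-16 code units
print([hex(u) for u in units]) # ['0x61', '0xd834', '0xdd1e']
```

Truncating that unit list after its second element would leave a lone high surrogate (0xd834) without its partner, so any function that counts or cuts in code units can split such a character.
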
However, the second example is not even about non-BMP characters. The "üë" in B1 contains the sequence
U+0075 LATIN SMALL LETTER U
U+0308 COMBINING DIAERESIS
U+0065 LATIN SMALL LETTER E
U+0308 COMBINING DIAERESIS
To regard such a combining sequence as one character, some Unicode normalization would have to take place, which, if done at all, should be implemented in a separate function.

As for the first attachment, these are real non-BMP characters. For this to work correctly, the mentioned spreadsheet functions should operate on actual Unicode code points instead of 16-bit code units. There's OUString::iterateCodePoints() to do that properly; a change in LO's string handling is not necessary.

<span class="quote">> As Excel 2016 has the same results as LO, interoperability would be affected
> when LO is changed.</span >

Does Excel also give the same result with the first attachment and real non-BMP characters? Anyway, I couldn't find a definition of how Excel treats non-BMP characters, but as LEN() is supposed to return the length in *characters*, not in 16-bit code units, I propose to actually handle it that way. I consider operating on code units in these string functions wrong, specifically with LEFT(), RIGHT(), MID() and REPLACE(): they should operate on characters, otherwise they could cut a non-BMP character in the middle.</pre> </div> </p> <hr> <span>You are receiving this mail because:</span> <ul> <li>You are the assignee for the bug.</li> </ul> </body> </html>