[Bug 163616] Match diacritics

Sat Oct 26 07:52:08 UTC 2024

https://bugs.documentfoundation.org/show_bug.cgi?id=163616

--- Comment #3 from madhavkiran.sodum at gmail.com ---
I definitely respect the previous thoughts and decision. Just a few days with
LibreOffice have convinced me that this is a way more powerful tool than
Microsoft Office. I just don't feel like going back to Microsoft Office now.

But some more thoughts for consideration in this matter:

1. The "diacritic-sensitive" check box doesn't seem to be synchronized across
all the find interfaces. I have it turned on in the Find & Replace dialogue but
it doesn't work in the quick find toolbar.

2. The Whole Word Match problem can still be circumvented quite easily in most
cases by putting spaces before and after the search term.

3. Match Kashida seems to really be a different issue. Kashida is not a
character but a code used to help in justifying the text. It is optional in
Arabic and also applicable only for Arabic script. It is added by default by
some programs. Diacritics/Accents are not like that. They are not optional
(they can change the meaning of the words). Thus even ASCII (though a very
basic implementation) has diacritical marks included. And so does ISO/IEC
8859-1:1998 ("extended" ASCII and the like). Languages that use Roman letters
with accents/diacritics definitely far outnumber those that use the Arabic
script.

4. Regular expressions are definitely non-trivial searches and are rightly not
placed in the quick search toolbar.

5. Right now LO's implementation doesn't follow the collation strength rules.
We can search ignoring both case and accents, but not ignoring case alone
(since accents are ignored by default). Ideally, IMHO, we should have a simple
option to toggle between the first three collation strengths:

[Quote]
The Strength attribute determines whether accent or case is taken into account
when collating or comparing text strings. In writing systems without case or
accent, the Strength attribute controls similarly important features.
The possible values are: primary (1), secondary (2), tertiary (3), quaternary
(4), and identity (I). 

To ignore:

    —accent and case, use the primary strength level
    —case only, use the secondary strength level
    —neither accent nor case, use the tertiary strength level

Almost all characters can be distinguished by the first three strength levels,
therefore in most locales the default Strength attribute is set at the tertiary
level. However if the Alternate attribute (described in a following row) is set
to shifted, then the quaternary strength level can be used to break ties among
white space characters, punctuation marks, and symbols that would otherwise be
ignored.
[End of Quote]
https://www.ibm.com/docs/en/db2/11.5?topic=collation-unicode-algorithm-based-collations
https://www.php.net/manual/en/collator.setstrength.php

With the Match Case check box we can switch between strength levels 2 & 3. And
by implementing a Match Diacritics check box we could switch between strength
levels 1 & 2.

6. Match Case and Match Diacritics are really very similar in implementation.
With Match Case off, "s" matches "s", "S" and "ß" (sharp S used in the German
language), whereas with Match Diacritics off, "s" would match "s", "ś", "ṣ",
etc.
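
To make points 5 and 6 concrete, here is a rough Python sketch of how the two
check boxes could line up with the first three strength levels. It is only an
approximation built on the standard unicodedata module, not LO's search code
and not a real collation implementation; the names fold, STRENGTH_FLAGS,
equal_at and matching are made up for this illustration. Note that Python's
casefold() turns "ß" into "ss", so the "s" / "ß" match mentioned above is not
reproduced here.

```python
import unicodedata

def fold(text, match_case, match_diacritics):
    """Reduce text to a comparison key according to the two check boxes."""
    decomposed = unicodedata.normalize("NFD", text)  # base letter + combining marks
    if not match_diacritics:
        # Drop combining marks, so that e.g. U+015B (s with acute) becomes "s".
        decomposed = "".join(
            ch for ch in decomposed if not unicodedata.combining(ch)
        )
    if not match_case:
        decomposed = decomposed.casefold()
    return unicodedata.normalize("NFC", decomposed)

# Assumed mapping: strength level -> (match_case, match_diacritics)
STRENGTH_FLAGS = {
    1: (False, False),  # primary: ignore accent and case
    2: (False, True),   # secondary: ignore case only
    3: (True, True),    # tertiary: ignore neither
}

def equal_at(a, b, strength):
    """Compare two strings at the given (emulated) strength level."""
    flags = STRENGTH_FLAGS[strength]
    return fold(a, *flags) == fold(b, *flags)

def matching(term, candidates, match_case, match_diacritics):
    """Return the candidates that the search term would match."""
    key = fold(term, match_case, match_diacritics)
    return [c for c in candidates
            if fold(c, match_case, match_diacritics) == key]

CANDIDATES = ["s", "S", "ś", "ṣ"]

if __name__ == "__main__":
    print(matching("s", CANDIDATES, match_case=True, match_diacritics=True))
    print(matching("s", CANDIDATES, match_case=False, match_diacritics=True))
    print(matching("s", CANDIDATES, match_case=False, match_diacritics=False))
    print(equal_at("resume", "Résumé", 1), equal_at("resume", "Résumé", 2))
```

In this sketch both options are the same kind of operation (one extra folding
step applied to the search term and to the text), which is the similarity
point 6 is getting at.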

7. Diacritics have become very common in English on Android phones (even the
US QWERTY layout offers many accented characters on long press), Macs and
iPhones. Kashida is found only in one language and really cannot be typed with
the ease with which we can type diacritics today.

8. Globalization has made it necessary to work with many languages in a single
document and English itself has loan words from other accented languages...

9. If the sidebar find and the quick find toolbar have almost the same
features, then the quick find offers no extra advantage.

10. Unfortunately, customization, one of the most powerful features of LO,
cannot solve this issue either, as Match Diacritics is not available as a .uno
command.

Regards
