Your presentation on LibreOffice code

Sat Mar 21 17:02:52 UTC 2020

On Fri, 20 Mar 2020 Jan-Marek Glogowski wrote:

> Hmm - I know fcitx uses some kind of tables for the direct mappings. My
> Debian has fcitx-table-emoji. Guess that would be the easiest starting
> point, if your languages typed letters don't depend already existing
> previous or next letters and just need some keys to code point mapping.
There are two separate issues here: keyboard input and display of the glyph. Leaving the input mechanism aside for the moment, and assuming that I have done what you suggest, I'd like to understand the code that deals with display in LO. Even if some external method did the input mappings and the resulting keycode came into LO, the problem remains: everything works fine with copy-paste, but not with keyboard input.

In the case of keyboard input, keycodes with a value above 65535 get truncated to 16 bits as they pass through the various layers of functions that handle them. The PUAs I use are values greater than 65535.
As an example, the values of keyval and aOrigCode in the arguments of GtkSalFrame::doKeyCallback are both 97 when you type the letter 'a' on the standard keyboard. If you print the individual elements of the array pStr in CommonSalLayout::LayoutText, you see the value 97 there as well. Now change the 97 to a PUA value in doKeyCallback (e.g. 1051531) and the corresponding value printed in LayoutText is the truncated one: 2955 instead of 1051531. 2955 is what you get when an integer holding 1051531 is written into a 16-bit short, since 1051531 mod 65536 = 2955.
I also see that uInt16 is used in many places in the code.
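
To make the arithmetic above concrete, here is a minimal sketch of the narrowing. The helper name narrow_to_16 is made up for illustration and is not LibreOffice code; it only shows what happens to the two example values when a 32-bit integer is stored in a 16-bit unsigned type.

```cpp
#include <cstdint>

// Hypothetical helper (not from the LibreOffice sources): store a 32-bit
// value in a 16-bit unsigned type. Only the low 16 bits survive, which is
// the same as taking the value modulo 65536.
constexpr std::uint16_t narrow_to_16(std::uint32_t value) {
    return static_cast<std::uint16_t>(value);
}

// 'a' fits in 16 bits, so it passes through unchanged.
static_assert(narrow_to_16(97) == 97, "values below 65536 are preserved");

// 1051531 (0x100B8B) loses its upper bits: 1051531 - 16 * 65536 = 2955.
static_assert(narrow_to_16(1051531) == 2955, "PUA value is truncated");
```

Any single 16-bit variable along the path from the key callback to the layout code would be enough to produce this result.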

At this point, I just want to understand the flow. I'm not suggesting that LO make any change. Where in the code do the key values get handled as they are typed in, and where do they get mapped to the value needed for displaying the glyph? I assume the value for display will be encoded in UTF-8. I'd like to know where in the source code that happens as well.

> Yup. No LO changes needed, unless you find some bug.
I'm definitely not suggesting changes, but am trying to understand the code as I explained above. However, I would not rule out the possibility that the copy-paste part of the code works well because it correctly reads the UTF-8 encoded values of the codepoints expected by the font file, while the values coming from keyboard input become incorrect as they pass through various layers of the program. I just want to know what these layers are.

> I'm not sure I understand you. Is this a Gtk-only problem, so qt5 or kf5
> works? I'm not aware of any restriction regarding file names. Sure Gtk+
> and Qt5 default to utf-8 encoding, but that should just work. Or do they
> reject PUA code points (which IMHO makes sense, because a filename has
> no font).

Not sure about other systems, but GNOME restricts input to valid Unicode values. It does not reject PUA code points, but it does reject 32-bit values encoded in UTF-8. I wrote my own UTF-8 encoding mechanism that would take 32-bit values, but some GNOME functions fail on the result, which is why I mapped my coding system to PUAs. As far as this discussion of LO's functionality is concerned, it is only related to PUA values.
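
As a rough illustration of the distinction above, the following sketch encodes a code point to UTF-8 only if it lies within the Unicode range (at most 0x10FFFF and not a surrogate) and separately checks whether it falls in one of the three Private Use ranges. All function names are made up for this example; this is neither my own encoder nor any GNOME or LO function, just the standard UTF-8 bit layout.

```cpp
#include <cstdint>
#include <string>

// Hypothetical check: is cp something a strict UTF-8 validator would accept?
constexpr bool is_unicode_scalar(std::uint32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Hypothetical check: is cp in one of the three Private Use ranges?
constexpr bool is_private_use(std::uint32_t cp) {
    return (cp >= 0xE000 && cp <= 0xF8FF) ||
           (cp >= 0xF0000 && cp <= 0xFFFFD) ||
           (cp >= 0x100000 && cp <= 0x10FFFD);
}

// Encode cp as UTF-8. Returns an empty string for values outside the
// Unicode range, mimicking a library that refuses such input.
inline std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (!is_unicode_scalar(cp)) {
        return out;
    }
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With these definitions, the PUA value 1051531 (0x100B8B) encodes to the four bytes F4 80 AE 8B, while a value such as 0x110000 yields an empty string because it is beyond the Unicode range.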

> From the filesystem POV it's all just bytes. 

This is not related to LO, but this is where many GNOME libraries impose the restriction. They do not follow the filesystem's view of filenames as just bytes. If you try using a g_filesystem* function and pass a filename containing a character which is not approved by the Unicode Consortium, it will fail. GNOME is not agnostic to the various standards out there but follows the standards set by some organizations. Of course, in those cases, I just use fopen or related calls.

-a
