recent-file-spec: possible design flaw?

Mon Jul 14 17:08:27 EEST 2003

Waldo Bastian wrote:

>-----BEGIN PGP SIGNED MESSAGE-----
>Hash: SHA1
>
>On Monday 14 July 2003 14:51, Oliver Braun wrote:
>  
>
>>Hi *,
>>
>>we - the SUN team working on OpenOffice.org - noticed a possible design
>>flaw in the recent-file-spec (or at least in the Gnome 2.2
>>implementation of it) when looking at the Ximian patches for
>>OpenOffice.org:
>>
>>it seems that Gnome 2.2 converts the full local file path to utf-8
>>before encoding the result as file url. It uses the text encoding
>>matching the current locale as "from" encoding. This is not reversible
>>if the path contains bytes that are not valid characters in this
>>encoding (multi-encoding paths)!
>>
>>The result will be that the application launched by the panel will not
>>be able to open such a file when chosen by the user from the "Open
>>Recent" menu. Unfortunately we made the same mistake in OpenOffice.org
>>1.x :(. The only way to handle multi encoding paths correctly seems to
>>be to encode the byte sequence as returned by the file system layer.
>>
>>The recent file spec says <QUOTE> All text in the file should be stored
>>in the UTF-8 encoding.</QUOTE>, which IMHO can easily be (mis-?)
>>understood as "convert file names to utf-8".
>>
>>How does KDE expect file urls to be encoded ?
>>    
>>
>
>I'm not aware of any recent-file-spec or KDE implementing it, but in general 
>KDE converts filenames from locale-encoding to 16-bit unicode which is used 
>internally, and typically stored on disk as utf-8.
>
>URL's are handled slightly differently: when storing filenames as URL's, they
>are re-encoded using the locale-encoding and then the non-ascii part is
>%-encoded. That results in a URL that consists of ASCII-chars only and the
>octets of the URL match 1:1 with the octets of the original filename
>(assuming decoding/encoding with the locale-encoding is reversible)
>
>We did identify a problem when using utf-8 as locale-encoding. If a filename 
>is not a valid utf-8 sequence then decoding/encoding such filename will 
>change it. We intend to fix that by recording the invalid-utf8 sequence in
>the 16-bit unicode string so that we can still convert it back to the 
>original sequence when converting back to "utf-8" (It will not be valid 
>utf-8)
>
>Are you aware of other encodings than utf-8 where this might be a problem?
>
Basically you can run into this problem with any encoding. Let's say a
Chinese user creates a file in a zh.BIG5 locale whose name contains
non-ASCII characters. Later (s)he logs in with the "C" locale. In this
case, the conversion from the locale encoding to utf-8/utf-16 may produce
some "?" characters, because the byte sequence that makes up the file name
is not valid in that encoding.
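
A rough Python sketch of that scenario, not taken from any of the
implementations discussed here. It assumes the "C" locale charset can be
modelled as plain ASCII; the function name and the sample file name are
made up for illustration.

```python
def round_trip_through_locale(name_on_disk: bytes, locale_charset: str) -> bytes:
    """Decode a file name with the locale charset, then encode it back.

    Bytes that are invalid in the charset are replaced instead of raising,
    which is roughly what a lenient converter does.
    """
    text = name_on_disk.decode(locale_charset, errors="replace")
    return text.encode(locale_charset, errors="replace")


# Hypothetical file name, stored on disk as Big5 bytes.
big5_name = "中文.txt".encode("big5")

# Same locale as the one that created the file: the bytes survive.
same_locale = round_trip_through_locale(big5_name, "big5")

# "C" locale, modelled as ASCII (an assumption): the non-ASCII bytes are lost.
c_locale = round_trip_through_locale(big5_name, "ascii")

print(same_locale == big5_name)
print(c_locale == big5_name)
print(c_locale)
```

The second round trip replaces every byte that ASCII cannot represent, so
the result no longer names the file that exists on disk.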

At the latest when file URLs are stored on disk in one locale but read
in another, the conversion from utf-8/utf-16 to the locale encoding will
fail, and the resulting file name will not match the one on disk.
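
For comparison, here is a minimal sketch of a conversion that stays
reversible, similar in spirit to the idea quoted above of recording the
invalid sequence inside the Unicode string. It uses Python's
"surrogateescape" error handler (PEP 383) and is only an illustration, not
how KDE or Gnome do it; the function names and the sample name are
hypothetical.

```python
def decode_keeping_invalid_bytes(raw: bytes) -> str:
    """Decode as UTF-8, mapping each invalid byte to a lone surrogate."""
    return raw.decode("utf-8", errors="surrogateescape")


def encode_restoring_invalid_bytes(text: str) -> bytes:
    """Encode as UTF-8, turning those lone surrogates back into raw bytes."""
    return text.encode("utf-8", errors="surrogateescape")


# Hypothetical file name in ISO-8859-1; the byte 0xE9 is not valid UTF-8.
raw_name = b"caf\xe9.txt"

text = decode_keeping_invalid_bytes(raw_name)
restored = encode_restoring_invalid_bytes(text)

print(repr(text))
print(restored == raw_name)
```

The intermediate string is not valid Unicode text (it contains a lone
surrogate), but it converts back to exactly the bytes found on disk.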

You can even try it yourself: save a file whose name contains German
umlauts, store the URL, and try to convert it back in the "C" locale. The
conversion will fail.
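
The umlaut case can be sketched in Python as follows. The sketch
percent-encodes the raw byte sequence of the path, as suggested in the
quoted message, so no locale is involved on the way back. The path, the
helper names, and the choice of ISO-8859-1 as the creating locale's charset
are assumptions made for the example.

```python
from urllib.parse import quote, unquote_to_bytes

FILE_SCHEME = "file://"


def path_bytes_to_url(raw_path: bytes) -> str:
    """Percent-encode the path bytes exactly as the file system returned them."""
    return FILE_SCHEME + quote(raw_path)


def url_to_path_bytes(url: str) -> bytes:
    """Undo the percent-encoding without consulting any locale charset."""
    if not url.startswith(FILE_SCHEME):
        raise ValueError("not a file URL: " + url)
    return unquote_to_bytes(url[len(FILE_SCHEME):])


# Hypothetical path whose name was created under an ISO-8859-1 locale.
raw_path = "/home/user/Grüße.txt".encode("iso-8859-1")

url = path_bytes_to_url(raw_path)
print(url)
print(url_to_path_bytes(url) == raw_path)

# Converting through the "C" locale (modelled as ASCII) fails instead.
try:
    raw_path.decode("ascii")
except UnicodeDecodeError as error:
    print("conversion failed:", error.reason)
```

The URL contains only ASCII characters, and decoding it yields the original
octets regardless of the locale in which it is read.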

Luckily, users don't do such things very often, or don't expect them to
work.

- Oliver

>
>Cheers,
>Waldo
>- -- 
>bastian at kde.org -=|[ SuSE, The Linux Desktop Experts ]|=- bastian at suse.com
>-----BEGIN PGP SIGNATURE-----
>Version: GnuPG v1.0.6 (GNU/Linux)
>Comment: For info see http://www.gnupg.org
>
>iD8DBQE/ErFFN4pvrENfboIRAupGAJkBGFpB16LhSxaQSA74TZXia/C6WwCbBA31
>OWgDG7K3nYnh88dYKm2LD1M=
>=C1Su
>-----END PGP SIGNATURE-----
>  
>