Menu spec: .desktop filename encoding issue

Mon Jul 25 18:14:00 EEST 2005

On Wed, 2005-07-20 at 17:03 +0100, Mark McLoughlin wrote:

> 	If a .desktop filename is encoded using some unknown charset[1] then
> the filename is essentially junk (i.e. you can't reliably convert it to
> a known encoding).
> 
> 	This could cause problems if you later try to use filename as a
> desktop-file-id in a .menu file since the filename would not be valid in
> the encoding of the .menu file.
> 
> 	I guess an implementation has two choices:

	The approach I first took was that desktop file IDs must always be
UTF-8. That entailed converting filenames from whatever encoding they
were in to UTF-8 and ignoring any files whose filename encoding we
couldn't decipher.

	There are at least two problems with this approach, though:

  1) Whether we could recognise the filename encoding of a given 
     file would essentially be dependent on the locale. So, in one 
     locale you might see a given entry in your menus, but not in 
     another locale.

  2) Two distinct files whose filenames are encoded differently could 
     have the same desktop file ID.
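
	Both problems can be reproduced with a small Python sketch. The
to_utf8_id helper and its "try UTF-8, then fall back to the locale
encoding" strategy are assumptions made purely for illustration; a real
implementation may guess encodings differently.

```python
from typing import Optional


def to_utf8_id(raw: bytes, locale_encoding: str) -> Optional[str]:
    """Hypothetical helper: guess the filename encoding, or give up (None)."""
    for encoding in ("utf-8", locale_encoding):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return None  # undecipherable filename: the file would be ignored


latin9_name = "foo€foo.desktop".encode("iso8859-15")  # on-disk octets, ISO8859-15
utf8_name = "foo€foo.desktop".encode("utf-8")          # on-disk octets, UTF-8

# Problem 1: the same file is visible in one locale and ignored in another.
seen_in_latin9_locale = to_utf8_id(latin9_name, "iso8859-15")
seen_in_ascii_locale = to_utf8_id(latin9_name, "ascii")

# Problem 2: two distinct on-disk names collapse to one desktop file ID.
ids_collide = (
    to_utf8_id(latin9_name, "iso8859-15") == to_utf8_id(utf8_name, "iso8859-15")
)
```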

	So, I think we need to look more carefully at this option:

>   1) When writing .menu files, escape the desktop-file-id such that it
>      is valid in the encoding of the file and unescape it when parsing.

	What we would need to be able to do is escape desktop file IDs so that
they are valid according to the encoding of the .menu file, while still
retaining their original encoding.

	One option is that we go with a simple variation of rfc2396's URI
escaped encoding and simply say that any octet sequence that is not
valid in the .menu file's encoding should be escaped as the percent
symbol followed by two hex digits representing the octet value.

	Backward compatibility wouldn't seem to be a huge issue - e.g. an old
implementation reading a .menu file using this escaped encoding will
simply fail to find the file (because it didn't unescape the name) or
find a file whose name matches the unescaped name. This is better than
the current situation of being unable to represent the filename at all
in the .menu file.

	e.g. if the filename is foo€foo.desktop encoded in ISO8859-15, then
the .menu file would include:

  <Filename>foo%a4foo.desktop</Filename>

	so an older implementation will try to match "foo%a4foo.desktop" and,
most likely, fail.
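
	A minimal Python sketch of the writing side, reproducing the example
above. It assumes the .menu file is UTF-8 (or another ASCII-compatible
encoding), and it also escapes a literal percent as "%25" as the appended
patch proposes. The name escape_filename is hypothetical.

```python
def escape_filename(raw: bytes, menu_encoding: str = "utf-8") -> str:
    """Sketch: escape on-disk filename octets for use in a .menu file."""
    # "surrogateescape" maps each undecodable byte 0xNN (>= 0x80) to the
    # lone surrogate U+DCNN, so the original octet value is not lost.
    decoded = raw.decode(menu_encoding, errors="surrogateescape")
    parts = []
    for ch in decoded:
        code = ord(ch)
        if ch == "%":
            parts.append("%25")  # literal percent, per the appended patch
        elif 0xDC80 <= code <= 0xDCFF:
            parts.append("%{:02x}".format(code - 0xDC00))  # invalid octet
        else:
            parts.append(ch)  # valid in the menu file's encoding: keep as-is
    return "".join(parts)


escaped = escape_filename(b"foo\xa4foo.desktop")  # ISO8859-15 name, UTF-8 menu
```

	A name that is already valid in the menu file's encoding passes
through unchanged, so only the problematic octets pay the escaping cost.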

	Conversely, a .menu file written by an older implementation may
contain desktop file IDs which contain the percent character. These
desktop file IDs would be mis-interpreted by newer implementations.
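
	On the parsing side, a sketch of how a newer implementation might
unescape, using the fallback from the appended patch: if any percent is not
followed by two hex digits, the name is left alone. The function name and
the UTF-8 default are assumptions.

```python
import re

_ESCAPE = re.compile(rb"%([0-9A-Fa-f]{2})")
_BAD_PERCENT = re.compile(rb"%(?![0-9A-Fa-f]{2})")


def unescape_filename(text: str, menu_encoding: str = "utf-8") -> bytes:
    """Sketch: recover the on-disk octets from a .menu file entry."""
    raw = text.encode(menu_encoding)
    if _BAD_PERCENT.search(raw):
        # Some percent is not a well-formed escape: treat the whole name as
        # unescaped, for compatibility with older .menu files.
        return raw
    return _ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)


on_disk = unescape_filename("foo%a4foo.desktop")
legacy = unescape_filename("50%off.desktop")       # left untouched
ambiguous = unescape_filename("save%20me.desktop")  # decoded to "save me.desktop"
```

	The last case is the mis-interpretation described above: an old
file ID that happens to look like a valid escape cannot be told apart
from a new-style escaped name.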

	The reasoning behind using our own simple escaped encoding rather than
fully implementing the rfc2396 encoding would be that our requirements
are less stringent - e.g. there is no need for us to escape space as %20
- and going with the full rfc2396 encoding would just exacerbate any
backward compatibility issues.

	One issue I see here is that if you take a menu file and convert it
from one encoding to another, you would be required to unescape the
filenames using the original encoding and then re-escape them for the new
encoding as part of the conversion process.
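
	To make the conversion step concrete, here is a rough Python sketch.
All three function names are hypothetical, the encodings are assumed to
be ASCII-compatible, and the escaping rules are the ones from the appended
patch.

```python
import re


def unescape(text: str, encoding: str) -> bytes:
    """Entry as read from the menu file -> on-disk filename octets."""
    raw = text.encode(encoding)
    if re.search(rb"%(?![0-9A-Fa-f]{2})", raw):
        return raw  # malformed escape: leave the name untouched
    return re.sub(rb"%([0-9A-Fa-f]{2})",
                  lambda m: bytes([int(m.group(1), 16)]), raw)


def escape(raw: bytes, encoding: str) -> str:
    """On-disk filename octets -> entry valid in the given encoding."""
    out = []
    for ch in raw.decode(encoding, errors="surrogateescape"):
        code = ord(ch)
        if ch == "%":
            out.append("%25")
        elif 0xDC80 <= code <= 0xDCFF:  # octet invalid in this encoding
            out.append("%{:02x}".format(code - 0xDC00))
        else:
            out.append(ch)
    return "".join(out)


def convert_entry(text: str, old_encoding: str, new_encoding: str) -> str:
    """Re-escape one filename entry when transcoding a menu file."""
    return escape(unescape(text, old_encoding), new_encoding)


# In an ISO8859-15 menu file the name needs no escaping; after conversion
# to UTF-8 the 0xa4 octet has to be escaped to keep matching the file.
converted = convert_entry("foo€foo.desktop", "iso8859-15", "utf-8")
```

	Simply transcoding the entry text instead would yield the UTF-8
octets for the euro sign, which no longer match the name on disk.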

	Appended is a suggested modification to the spec. The fact that this is
all so complex makes me think I'm missing something very obvious here,
though.

Cheers,
Mark.

Index: menu-spec.xml
===================================================================
RCS file: /cvs/menus/menu-spec/menu-spec.xml,v
retrieving revision 1.27
diff -u -p -r1.27 menu-spec.xml

--- menu-spec.xml	13 Apr 2005 13:32:12 -0000	1.27
+++ menu-spec.xml	25 Jul 2005 15:12:50 -0000
@@ -934,6 +934,36 @@ entries</ulink>: <varname>Categories</va
         </variablelist>
       </para>
     </sect2>
+    <sect2 id="menu-file-filename-escaping">
+      <title>Filename Escaping</title>
+      <para>
+       The contents of the &lt;AppDir&gt;, &lt;DirectoryDir&gt;, &lt;Filename&gt;, &lt;Directory&gt;,
+       &lt;MergeFile&gt;, &lt;MergeDir&gt; and &lt;LegacyDir&gt; elements identify specific files or
+       directories on the filesystem. Filenames which are not encoded in the same encoding as the menu
+       file must be escaped according to some simple rules:
+       <orderedlist>
+        <listitem>
+         <para>
+          A percent character, if it is in the same encoding as the menu file, is represented by the
+          sequence of characters "%25".
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          Any character or octet sequence which is not valid under the encoding of the menu file
+          is encoded as a series of octets, each octet represented by a percent character followed by
+          two hexadecimal digits representing the value of the octet.
+         </para>
+        </listitem>
+       </orderedlist>
+      </para>
+      <para>
+        When parsing the menu file, implementations should unescape filenames using the reverse
+        of these rules. However, if any percent character is not followed by two hexadecimal
+        digits, no un-escaping of the filename should occur in order to preserve compatibility
+        with previous versions of this specification.
+      </para>
+    </sect2>
   </sect1>
 
   <sect1 id="merge-algorithm">