Seeking clarification of Desktop Entry Specification
Bollinger, John C
John.Bollinger at STJUDE.ORG
Fri Apr 24 15:00:35 UTC 2020
(1) #basic-format says "A file is interpreted as a series of lines that
are separated by linefeed characters." #value-types says "The escape
sequences \s, \n, \t, \r, and \\ are supported for values of type string
and localestring, meaning ASCII space, newline, tab, carriage return,
and backslash, respectively." Even though the former is about the
structure of the file itself while the latter is about the encoded
payload, it is confusing that one talks about "linefeed" while the other
talks about "newline" and "carriage return". Should "newline" read
"linefeed" (meaning U+000A LINE FEED) instead?
When referring to ASCII characters, the names "newline" and "line feed" are synonymous. Both names refer to the character with ASCII code 10 (decimal). Additionally, both names convey the same idea for the action of a line-printer type output device: the print head is advanced to the next line (or, more accurately, the paper is scrolled one line forward). I would be inclined to say that no, the escape sequence '\n' should not be described as referring to a "linefeed" instead of a "newline", because '\n' is mnemonic for "newline", and that is the prevailing terminology in C and C-influenced languages, from which "\n" comes.
On the other hand, I take the words "linefeed character" in the description of the format to be chosen intentionally to avoid ambiguity about line termination. Desktop files are specified to consist of lines separated by exactly one linefeed character, regardless of whether that matches the standard line-termination semantics for text files on the host system. This word choice thus minimizes confusion that might otherwise arise around the fact that C and C++, when reading or writing a file in text mode, automatically translate between newline (linefeed) characters internally and whatever line termination is locally appropriate externally. The design choice results in the interpretation of .desktop files being insensitive to the conventions of the host environment.
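To illustrate the point about line termination, here is a minimal Python sketch of a reader that splits on U+000A only. The function names are hypothetical and this is not taken from any existing implementation; it only shows the reading of the spec described above, under which a carriage return preceding a linefeed stays part of the line.

```python
def split_desktop_lines(data: bytes) -> list:
    """Split raw .desktop file content into lines on U+000A only."""
    text = data.decode("utf-8")
    # str.split("\n") breaks on linefeed alone. str.splitlines() would also
    # break on "\r", "\r\n", and several other characters, which is not
    # what the format describes.
    return text.split("\n")


def read_desktop_lines(path: str) -> list:
    """Read a file in binary mode so no newline translation is applied."""
    with open(path, "rb") as f:
        return split_desktop_lines(f.read())
```

With CRLF input such as b"Name=Foo\r\n", the first line comes back as "Name=Foo\r": the carriage return is ordinary data, not part of the terminator.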
Whereas I agree that the disparity in word choice is potentially confusing, I do not think that the wording should be changed in either place. Possibly, however, there is room for a clarifying comment.
(2) #entries says "Space before and after the equals sign should be
ignored". Does that mean just U+0020 SPACE, or also other kinds of
white space, like U+0009 CHARACTER TABULATION?
Inasmuch as the wording says simply "space", and not "space characters", I take it to be inclusive of any sequences of U+0020 and / or U+0009 characters. Neither of these may appear in keys, in any case, so the alternative to accepting them both as constituents of "space" is to reject files that use any tabs between key and "=". But I do agree that this is ambiguous and should be cleared up. In particular, since desktop files are encoded in UTF-8, they can also contain any of the relatively many other characters that Unicode categorizes as space characters, and it is unclear whether it is intended that the "space" around '=' signs be inclusive of all these. I suspect that implementations generally accept only U+0020 and U+0009 as "space" in this sense, and maybe U+000D, but it is hard to justify that specific choice from the wording.
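As a sketch of the interpretation I suspect is most common, the following Python function trims only an explicit set of characters around the first "=". The name split_entry and its space parameter are hypothetical, and the default set (U+0020 and U+0009) is my assumption about typical behavior, not something the spec states.

```python
def split_entry(line: str, space: str = " \t"):
    """Split a 'Key=Value' line at the first '=', trimming `space` around it."""
    key, sep, value = line.partition("=")
    if not sep:
        raise ValueError("not a key=value entry: %r" % line)
    # Strip with an explicit character set. A bare str.strip() would remove
    # every character Unicode classifies as white space (U+00A0 and so on),
    # which is a broader reading than the wording clearly supports.
    return key.rstrip(space), value.lstrip(space)
```

An implementation that also wants to tolerate carriage returns around the equals sign could pass space=" \t\r". Trailing white space in the value is left alone here, since it is not adjacent to the equals sign.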
(3) It is unclear exactly when the escape sequences mentioned in (1)
need to be used in string/localestring values:
* "\\" apparently needs to be used at least whenever the following
character is one of "s", "n", "t", "r", or "\". But what about
sequences like "\a", does it render the file ill-formed, or is it an
accepted shortcut for the fully escaped "\\a"?
That is a fair question, and one whose answer I agree is ambiguous. I would be inclined to guess that there is a diversity of implementation. It is clear that "\a" is not a recognized escape sequence, as it is not included in the enumeration of those, so what is it? Absent an update to the spec, I would be inclined to say that authors should avoid writing such combinations, and processors should interpret any they encounter as if they were "\\a".
Note also that there are special rules for an additional level of quoting for values of "Exec" keys.
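The lenient handling of unrecognized sequences suggested above might be sketched in Python as follows. The names ESCAPES and unescape are hypothetical, and treating "\a" as a literal backslash followed by "a" is my suggestion rather than anything the spec requires. The additional quoting rules for "Exec" values are not handled.

```python
# The five escape sequences the spec enumerates for string/localestring.
ESCAPES = {"s": " ", "n": "\n", "t": "\t", "r": "\r", "\\": "\\"}


def unescape(value: str) -> str:
    """Decode the five known escapes; keep any other backslash literally."""
    out = []
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value) and value[i + 1] in ESCAPES:
            out.append(ESCAPES[value[i + 1]])
            i += 2
        else:
            # Ordinary character, an unrecognized sequence such as "\a",
            # or a lone trailing backslash: copy it through unchanged.
            out.append(ch)
            i += 1
    return "".join(out)
```

Under this sketch both "\a" and "\\a" in a file decode to a backslash followed by "a", which is why authors are better off writing the fully escaped form.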
* "\n" apparently needs to always be used (at least with the "newline"
vs. "linefeed" clarification from (1)).
If you want a newline (linefeed) character in a value then you must represent it as "\n", because a literal newline would terminate the value.
* "\s" (and maybe also "\t" and "\r"?) apparently needs to be used at
the very start of a string/localestring value (see (2)). But does it
also need to be used e.g. at the very end of such a value? (From common
practice, it appears that it at least doesn't need to be used for a
space somewhere in the middle of such a value.)
* What about "\t" and "\r"?
"\s" is definitely needed, and I would argue "\t", too, if spaces or tabs are wanted in a value before the first non-"space" character. On the other hand, values end at the end of the line, so a strict reading of the specification does not suggest that spaces and tabs need to be escaped there, but an author who wants those at the end would be wise to use the escapes, as implementations may vary. But for their part, implementations should include trailing space and tab characters in values. Spaces and tabs appearing other than at the beginning and end of a value do not need to be escaped, and for legibility they should not be. Any implementation that interprets literal spaces or tabs in the interior of a string or localestring value as anything other than themselves is definitely erroneous, and to my knowledge, there are no such implementations of any significance to be concerned about.
As for "\r": the spec seems clear that for the purposes of .desktop files, literal carriage-return characters are not line terminators in their own right, and they do not combine with linefeed characters to serve as line terminators. They have no special significance, unless -- and this is ambiguous -- as characters that can be part of the "space" around equals signs separating keys from their values. Authors would be wise to use the escape sequence instead of a literal carriage-return character everywhere that they want to represent a carriage return in a value, and especially at the beginning and end. Doing so will avoid the potential for misinterpretation by implementations that do not fully conform, and will avoid a variety of accompanying display quirks for people viewing or editing the file with a text editor. I would suggest that implementations include literal carriage returns in the "space" that may appear around "=" signs. Anywhere else in a value, they should be treated as ordinary characters (this is the same treatment that I take to be specified for literal spaces and tabs).
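For authors or tools writing values, the conservative advice above could be sketched in Python like this. The function name escape is hypothetical. It escapes backslashes, newlines, and carriage returns everywhere, escapes spaces and tabs only in leading and trailing runs, and leaves interior spaces and tabs literal for legibility.

```python
def _escape_blanks(s: str) -> str:
    # Helper: turn literal spaces and tabs into their escape sequences.
    return s.replace(" ", "\\s").replace("\t", "\\t")


def escape(value: str) -> str:
    """Encode a string value so it survives as one line of a .desktop file."""
    # Backslash first, so the backslashes added for \n and \r are not doubled.
    out = (value.replace("\\", "\\\\")
                .replace("\n", "\\n")
                .replace("\r", "\\r"))
    # Separate the leading and trailing runs of space/tab from the interior.
    body = out.strip(" \t")
    lead_len = len(out) - len(out.lstrip(" \t"))
    head = out[:lead_len]
    tail = out[lead_len + len(body):]
    # Only the edges are escaped; interior spaces and tabs stay literal.
    return _escape_blanks(head) + body + _escape_blanks(tail)
```

Escaping trailing blanks is not strictly required by my reading of the spec, but it protects against implementations that trim the end of the line.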
Overall: .desktop files should be interpreted as binary files with a human-readable-ish, text-like format. They are not text files, and should not be interpreted as such, even on systems where their character encoding and line-termination semantics happen to coincide with the local convention for text files.
--
John