text/xml vs application/xml
Dave Cridland
dave at cridland.net
Fri Dec 2 01:48:17 EET 2005
On Thu Dec 1 20:53:17 2005, Christian Rose wrote:
> On 12/1/05, Kevin Krammer <kevin.krammer at gmx.at> wrote:
> > Isn't an XML file considered to be in ASCII unless a different
> > encoding is specified by the processing instruction?
>
> Not really. Unless other information is given, AFAIK an XML file is
> to be assumed to be in UTF-8.
> Quote from http://www.w3.org/TR/REC-xml/#charencoding :
>
> "In the absence of information provided by an external transport
> protocol (e.g. HTTP or MIME),
Right. So what you're then asking is, "does HTTP or MIME provide a
default?", because that is still information.
RFC3023 implies that the default for MIME is US-ASCII, and explicitly
defines the default for HTTP as US-ASCII, overriding HTTP's usual
default (for text/*) of ISO-8859-1. I say "implies" a default for
MIME (thus email), because I don't actually see a specified default
in RFC2046 for anything except text/plain, but RFC3023 appears to
reference that default. (It's late, I might well have missed
RFC2046's default, but I did look reasonably hard, as I wanted to
quote the text.)
So Kevin's right - provided, of course, that you had the opportunity
for a charset parameter but didn't get one. If you never had that
opportunity, then REC-xml takes over.
> As a consequence, a file containing only ASCII characters but no
> encoding information would be valid XML. But *assuming* that any
> file without encoding information will be valid ASCII is plain
> wrong. Valid ASCII is always valid UTF-8, but not necessarily the
> other way around.
Yes, this is true. The problem is that it only holds for a file held
on a simple filesystem with no ability to provide a content-type. If
you *do* have a MIME content-type field for text/xml, then by default
you have US-ASCII, since the optional charset parameter still tells
you that even when absent.
In other words, a file on a traditional filesystem which indicates
(via extension, etc) text/xml has to be treated using REC-xml Section
4.3.3, which you quoted, but one retrieved via a VFS has to be
assumed to be US-ASCII.
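
To make the rule concrete, here is a rough sketch in Python of how I
read it. The function name transport_charset and the via_mime flag are
made up for illustration; they are not from any RFC or library. It
returns the charset the transport layer dictates, or None when nothing
is dictated and REC-xml Section 4.3.3 has to decide.

```python
from email.message import Message


def transport_charset(content_type, via_mime=True):
    """Return the charset the transport dictates, or None to defer to REC-xml 4.3.3.

    Hypothetical helper: `via_mime` is False for a plain file whose type
    was only guessed (extension, magic), so no charset parameter could exist.
    """
    if not via_mime or content_type is None:
        return None  # no MIME layer at all: REC-xml 4.3.3 applies

    # Use the stdlib header parser rather than splitting on ';' by hand.
    msg = Message()
    msg["Content-Type"] = content_type

    explicit = msg.get_param("charset")
    if explicit is not None:
        return explicit.lower()  # an explicit charset parameter always wins

    if msg.get_content_type() == "text/xml":
        return "us-ascii"  # RFC3023 default when the parameter is absent

    return None  # e.g. application/xml: fall back to REC-xml 4.3.3
```

So transport_charset("text/xml") gives "us-ascii", while the same
media type with via_mime=False gives None, which is exactly the
filesystem-versus-VFS split described above.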
Rejoice, because this is better than text/plain, which changes
default character sets depending on whether you got it from email,
web, or local disk.
But wait, because it's about to go horribly wrong. :-)
The type system that most desktops use, whether using the
freedesktop.org specification or not, uses only media types, not the
full content-type. So does this mean that we're really using MIME on
a local filesystem (we get a media-type, after all, so we assume all
optional parameters are absent), or does this really mean it isn't
MIME, and merely shares a subset of the syntax? The answer in turn
changes the default character set for text/xml, depending on your
reading of RFC2046 and RFC3023.
application/xml is joyously unaffected by this - if no character set
is specified, then you fall back to Section 4.3.3 of REC-xml, however
you got it, which says that you either have a BOM, or use UTF-8, or
provide a (presumably ASCII compatible) encoding declaration. (So not
*quite* UTF-8 as a default).
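
For what it's worth, the Section 4.3.3 fallback can be sketched
roughly like this in Python. The name sniff_xml_encoding is invented
for the example, and it is deliberately simplified: it only knows the
UTF-8 and UTF-16 BOMs, and it only finds an encoding declaration
written in an ASCII compatible encoding.

```python
import codecs
import re

# Byte order marks this sketch recognises (UTF-32 and others are ignored).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# encoding="..." inside an XML declaration at the very start of the data.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding of XML bytes when no charset came from outside."""
    # 1. A BOM settles the matter.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # 2. Otherwise look for an encoding declaration.
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. Neither present: UTF-8.
    return "utf-8"
```

A document starting with <?xml version="1.0" encoding="ISO-8859-1"?>
comes back as "iso-8859-1", which is why UTF-8 is only the default
when both the BOM and the declaration are missing.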
Dave.
--
You see things; and you say "Why?"
But I dream things that never were; and I say "Why not?"
- George Bernard Shaw