text/xml vs application/xml

Dave Cridland dave at cridland.net
Fri Dec 2 01:48:17 EET 2005


On Thu Dec  1 20:53:17 2005, Christian Rose wrote:
> On 12/1/05, Kevin Krammer <kevin.krammer at gmx.at> wrote:
> > Isn't an XML file considered to be in ASCII unless a different
> > encoding is specified by the processing instruction?
> 
> Not really. Unless other information is given, AFAIK an XML file is
> to be assumed to be in UTF-8.
> Quote from http://www.w3.org/TR/REC-xml/#charencoding :
> 
> "In the absence of information provided by an external transport
> protocol (e.g. HTTP or MIME),

Right. So what you're then asking is, "does HTTP or MIME provide a 
default?", because that is still information.

RFC3023 implies that the default for MIME is US-ASCII, and explicitly 
defines the default for HTTP as US-ASCII, overriding HTTP's usual 
default (for text/*) of ISO-8859-1. I say "implies" a default for 
MIME (thus email), because I don't actually see a specified default 
in RFC2046 for anything except text/plain, but RFC3023 appears to 
reference that default. (It's late, I might well have missed 
RFC2046's default, but I did look reasonably hard, as I wanted to 
quote the text.)

So Kevin's right - if, of course, you had the opportunity for a 
charset parameter, but didn't get one. If you had no such 
opportunity, then REC-xml takes over.
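
A minimal sketch of that decision for text/xml, in Python. The function
name and its arguments are hypothetical, chosen only for illustration;
the rule it encodes is the reading of RFC3023 given above, not a quote
from any library.

```python
from typing import Optional


def text_xml_charset(has_content_type: bool,
                     charset: Optional[str]) -> Optional[str]:
    """Pick the charset to assume for a text/xml entity.

    has_content_type: True if the transport (HTTP, MIME) could have
        carried a charset parameter at all.
    charset: the charset parameter value, or None if it was absent.

    Returns a charset name, or None meaning "defer to REC-xml 4.3.3".
    """
    if not has_content_type:
        # No opportunity for a charset parameter: REC-xml takes over.
        return None
    if charset:
        # An explicit charset parameter is authoritative.
        return charset.strip('"').lower()
    # The parameter could have been given but was not: US-ASCII.
    return "us-ascii"
```

So an HTTP response with a bare "Content-Type: text/xml" yields
"us-ascii", while a plain file with no content-type at all yields None.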

> As a consequence, a file containing only ASCII characters but no
> encoding information would be valid XML. But *assuming* that any
> file without encoding information will be valid ASCII is plain wrong.
> Valid ASCII is always valid UTF-8, but not necessarily the other way
> around.

Yes, this is true. The problem is that it only holds for a file held 
on a simple filesystem with no ability to provide a content-type. If 
you *do* have a MIME content-type field, then by default you have 
US-ASCII, since the optional charset parameter still tells you that 
even when absent.
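
To make the ASCII versus UTF-8 point concrete, here is a small Python
check. The helper name and the two sample documents are made up for
this illustration.

```python
def decodes_as(data: bytes, charset: str) -> bool:
    """Return True if data is a valid byte sequence in charset."""
    try:
        data.decode(charset)
    except UnicodeDecodeError:
        return False
    return True


# Only ASCII characters: valid as both US-ASCII and UTF-8.
ascii_doc = b"<name>Rose</name>"

# Contains U+00E9, which UTF-8 encodes as two non-ASCII bytes.
utf8_doc = "<name>Ros\u00e9</name>".encode("utf-8")
```

Here decodes_as(utf8_doc, "ascii") is False, which is exactly the case
where a US-ASCII default would reject a file that REC-xml would accept.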

In other words, a file on a traditional filesystem which indicates 
(via extension, etc.) text/xml has to be treated using REC-xml Section 
4.3.3, which you quoted, but one retrieved via a VFS has to be 
assumed to be US-ASCII.

Rejoice, because this is better than text/plain, which changes 
default character sets depending on whether you got it from email, 
web, or local disk.

But wait, because it's about to go horribly wrong. :-)

The type system that most desktops use, whether using the 
freedesktop.org specification or not, uses only media types, not the 
full content-type. So does this mean that we're really using MIME on 
a local filesystem (we get a media-type, after all, so we assume all 
optional parameters are absent), or does this really mean it isn't 
MIME, and merely shares a subset of the syntax? That in turn 
changes the default character set for text/xml, depending on your 
reading of RFC2046 and RFC3023.
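
The ambiguity shows up as soon as you parse the value. A sketch using
Python's standard email package (the wrapper function is hypothetical):
a bare media type and a full content-type with no charset parameter
come out identical, so the parser cannot tell you which of the two
readings applies.

```python
from email.message import Message
from typing import Optional, Tuple


def split_content_type(value: str) -> Tuple[str, Optional[str]]:
    """Split a Content-Type value into (media type, charset or None)."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_type() returns the lowercased type/subtype;
    # get_param() returns None when the parameter is absent.
    return msg.get_content_type(), msg.get_param("charset")
```

Both the desktop's "text/xml" and an HTTP "text/xml" header with no
parameters give ("text/xml", None); whether that None means US-ASCII
or "ask REC-xml" is the open question.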

application/xml is joyously unaffected by this - if no character set 
is specified, then you fall back to Section 4.3.3 of REC-xml, 
however you got it, which says that you either have a BOM, or use 
UTF-8, or declare a (presumably ASCII compatible) encoding. (So not 
*quite* UTF-8 as a default).
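
A rough Python sketch of that fallback for application/xml. This is a
deliberate simplification, not a full implementation of REC-xml's
detection: it ignores UTF-32, BOM-less UTF-16 and EBCDIC, and the
function name and regular expression are my own assumptions.

```python
import codecs
import re
from typing import Optional

# Matches an encoding declaration at the very start of the entity,
# assuming the declaration itself is in an ASCII-compatible encoding.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def application_xml_charset(data: bytes,
                            charset: Optional[str] = None) -> str:
    """Guess the encoding of an application/xml entity."""
    if charset:
        # An explicit charset parameter wins.
        return charset.lower()
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    match = _DECL.match(data)
    if match:
        # No BOM, but the XML declaration names an encoding.
        return match.group(1).decode("ascii").lower()
    # No BOM and no declaration: UTF-8.
    return "utf-8"
```

Note that nothing here depends on how the bytes arrived, which is the
whole attraction of application/xml.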

Dave.
-- 
           You see things; and you say "Why?"
   But I dream things that never were; and I say "Why not?"
    - George Bernard Shaw


