mime-type/application mapping spec, take #2

Tue Jul 8 17:27:24 EEST 2003

On Mon, 2003-07-07 at 18:09, David Faure wrote:
> I just found more cases for mimetype inheritance.
> text/docbook is a special case of text/sgml, for instance.
> If you have no application that can handle docbook, then you want
> to see those that can handle sgml, since they'll work fine with a docbook file.
> The same can be said for all sgml variants, including all xml variants, etc.
> We have a real inheritance tree there.

a) text/docbook is not a registered media type.
b) There is already a known and documented method for XML which could
easily be adapted - the +xml suffix described in RFC3023.

Therefore this should be "text/x-docbook+sgml" - which would then be
handled by an SGML parser if "text/x-docbook+sgml" itself were not
handled. New and exciting SGML types, having also this +sgml suffix,
could then be handled with reasonable fallback, too.
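
As a rough sketch of that fallback, assuming the suffix convention is
applied the way RFC3023 applies +xml: the function name and the handler
table below are made up for illustration, and keeping the same top-level
type for the fallback is my assumption, not something any spec mandates.

```python
def fallback_chain(media_type):
    """Return the media types to try, most specific first."""
    media_type = media_type.strip().lower()
    chain = [media_type]
    top, _, subtype = media_type.partition("/")
    base, plus, suffix = subtype.rpartition("+")
    if plus and base and suffix:
        # Assumption: "type/foo+sgml" falls back to "type/sgml".
        chain.append(f"{top}/{suffix}")
    return chain


def pick_handler(media_type, handlers):
    """Return the first handler registered for any type in the chain."""
    for candidate in fallback_chain(media_type):
        if candidate in handlers:
            return handlers[candidate]
    return None


# Hypothetical handler table: no docbook tool, but an SGML one exists.
handlers = {"text/sgml": "some-sgml-editor"}
print(pick_handler("text/x-docbook+sgml", handlers))
```

No inheritance tree is needed for this case; the name alone carries the
fallback.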

I'm not claiming this will always happen, just that there are known and
documented conventions for naming media types, and I'm certain we'd be
better off by using them, rather than not.

> > Media types do need some form of canonicalization before any application
> > does anything with them, hence my somewhat pie-in-the-sky suggestion of
> > a formal registry at XDG for them. 
> 
> Which is exactly what is provided in http://www.freedesktop.org/standards/shared-mime-info
> (get the tarball to see it).

Which:
a) Contains several non-IANA, non "x-" prefixed types.
b) Does not contain the full list of IANA registered types - including,
incidentally, application/vnd.kde.*
c) Contains some "x-" prefixed types which may potentially be used for a
different meaning outside of XDG compliant desktops.
d) Contains no information on how to interpret non-XDG media types, such
as those from a random webserver or received via MIME from a non-XDG based
MUA.
e) Impressively contains a magic element helping me to identify unknown
files. This made me giggle, sorry. I realise it's probably meant for
executables and suchlike.

I can correct (a) and (b) easily enough from
http://www.iana.org/assignments/media-types/

The specification itself has no explanation of:
a) How to get this file.
b) How this file may be updated.
c) What media types will, and will not, be included.

> > As for SMB hosts being "like" a directory, you surely mean "can be
> > presented to the user like", since the actual access methods used are
> > wildly different.
> No, not for us at least. You click on it, it enters it, adds it to the URL,
> and asks the kioslave (VFS thingie) to list it. To the file manager it really
> is a directory.

The whole kioslave thing is to do with how to resolve a URL to get an
object. To the file manager, the object does indeed have directory
semantics, albeit possibly not quite the same as a local directory's
semantics. Directories and collections do not have media types. (DAV
collections, for instance, are simply URLs. The actions you can perform
on them allow for retrieval of a list of contents, and this list has a
media type.) Of course, you're welcome to shoehorn in media types for
directories, but I suspect you're trying to solve two problems with the
same solution, which invariably makes the solution considerably more
complex. (RFC1925 2.5 - See? I can even cite April Fool's RFCs. :-))

You can have more fun than this simple case, of course: Consider the URL
"http://www.foo.int/some-stuff.tgz". This is a URL, which resolves to an
object. This object may have either directory or file semantics (both,
but not at the same time). There are two possible media types,
application/x-gzip and application/x-tar. There is no content transfer
encoding, but if it's "application/x-tar", there's a content encoding of
"gzip".

Just to complicate things, which semantics you get depends on the
implementation entirely. Which media type you get depends on how the
remote webserver is configured, which is entirely out of our control.
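
To illustrate the ".tgz" case, here is a small sketch using Python's
standard mimetypes module, which reports a media type and a content
encoding separately. This is only a guess from the file name; what a real
webserver sends in its headers is, as said, up to its configuration. The
function name is mine.

```python
import mimetypes
import posixpath
from urllib.parse import urlsplit


def guess_from_url(url):
    """Guess (media type, content encoding) from the URL's file name only."""
    name = posixpath.basename(urlsplit(url).path)
    # A fresh MimeTypes() uses the module's built-in tables only, so the
    # result does not depend on any system-wide mime.types file.
    return mimetypes.MimeTypes().guess_type(name)


print(guess_from_url("http://www.foo.int/some-stuff.tgz"))
# -> ('application/x-tar', 'gzip')
```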

We can, however, take the media type and canonicalize it, such that we
always end up with the same media type.

Essentially, we have:

a) An address of some form, which can be canonicalized into
b) A URL, which can be resolved somehow into
c) An object, which may prove to have
d) content, which has a
e) Media type, which we can use to find handlers for it.

Not all URLs resolve to a media type, of course, unless you've
shoehorned that in.

It might be worth a look at the FTPEXT draft covering MLSD, since this
has to deal with objects in TVFS with multiple semantics, only some of
which have content, and therefore media types.

> > 5) Relying on anything beginning "x-" in the IETF world to stay stable
> > is asking for trouble, sorry.  Hence my suggestion of a registry - at
> > least we'd have some stability there.
> See above - we have that already.

See above - we have a limited solution.

> > But hang on... Given that there is no standard at the moment, isn't this
> > going to happen anyway to an extent? (Minor point, incidentally, I'd
> > suggested terminating the prefix with a dot, since that's how the
> > current prefixing operates within IANA.)
> ... but not what the major environments do right now. Sorry for being conservative,
> but any change here has a HUGE impact on all the existing software.

But it shouldn't do. We need, regardless, a method for taking one media
type, and translating it to one we understand, or
application/octet-stream, for actual processing. Otherwise we'll have to
enter "application/x-sh", "application/x-shell-script", "text/x-sh", and
all the other variations in separately.

Yes, inheritance can indeed solve this problem, but so can simple
aliasing.
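
A minimal sketch of what I mean by simple aliasing, assuming a flat alias
table and a set of types we understand. Both tables are invented for the
example, as is the choice of "text/x-sh" as the canonical name.

```python
# Hypothetical tables; a real registry would be far larger.
KNOWN = {"text/x-sh", "text/plain", "application/octet-stream"}
ALIASES = {
    "application/x-sh": "text/x-sh",
    "application/x-shell-script": "text/x-sh",
}


def canonicalize(media_type):
    """Map a media type to one we understand, or application/octet-stream."""
    # Drop any parameters and normalize case before looking anything up.
    name = media_type.split(";", 1)[0].strip().lower()
    name = ALIASES.get(name, name)
    return name if name in KNOWN else "application/octet-stream"


print(canonicalize("Application/X-Shell-Script"))
print(canonicalize("application/x-never-heard-of-it"))
```

Handlers then only ever need to be registered against the canonical name.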

Given a media type that we don't know anything about, however, we need
to work purely off the name of the media type, the name of the file (if
any) and the content of the data (if we have it yet).

If we know about the media type, but we don't have the right application
available, then falling back in a known manner, via some form of
aliasing, is useful - inheritance is overkill.

You seem to be concentrating your efforts on handling content for which
we know something of the nature. If we don't, inheritance does not work
any more than simple aliasing. If we do, I don't see how it helps, except
in cases where our naming convention for media types is wrong.

> > Agreed, Apache may well tell us an object is of a certain media type
> > which isn't a "formal" XDG type, but equally, the canonicalization
> > should catch this. By specifying a prefix to the XDG
> > standard-but-yet-not-standard media types, we can be reasonably certain
> > that we're getting what we expect.
> Or by all sticking to the list of mimetypes provided in the shared-mime-info "standard".

Great. You ask Bill Gates to stick to this standard list, and I'll go
for Apple. Anyone else want to contact every webserver administrator in
the world, and we'll need some help tackling everyone using a mailcap
file, I think. Should be finished by next Thursday. ;-)

Or else we can assume that not everyone will stick to - or even know
about - the standard-but-not-standard list we promote, and try to find
methods for handling such cases.

> > [rather abstract stuff about Semantics snipped]
> 
> > B) Media types
> > 
> > 1) We need some method for canonicalization of existing MIME media
> > types, such that all XDG conformant environments agree on the same set
> > of media types, modulo environment specific types.
> See shared-mime-info.

We need a method of canonicalization, not just a list of canonical
types. So we need aliasing, basically, to tell us that
"application/x-this-might-be-a-bourne-shell-script-but-it-might-be-bash"
is probably worth treating as "text/x-sh"; "text/plain;
charset=I-Invented-This-Charset-This-Morning" as
"application/octet-stream", etc.

> > 3) The agreed set of non-IANA media types should be held within some
> > form of registry.
> >  - Which we may have, however, I'm not sure from the specification.
> We do have.

We've got a file, not a registry as such. It needs more definition WRT
how this file changes over time, and how it can be changed, and who
changes it, and why. Also being able to have a single URL, rather than
"get this tarball and unpack it, we've hidden it in there.", would be
nice.

> > 4) The agreed set of non-IANA media types should be prefixed to avoid
> > potential collisions.
> >  - I still like this idea. :-)
> And I don't like the idea of breaking everything currently done by KDE, Gnome
> Apache and more (in fact almost everything). This makes no practical sense,
> only theoretical sense.

Apache isn't going to be affected by this at all. Why should it be? Are
they going to follow our standard?

All it means is that *if* an application sees a media type of
"application/x-xdg.foo" then it can be reasonably sure it really is a
Foo file. Nothing more. Obviously standard types (IANA registered) don't
get this treatment, since that *would* affect lots of other things.

Having said that, it is reasonable to assume that there are, in effect,
certain x- prefixed media types which we need to treat as de-facto
standards, so maybe we restrict ourselves to new ones, those for which
we have a known collision, and those for which the de-facto media type
is demonstrably wrong.

> > 5) Where a specific subtype is not known to the system, the system may
> > choose a default based on the top level type, if one is defined.
> I would prefer the much more fine-grained approach of mimetype inheritance,
> as outlined above.
> There's no guarantee that your "audio player" can handle audio/newtype,
> so giving it audio/* doesn't sound too good (no pun intended :).

I agree, but it's still a reasonable offer to make to the user.

"This file is an unknown kind of audio. Shall I try playing it with
{X|G|K}{[multi]media|audio}{player|system}?"
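
Sketched out, with made-up handler names and a hypothetical table of
per-top-level-type defaults, the offer could look like this. The flag
tells the caller to ask the user before launching anything.

```python
# Hypothetical tables for illustration only.
HANDLERS = {"audio/mpeg": "mp3-player"}
TOP_LEVEL_DEFAULTS = {"audio": "generic-audio-player"}


def choose_handler(media_type):
    """Return (handler, needs_confirmation) for a media type."""
    media_type = media_type.strip().lower()
    if media_type in HANDLERS:
        # Exact match: no need to bother the user.
        return HANDLERS[media_type], False
    top = media_type.partition("/")[0]
    if top in TOP_LEVEL_DEFAULTS:
        # Unknown subtype: offer the default, but only as a suggestion.
        return TOP_LEVEL_DEFAULTS[top], True
    return None, False


print(choose_handler("audio/newtype"))
```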

> On the other hand, giving text/docbook to a text/sgml application, or to
> a text/plain application (text/sgml inheriting from text/plain), is much safer.
> 
> > A "text/*" type with an unknown charset has to be treated as
> > "application/octet-stream" by the system. RFC2046, 4.1.4.
> OK. Hopefully this is a very very rare case though :) We know about a LOT
> of charsets in Qt/KDE, at least. I've never seen this problem happen.

It's almost certainly a rare case; I'd have thought that the charset
handling in almost any desktop app is pretty good these days. But still,
new charsets might be invented, and someone might spell a charset name
wrong somewhere, or the user might be perverse enough to try handling an
odd 8-bit charset text file through some legacy (A technical term for "I
think it's crummy") app, and screw up the terminal window it runs in.
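
For what it's worth, the check is cheap. A sketch, assuming that "known
charset" means "something Python's codecs module can look up" (a real
desktop would consult its own charset tables instead); the function name
is mine.

```python
import codecs


def effective_type(content_type):
    """Downgrade text/* with an unknown charset to application/octet-stream."""
    media_type, _, params = content_type.partition(";")
    media_type = media_type.strip().lower()
    if not media_type.startswith("text/"):
        return media_type
    for param in params.split(";"):
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset":
            try:
                # Assumption: the codecs registry stands in for "known".
                codecs.lookup(value.strip().strip('"'))
            except LookupError:
                return "application/octet-stream"
    return media_type


print(effective_type("text/plain; charset=I-Invented-This-Charset-This-Morning"))
print(effective_type("text/plain; charset=utf-8"))
```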

Personally, I've always been a big fan of seeing xtermxtermxtermxterm
repeated over and over again, but I understand that some people, not
realising the expressive nature of the free-form poetry that xterm
implementations seem only too ready to produce given binary data, find
it somewhat confusing.

Dave.