Shared-mime checking order

Fri Aug 24 05:49:41 PDT 2007

On Thu, 2007-08-23 at 01:27 +0200, David Faure wrote:
> On Friday 27 July 2007, Sanel Zukan wrote:
> > Thank you for replies.
> > 
> > > Yeah, I also found that too, when checking my chemical MIME types list.
> > > Seems, priorities of "50" are enough for magic patterns. Should the spec
> > > be adjusted? What do you (people in general) think about this? I mean,
> > > the spec was written to have a standardized way to handle things. That
> > > doesn't mean, that things cannot be improved :) So is it time to update
> > > the spec? I would really like to see GNOME and KDE [1] (and other like
> > > rox-filer, ...) detecting the file types with the same success (of
> > > course, there are some false positives with the way of GNOME's
> > > implementation too - so there is place for improvement :)).
> > 
> > Yes; I'm also very interested to see unified detection, even if that
> > detection for corner cases shows to be wrong.
> > [...]
> > > BCCing David Faure
> > 
> > Thanks; it would be really nice to see other implementations too :-)
> 
> The KDE 4 implementation follows the spec as much as possible, i.e. the algorithm (in KMimeType::findByUrl) is roughly
> 1) find from mode_t if set (leads to inode/*)
> 2) try high-priority (>80) magic rules for local files
> 3) try to find out by looking at the extension if any [except on protocols where extensions are unreliable like HTTP]
> 4) try low-priority magic rules for local files,
> 5) otherwise use protocol-based heuristics for some protocols (e.g. kde's "man:" is always HTML, or 
> for protocols that allow listing directories like FTP or FISH, a url which ends with '/' is an inode/directory, etc.)

Really? Since there are rules with priority > 80, this means you always
have to load the first block of the file when detecting the mimetype.
This is awfully slow, since seek times on disks are bad and are not
getting any better.
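
To make the cost concrete, here is a minimal sketch of the five-step order quoted above. It is not KDE's code: the rule table, the glob table, the `find_mime` and `read_head` names, and the `application/octet-stream` fallback are assumptions made for illustration. The threshold of 80 is the one from the quoted description. Step 5 (protocol heuristics) is left out.

```python
import os
import stat
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class MagicRule:
    priority: int
    offset: int
    pattern: bytes
    mime: str


# Hypothetical tables; a real implementation loads these from the MIME database.
MAGIC_RULES = [
    MagicRule(90, 0, b"%PDF-", "application/pdf"),
    MagicRule(50, 0, b"\x89PNG\r\n\x1a\n", "image/png"),
]
GLOBS = {".txt": "text/plain", ".png": "image/png"}
HIGH_PRIORITY = 80  # threshold from the quoted description


def match_magic(head: bytes, rules: Iterable[MagicRule]) -> Optional[str]:
    """Return the MIME type of the highest-priority rule that matches."""
    for rule in sorted(rules, key=lambda r: r.priority, reverse=True):
        if head[rule.offset:rule.offset + len(rule.pattern)] == rule.pattern:
            return rule.mime
    return None


def find_mime(name: str, mode: Optional[int] = None,
              read_head: Optional[Callable[[], bytes]] = None) -> str:
    # 1) mode_t, if known (only directories are handled in this sketch)
    if mode is not None and stat.S_ISDIR(mode):
        return "inode/directory"
    # Local files only: this read happens before the extension is consulted.
    head = read_head() if read_head is not None else None
    # 2) high-priority magic
    if head is not None:
        hit = match_magic(head, [r for r in MAGIC_RULES if r.priority > HIGH_PRIORITY])
        if hit:
            return hit
    # 3) extension
    ext = os.path.splitext(name)[1].lower()
    if ext in GLOBS:
        return GLOBS[ext]
    # 4) low-priority magic
    if head is not None:
        hit = match_magic(head, [r for r in MAGIC_RULES if r.priority <= HIGH_PRIORITY])
        if hit:
            return hit
    # 5) protocol heuristics omitted; assumed fallback type
    return "application/octet-stream"
```

In this sketch a file named report.txt whose content starts with "%PDF-" comes out as application/pdf, but only because the head of every local file is read before step 3 is reached. Passing no `read_head` skips steps 2 and 4 entirely.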

I don't think this is realistic for e.g. a file manager. It's just too
slow. A solution that mainly looks at extensions, and then sniffs only
files with "problematic" (i.e. unknown or missing) extensions could
work, but not sniffing all files.
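
A minimal sketch of that extension-first approach, assuming hypothetical `GLOBS` and `MAGIC` tables and an assumed `application/octet-stream` fallback. The file is read only when the extension is unknown or missing; `read_head` is passed in so the number of reads can be counted.

```python
import os
from typing import Callable, Optional

# Hypothetical tables for illustration only.
GLOBS = {".txt": "text/plain", ".png": "image/png", ".pdf": "application/pdf"}
MAGIC = [(b"%PDF-", "application/pdf"), (b"\x89PNG\r\n\x1a\n", "image/png")]


def sniff(head: bytes) -> Optional[str]:
    """Match the first bytes of a file against the magic table."""
    for pattern, mime in MAGIC:
        if head.startswith(pattern):
            return mime
    return None


def guess(name: str, read_head: Callable[[str], bytes]) -> str:
    ext = os.path.splitext(name)[1].lower()
    if ext in GLOBS:
        return GLOBS[ext]  # pure lookup, no I/O
    # Unknown or missing extension: only now touch the file.
    return sniff(read_head(name)) or "application/octet-stream"


def read_head_from_disk(path: str, size: int = 64) -> bytes:
    """One possible read_head: the first `size` bytes of a local file."""
    with open(path, "rb") as f:
        return f.read(size)
```

For a directory containing a.txt, b.png, README and data.xyz, only README and data.xyz would be opened.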

What sort of performance do you see for sniffing a large directory of
files?

> There's also a "fast mode" for that code to disable magic matching and only use 1), 3) and 5).

That's useful, but how do you expose these different types to the user?
They could affect the application behaviour.

Of course, at times you only have the filename available, so then you
can't do better.

> So, I like it as it is, at the moment.
> The only thing I'm missing is a "native extension" for each mimetype, i.e. which extension to
> suggest when saving with a given mimetype. I suppose I could pick the first one but order
> doesn't matter currently, and also there's the case where we shouldn't mention extensions
> for matching (see below). So I would like an explicit "preferred extension" for each mimetype
> (but if there's exactly one glob then it can explicitly be parsed as preferred extension,
> to avoid redundancy in the simple case).

That would be nice.

> I think the freedesktop xml file should make sure that this doesn't happen.
> If an extension can be used for two kinds of files (e.g. *.rpm) then the rules shouldn't
> mention the extension at all, or at least there should be high-priority magic rules to
> detect such files by magic (but I think skipping the extensions and using low-prio magic
> is a better idea, since it's more efficient in that it's less often done).
> Otherwise if we just have extensions, it's just useless, we wouldn't know which mimetype to pick from the two.

A typical example is *.pcf, which is both a font type and a Cisco VPN
description file. The latter isn't currently in the xdg shared mime db,
but we have a patch in Fedora. I don't see how leaving the extension
knowledge out of the db is any better than a conflict. It just means we
know less, but we can still make a decision in the same way.
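
A minimal sketch of how a glob conflict can be handled, under stated assumptions: the `GLOBS` and `MAGIC` tables, the byte patterns, and the name of the VPN type are all illustrative and not taken from the database. A conflicting extension narrows sniffing to its candidates; with no extension knowledge every magic rule has to be tried instead.

```python
import os
from typing import Callable

# Illustrative tables; type names and byte patterns are assumptions.
GLOBS = {
    ".pcf": ["application/x-font-pcf", "application/x-cisco-vpn-settings"],
    ".txt": ["text/plain"],
}
MAGIC = {
    "application/x-font-pcf": b"\x01fcp",
    "application/x-cisco-vpn-settings": b"[main]",
}


def resolve(name: str, read_head: Callable[[], bytes]) -> str:
    ext = os.path.splitext(name)[1].lower()
    candidates = GLOBS.get(ext, [])
    if len(candidates) == 1:
        return candidates[0]  # unambiguous extension, no I/O
    head = read_head()
    # Conflict: test only the candidates. No glob: test every known magic.
    pool = candidates or list(MAGIC)
    for mime in pool:
        pattern = MAGIC.get(mime)
        if pattern is not None and head.startswith(pattern):
            return mime
    # Nothing matched: fall back to the first candidate, or an assumed default.
    return candidates[0] if candidates else "application/octet-stream"
```

Removing the ".pcf" entry from `GLOBS` would not avoid the read. It would only widen the set of rules to test and remove the fallback candidate.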

I don't believe the efficiency argument here. An extension match is a
pure CPU thing on the order of nanoseconds. Sniffing is an I/O thing
on the order of milliseconds (or much more if your disks are busy),
especially with slow laptop drives. That's many thousands of times slower.

> > Gnome currently doesn't look at the priorities at all I believe.
> Ouch. Is it planned to change that? Non-standard behavior defeats the purpose of a standard :)

Well, we do look at it when sniffing (i.e. a higher prio magic matches
before a lower prio), but not when deciding sniffing vs extension
mapping. This is because we want to avoid the slow sniffing as much as
possible. To do the comparison we'd have to sniff to see which magic
rule (if any) matched.
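
As an illustration of priority ordering within sniffing, here is a minimal sketch, not GNOME's actual code. The rule list and its priority values are hypothetical. The point it shows is that the winning priority is only known once the file's bytes are available.

```python
from typing import NamedTuple, Optional, Sequence


class MagicRule(NamedTuple):
    priority: int
    offset: int
    pattern: bytes
    mime: str


# Hypothetical rules and priorities: a generic container and a more specific type.
RULES = [
    MagicRule(40, 0, b"PK\x03\x04", "application/zip"),
    MagicRule(70, 30, b"mimetypeapplication/vnd.oasis.opendocument.text",
              "application/vnd.oasis.opendocument.text"),
]


def best_magic(head: bytes, rules: Sequence[MagicRule]) -> Optional[MagicRule]:
    """Return the highest-priority rule matching `head`, or None."""
    matches = [
        r for r in rules
        if head[r.offset:r.offset + len(r.pattern)] == r.pattern
    ]
    return max(matches, key=lambda r: r.priority, default=None)
```

Comparing the returned priority against an extension match requires calling `best_magic`, and therefore reading the file, for every file.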