Questions regarding the shared mime spec

Mon Sep 26 01:07:45 PDT 2011

On Sun, 2011-09-25 at 09:08 +0200, David Faure wrote:
> Hi Johannes,

> [skipping a few questions about mime.cache which I don't know yet]
> > ReverseSuffixTreeNode.CHARACTER: What encoding is used? I guess UTF32?
> UTF-8, rather? I don't think this code uses UTF-32 anywhere.
> Not sure it was tested for non-ascii globs anyway.

Filenames in unix are octets with any values except 0 and '/', and do
not have a defined encoding per se. In actual use we declare some
encoding (different depending on the user) to be used in order to be
able to display the filenames, but that doesn't make the other filenames
invalid or impossible to exist.

It's possible to convert some such encoded filenames to UTF-8 and then
do matching in UTF-8 so that you could match a non-ASCII glob, but that
would be problematic in the case where you had a file with a known ASCII
suffix (say .gnumeric) but a filename that is not convertible to UTF-8,
because in that case you wouldn't even be able to do an ASCII match.

So, in practice I don't think it will ever work with non-ASCII
characters in mime extensions or globs. The
ReverseSuffixTreeNode.CHARACTER characters are binary octets to match
the filename octets, but in practice only contain ASCII characters.
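
To illustrate the point about octets, here is a minimal sketch in Python
(not xdgmime code; the function name is made up for the example). It
compares the raw bytes of the filename against an ASCII suffix and never
tries to decode the name:

```python
def suffix_matches(filename: bytes, suffix: bytes) -> bool:
    """Compare raw octets; the filename is never decoded."""
    return filename.endswith(suffix)


# A Latin-1 encoded name: the 0xE9 byte makes it invalid as UTF-8.
name = b"caf\xe9.gnumeric"

try:
    name.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False

# The name cannot be converted to UTF-8, yet the ASCII suffix still matches.
print(decodable, suffix_matches(name, b".gnumeric"))
```

Matching after a forced UTF-8 conversion would have to reject or mangle
this name before the suffix was ever looked at.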

> > [...]
> > Regarding the recommended checking order:
> > "Otherwise, start by doing a glob match of the filename."
> > In which order should LITERALs the RST and GLOBS be checked?
> > Should all 3 of those always be checked?
> > For example if a LITERAL match is found, should the RST/GLOBS still be
> > checked? (guess not)
> 
> Interestingly enough, implementations that don't use mime.cache don't 
> distinguish between these types of globs. So I think they should all be 
> checked, and then if you get more than one match, you go into the "multiple 
> globs matched" resolution (i.e. sniffing and sorting it out).

Yeah, distinguishing these is really just an optimization in the
implementation. For all intents and purposes they are all globs, just of
a specific form, as are the entries in the suffix tree.
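
As a rough sketch of what that means, assuming a plain list of patterns
rather than the mime.cache layout (the table and the function name are
illustrative only), a literal, a simple suffix and a full glob can all
go through the same glob matcher and give the same answers the three
separate lookups would:

```python
from fnmatch import fnmatchcase

# Illustrative entries only: (pattern, mime type).
PATTERNS = [
    ("Makefile", "text/x-makefile"),               # what the cache calls a literal
    ("*.txt", "text/plain"),                       # what the suffix tree would hold
    ("*.tar.gz", "application/x-compressed-tar"),  # also a suffix
    ("[0-9][0-9][0-9].vdr", "video/mpeg"),         # a full glob
]


def all_glob_matches(filename: str) -> list:
    """Return every mime type whose pattern matches, in table order."""
    return [mime for pattern, mime in PATTERNS if fnmatchcase(filename, pattern)]


print(all_glob_matches("Makefile"))
print(all_glob_matches("backup.tar.gz"))
print(all_glob_matches("001.vdr"))
```

If more than one entry comes back, that is the "multiple globs matched"
case and the usual resolution applies.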

> > "If any of the mimetypes resulting from a glob match is equal to or a
> > subclass of the result from the magic sniffing, use this as the result."
> > Should this check be done against _all_ matching GLOBS/RST entries, or
> > against the list obtained in step 2 ("only biggest weight. If the
> > patterns are different, keep only globs with the longest pattern")
> 
> If this is about that "glob conflict" resolution, then the goal is to choose 
> which of the matched globs is best. So it should be only "against the list 
> obtained by glob matching". Otherwise you get funny results.
> The extension is still a pretty good hint, we should use it, not just rely on 
> fragile magic-only.
> 
> > "Otherwise use the result of the glob match that has the highest weight."
> > What if there are multiple, different matches with same length & weight?
> > Return "application/octet-stream" or the first match?
> 
> One of the matches, e.g. the first match.
> 
> > The spec assumes there's at most one MAGIC match. What if there are
> > multiple matches? Use the one with the highest  PRIORITY? What if there
> > are multiple matches with the same PRIORITY?
> 
> Good question. In practice my code stops at the first match. I rely on the 
> fact that the magic rules are sufficiently well written. But "the one with the 
> highest priority" seems like a safer choice indeed. And then if there are two 
> with the same priority, we have no choice but to pick one.

In cases like these it's pretty common to choose the longest match. I
don't know if we do this at the moment in GNOME though.
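
A possible tie-break along those lines, sketched in Python. The
MagicMatch record and its fields are hypothetical, and this is not what
any current implementation is known to do: take the highest priority
first, then the longest matched data, then whichever came first.

```python
from typing import NamedTuple, Optional, Sequence


class MagicMatch(NamedTuple):
    mime_type: str
    priority: int        # priority of the magic rule
    matched_length: int  # number of bytes the rule actually compared


def pick_magic_match(matches: Sequence[MagicMatch]) -> Optional[MagicMatch]:
    """Highest priority wins; ties go to the longest match, then the first seen."""
    if not matches:
        return None
    # max() returns the first maximal element, so earlier entries win full ties.
    return max(matches, key=lambda m: (m.priority, m.matched_length))


candidates = [
    MagicMatch("application/x-a", 50, 4),
    MagicMatch("application/x-b", 50, 8),
    MagicMatch("application/x-c", 40, 16),
]
print(pick_magic_match(candidates).mime_type)
```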

> > Another question: why do the bug-30656-xchat.conf/menu.ini tests return
> > "application/octet-stream"? Those are text files, so if text/binary
> > guessing is used the result should be "text/plain"? Is the spec out of
> > date and binary/text guessing is obsolete?
> 
> No, binary/text guessing was never implemented in xdgmime, and that's a bug 
> indeed. It turns out that I fixed it last week, see attached diff.

For historical reasons we implement the text guess in glib when using
xdgmime. We should probably move that down into xdgmime.

> I just committed it, but I'm attaching the patch so that Alexander or Bastien 
> can review it, this is my first commit to xdgmime [which seems to still be in 
> CVS? I thought everything had moved to git?]

Weird. I have an old clone here pointing at:
ssh://alexl@git.freedesktop.org/git/mime/xdgmime.git
But that doesn't seem to exist anymore??

Anyway, some issues with the patch:

The text check should be done in xdg_mime_get_mime_type_for_data, not
just in _xdg_mime_cache_get_mime_type_for_data, in case some directory
doesn't have a cache file and we fall back on using the "magic" file.
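
Roughly what I mean, as a simplified Python sketch rather than the real
C code. All names here are made up, and the per-directory lookup is
collapsed into a single optional cache sniffer. The fallback sits in the
common entry point, so it runs whichever backend did the sniffing:

```python
from typing import Callable, Optional

Sniffer = Callable[[bytes], Optional[str]]


def mime_type_for_data(
    data: bytes,
    cache_sniff: Optional[Sniffer],  # None when there is no cache file
    magic_sniff: Sniffer,            # parser for the plain "magic" file
    is_text: Callable[[bytes], bool],
) -> str:
    """Sniff with the cache if present, else the magic file, then guess text/binary."""
    sniff = cache_sniff if cache_sniff is not None else magic_sniff
    mime = sniff(data)
    if mime is not None:
        return mime
    # The text check lives here, after either backend, not inside the cache path.
    return "text/plain" if is_text(data) else "application/octet-stream"


def no_match(data: bytes) -> Optional[str]:
    return None


def always_text(data: bytes) -> bool:
    return True


# No cache file available: the magic path still gets the text fallback.
print(mime_type_for_data(b"hello\n", None, no_match, always_text))
```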

Also, our current code in glib doesn't limit itself to the first 32
chars, but looks at all the data we have read anyway. Is there a
particular reason to stop early? Performance?
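
To show the difference the limit makes, here is a small Python sketch.
The heuristic (no control bytes other than common whitespace and
backspace) is an assumption for the example, not a copy of the glib or
xdgmime code, and the sample data is made up:

```python
from typing import Optional

# Control bytes that are tolerated in text: tab, LF, CR, FF, VT, backspace.
_ALLOWED_CONTROL = frozenset(b"\t\n\r\f\v\b")


def looks_like_text(data: bytes, limit: Optional[int] = None) -> bool:
    """True if the inspected window has no disallowed control bytes.

    limit=None inspects everything that was read; limit=32 stops early.
    """
    window = data if limit is None else data[:limit]
    return not any(
        (b < 0x20 or b == 0x7F) and b not in _ALLOWED_CONTROL for b in window
    )


# 59 bytes of plain text followed by binary data.
sample = b"[settings]\n" + b"key = value\n" * 4 + b"\x00\x01 binary tail"

print(looks_like_text(sample, limit=32))  # only sees the text prefix
print(looks_like_text(sample))            # sees the NUL byte further in
```

With the 32 byte limit a file like this is reported as text, while
checking all the data we have catches the binary part.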