Questions regarding the shared mime spec

David Faure faure at kde.org
Sun Sep 25 00:08:59 PDT 2011


Hi Johannes,

Many questions in a single email :-)

On Tuesday 09 August 2011 18:54:34 Johannes Pfau wrote:
> The LiteralList:
> Is it safe to assume that the list is consistent?
> I.e. could there be a case insensitive "Hello" entry and another,
> conflicting, case insensitive "hello" entry?
> Can the list contain two entries which are only different in casing, one
> case-sensitive, one not? I.e. can the list contain a case sensitive
> "Hello" entry and a case insensitive "hello" entry? If so should "Hello"
> match the case sensitive variant or the case insensitive one?

That would be ambiguous, so I would say we should never get into such a case.
We can either solve it with a different weight for the case-sensitive match,
or by using case sensitivity for both matches (like we do for *.C and *.c).

[skipping a few questions about mime.cache which I don't know yet]
> ReverseSuffixTreeNode.CHARACTER: What encoding is used? I guess UTF32?
UTF-8, rather? I don't think this code uses UTF-32 anywhere.
Not sure it was tested for non-ascii globs anyway.

> [...]
> Regarding the recommended checking order:
> "Otherwise, start by doing a glob match of the filename."
> In which order should LITERALs, the RST and GLOBS be checked?
> Should all 3 of those always be checked?
> For example if a LITERAL match is found, should the RST/GLOBS still be
> checked? (guess not)

Interestingly enough, implementations that don't use mime.cache don't 
distinguish between these types of globs. So I think they should all be 
checked, and then if you get more than one match, you go into the "multiple 
globs matched" resolution (i.e. sniffing and sorting it out).
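
A rough Python sketch of "check all three, then resolve", for illustration
only. The table contents, the names LITERALS/SUFFIXES/GLOBS and the function
glob_matches are made up here and are not taken from xdgmime or mime.cache;
the sketch also treats every pattern as case-insensitive, which is a
simplification.

```python
from fnmatch import fnmatchcase

# Hypothetical in-memory tables; a real implementation would load these
# from the globs files or from mime.cache.
LITERALS = {"makefile": [("text/x-makefile", 50)]}
SUFFIXES = {
    ".c": [("text/x-csrc", 50)],
    ".tar.gz": [("application/x-compressed-tar", 50)],
}
GLOBS = [("[0-9][0-9][0-9].vdr", "video/mpeg", 50)]


def glob_matches(filename):
    """Return every (mimetype, weight, pattern) that matches, from all tables."""
    name = filename.lower()  # simplification: everything is case-insensitive
    matches = []
    # Literal entries: the whole file name must be equal.
    for mime, weight in LITERALS.get(name, []):
        matches.append((mime, weight, name))
    # Simple "*.ext" patterns (what the reverse suffix tree stores).
    for suffix, entries in SUFFIXES.items():
        if name.endswith(suffix):
            matches.extend((mime, weight, "*" + suffix) for mime, weight in entries)
    # Everything else: full glob patterns.
    for pattern, mime, weight in GLOBS:
        if fnmatchcase(name, pattern):
            matches.append((mime, weight, pattern))
    return matches
```

No lookup stops early on a hit, so a caller that gets more than one entry
back can go on to the "multiple globs matched" resolution.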

> "If any of the mimetypes resulting from a glob match is equal to or a
> subclass of the result from the magic sniffing, use this as the result."
> Should this check be done against _all_ matching GLOBS/RST entries, or
> against the list obtained in step 2 ("only biggest weight. If the
> patterns are different, keep only globs with the longest pattern")

If this is about that "glob conflict" resolution, then the goal is to choose 
which of the matched globs is best. So it should be only "against the list 
obtained by glob matching". Otherwise you get funny results.
The extension is still a pretty good hint, we should use it, not just rely on 
fragile magic-only.
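
As a sketch of that rule in Python: the sniffed type is only used to pick
among the glob candidates, never to replace them. The PARENTS table and both
function names are hypothetical, and the subclass relations in it are assumed
for the example rather than copied from the shared-mime-info database.

```python
# Hypothetical subclass table: mimetype -> set of direct parent types.
PARENTS = {
    "application/msword-template": {"application/x-ole-storage"},
    "application/msword": {"application/x-ole-storage"},
}


def is_same_or_subclass(mime, ancestor):
    """True if mime equals ancestor or inherits from it, directly or not."""
    if mime == ancestor:
        return True
    seen = set()
    stack = [mime]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against cycles in a broken database
        seen.add(current)
        for parent in PARENTS.get(current, ()):
            if parent == ancestor:
                return True
            stack.append(parent)
    return False


def resolve_with_magic(glob_candidates, magic_result):
    """Pick the first glob candidate compatible with the sniffed type.

    Returns None when no candidate is compatible, so the caller can fall
    back to the highest-weight glob match.
    """
    for mime in glob_candidates:
        if is_same_or_subclass(mime, magic_result):
            return mime
    return None
```

With candidates application/msword-template and text/vnd.graphviz and a
sniffed type of application/x-ole-storage, this returns the msword-template
type instead of the generic container type.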

> "Otherwise use the result of the glob match that has the highest weight."
> What if there are multiple, different matches with same length & weight?
> Return "application/octet-stream" or the first match?

One of the matches, e.g. the first match.
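
A small Python sketch of that fallback, assuming matches are kept as
(mimetype, weight, pattern) tuples in the order they were found. The tuple
layout and the function names are assumptions made for this example.

```python
def narrow_glob_matches(matches):
    """Keep only the highest-weight matches, then only the longest patterns."""
    if not matches:
        return []
    top_weight = max(weight for _, weight, _ in matches)
    heaviest = [m for m in matches if m[1] == top_weight]
    longest = max(len(pattern) for _, _, pattern in heaviest)
    return [m for m in heaviest if len(m[2]) == longest]


def fallback_glob_result(matches):
    """Return the first remaining mimetype, or None if nothing matched."""
    narrowed = narrow_glob_matches(matches)
    return narrowed[0][0] if narrowed else None
```

When two entries tie on weight and pattern length, the result simply depends
on the order in which they were found.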

> The spec assumes there's at most one MAGIC match. What if there are
> multiple matches? Use the one with the highest PRIORITY? What if there
> are multiple matches with the same PRIORITY?

Good question. In practice my code stops at the first match. I rely on the 
fact that the magic rules are sufficiently well written. But "the one with the 
highest priority" seems like a safer choice indeed. And then if there are two 
with the same priority, we have no choice but to pick one.
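
The "highest priority, then just pick one" variant could look like this in
Python. It is a sketch only: the (priority, mimetype) tuple layout and the
name best_magic_match are assumptions, and this is not what my code
currently does (which stops at the first match).

```python
def best_magic_match(matches):
    """Return the mimetype of the highest-priority magic match, or None.

    matches is a list of (priority, mimetype) tuples in rule order.
    """
    if not matches:
        return None
    # max() keeps the first maximal element it sees, so among rules with
    # equal priority the earliest one wins.
    return max(matches, key=lambda match: match[0])[1]
```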

> I also had some issues with the test suite:
> 
> My "by-Name" implementation fails for the following files:
> test-template.dot, aportis.pdb, sqlite2.kexi, subtitle-microdvd.sub,
> simple-obj-c.m, linguist.ts, test.ogg

Yep, mine too.

> There's a common pattern with all those tests:  My implementation finds
> multiple equal matches in the tree and bails out according to the spec:
> "If a matching pattern is provided by two or more MIME types,
> applications SHOULD not rely on one of them. They are instead supposed
> to use magic data (see below)"
> 
> Example for test-template.dot:
> Tree: '[{Type: 'application/msword-template' Weight: '50' CaseSensitive:
> 'false' Flags: '[0,0,0]'},{Type: 'text/vnd.graphviz' Weight: '50'
> CaseSensitive: 'false' Flags: '[0,0,0]'}]' Expected:
> 'application/msword-template'
> The test suite always assumes the first of those results is returned?
> Which implementation is correct in this case? I see no reason why
> 'application/msword-template' should be used here, 'text/vnd.graphviz'
> has exactly the same weight, flags and matching pattern.

I completely agree. I think the test description should be rewritten with 
another syntax, in order to be able to express:
test-template.dot:
  match-by-name = application/msword-template or text/vnd.graphviz
  match-by-data = application/x-ole-storage
  match-by-file = application/msword-template

[See, a good example of why the glob-conflict resolution should choose one of 
the matched globs, instead of just returning x-ole-storage]

Such a syntax would also make the test output much clearer, rather than 110 
expected failures from the current "oxx" solution.

> Another question: why do the bug-30656-xchat.conf/menu.ini tests return
> "application/octet-stream"? Those are text files, so if text/binary
> guessing is used the result should be "text/plain"? Is the spec out of
> date and binary/text guessing is obsolete?

No, binary/text guessing was never implemented in xdgmime, and that is indeed
a bug. It turns out that I fixed it last week; see the attached diff.
I just committed it, but I'm attaching the patch so that Alexander or Bastien
can review it, since this is my first commit to xdgmime [which seems to still
be in CVS? I thought everything had moved to git?]

My patch also implements "a file with size zero and no known extension has 
mimetype application/x-zerosize", as was the intent of the x-zerosize 
mimetype, although it was never explicitly written in the spec or in that code.
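
For readers who don't want to open the patch, the fallback for a file with
no glob match and no magic match could be sketched in Python as below. This
is not the attached C patch: the function name, the sniff_len default and
the exact set of bytes treated as "control characters" are assumptions made
for this example.

```python
def fallback_type(data, sniff_len=128):
    """Guess a type for data that matched no glob and no magic rule.

    data is the beginning of the file as bytes; sniff_len is how many
    bytes to inspect (an assumed default, not a value from the patch).
    """
    if len(data) == 0:
        return "application/x-zerosize"
    allowed = {0x09, 0x0A, 0x0D}  # tab, line feed, carriage return
    for byte in data[:sniff_len]:
        # Any other ASCII control character suggests binary content.
        if byte < 0x20 and byte not in allowed:
            return "application/octet-stream"
    return "text/plain"
```

A small .conf or .ini file then comes out as text/plain rather than
application/octet-stream.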

> [skipped questions about byte order, would need further research]

-- 
David Faure, faure at kde.org, http://www.davidfaure.fr
Sponsored by Nokia to work on KDE, incl. Konqueror (http://www.konqueror.org).
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mydiff
Type: text/x-patch
Size: 3696 bytes
Desc: not available
URL: <http://lists.freedesktop.org/archives/xdg/attachments/20110925/1dc287a2/attachment.bin>

