Questions regarding the shared mime spec

Sat Oct 1 08:08:43 PDT 2011

On 25.09.2011 09:08, David Faure wrote:
> Hi Johannes,
>
> Many questions in a single email :-)
>
> On Tuesday 09 August 2011 18:54:34 Johannes Pfau wrote:
>> The LiteralList:
>> Is it safe to assume that the list is consistent?
>> I.e. could there be a case insensitive "Hello" entry and another,
>> conflicting, case insensitive "hello" entry?
>> Can the list contain two entries which are only different in casing, one
>> case-sensitive, one not? I.e. can the list contain a case sensitive
>> "Hello" entry and a case insensitive "hello" entry? If so should "Hello"
>> match the case sensitive variant or the case insensitive one?
> That would be ambiguous, so I would say we should never get into such a case.
> We can either solve it with a different weight for the case-sensitive match,
> or using case sensitivity for both matches (like we do for *.C and *.c)
>
> [skipping a few questions about mime.cache which I don't know yet]
>> ReverseSuffixTreeNode.CHARACTER: What encoding is used? I guess UTF32?
> UTF-8, rather? I don't think this code uses UTF-32 anywhere.
> Not sure it was tested for non-ascii globs anyway.
>
OK, I assumed it was UTF-32 because it's a CARD32.
>> [...]
>> Regarding the recommended checking order:
>> "Otherwise, start by doing a glob match of the filename."
>> In which order should the LITERALs, the RST and the GLOBS be checked?
>> Should all 3 of those always be checked?
>> For example if a LITERAL match is found, should the RST/GLOBS still be
>> checked? (guess not)
> Interestingly enough, implementations that don't use mime.cache don't 
> distinguish between these types of globs. So I think they should all be 
> checked, and then if you get more than one match, you go into the "multiple 
> globs matched" resolution (i.e. sniffing and sorting it out).
>
>> "If any of the mimetypes resulting from a glob match is equal to or a
>> subclass of the result from the magic sniffing, use this as the result."
>> Should this check be done against _all_ matching GLOBS/RST entries, or
>> against the list obtained in step 2 ("only biggest weight. If the
>> patterns are different, keep only globs with the longest pattern")
> If this is about that "glob conflict" resolution, then the goal is to choose 
> which of the matched globs is best. So it should be only "against the list 
> obtained by glob matching". Otherwise you get funny results.
> The extension is still a pretty good hint, we should use it, not just rely on 
> fragile magic-only.
>
>> "Otherwise use the result of the glob match that has the highest weight."
>> What if there are multiple, different matches with same length & weight?
>> Return "application/octet-stream" or the first match?
> One of the matches, e.g. the first match.
>
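A minimal sketch of the glob-conflict resolution as discussed above, in Python purely for illustration. The names `GlobMatch`, `is_subclass` and `resolve` are hypothetical, and the subclass table is made up for the example (it is not read from a real database). It keeps only the highest-weight, longest-pattern globs, prefers one that equals or is a subclass of the magic result, and otherwise falls back to the first remaining match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlobMatch:
    mimetype: str
    weight: int
    pattern: str


def is_subclass(child, parent, parents):
    """True if child equals parent or inherits from it, directly or transitively."""
    seen, stack = set(), [child]
    while stack:
        current = stack.pop()
        if current == parent:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return False


def resolve(matches, magic_type, parents):
    """Pick one mimetype from the glob matches, using the magic result as a hint."""
    if not matches:
        return magic_type  # no glob match: use the magic result (may be None)
    top = max(m.weight for m in matches)
    best = [m for m in matches if m.weight == top]
    longest = max(len(m.pattern) for m in best)
    best = [m for m in best if len(m.pattern) == longest]
    if magic_type is not None:
        for m in best:
            if is_subclass(m.mimetype, magic_type, parents):
                return m.mimetype
    return best[0].mimetype  # still tied: pick one, e.g. the first


# Assumed subclass table, for illustration only.
parents = {
    "application/msword-template": ["application/msword"],
    "application/msword": ["application/x-ole-storage"],
    "text/vnd.graphviz": ["text/plain"],
}
matches = [
    GlobMatch("application/msword-template", 50, "*.dot"),
    GlobMatch("text/vnd.graphviz", 50, "*.dot"),
]
print(resolve(matches, "application/x-ole-storage", parents))
```

With this assumed table, the magic result only decides between the two globs; it is never returned in their place.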
>> The spec assumes there's at most one MAGIC match. What if there are
>> multiple matches? Use the one with the highest  PRIORITY? What if there
>> are multiple matches with the same PRIORITY?
> Good question. In practice my code stops at the first match. I rely on the 
> fact that the magic rules are sufficiently well written. But "the one with the 
> highest priority" seems like a safer choice indeed. And then if there are two 
> with the same priority, we have no choice but to pick one.
>
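The "highest priority, then just pick one" rule could look roughly like this. It is only a sketch: `pick_magic_match` and the `(priority, mimetype)` tuples are hypothetical, not xdgmime's data structures.

```python
def pick_magic_match(matches):
    """matches: list of (priority, mimetype) tuples, in the order the rules matched."""
    if not matches:
        return None
    # max() returns the first maximal item, so equal priorities
    # are resolved in favour of the earliest matching rule.
    return max(matches, key=lambda m: m[0])[1]
```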
>> I also had some issues with the test suite:
>>
>> My "by-Name" implementation fails for the following files:
>> test-template.dot, aportis.pdb, sqlite2.kexi, subtitle-microdvd.sub,
>> simple-obj-c.m, linguist.ts, test.ogg
> Yep, mine too.
>
>> There's a common pattern with all those tests:  My implementation finds
>> multiple equal matches in the tree and bails out according to the spec:
>> "If a matching pattern is provided by two or more MIME types,
>> applications SHOULD not rely on one of them. They are instead supposed
>> to use magic data (see below)"
>>
>> Example for test-template.dot:
>> Tree: '[{Type: 'application/msword-template' Weight: '50' CaseSensitive:
>> 'false' Flags: '[0,0,0]'},{Type: 'text/vnd.graphviz' Weight: '50'
>> CaseSensitive: 'false' Flags: '[0,0,0]'}]' Expected:
>> 'application/msword-template'
>> The test suite always assumes the first of those results is returned?
>> Which implementation is correct in this case? I see no reason why
>> 'application/msword-template' should be used here, 'text/vnd.graphviz'
>> has exactly the same weight, flags and matching pattern.
> I completely agree. I think the test description should be rewritten with 
> another syntax, in order to be able to express:
> test-template.dot:
>   match-by-name = application/msword-template or text/vnd.graphviz
>   match-by-data = application/x-ole-storage
>   match-by-file = application/msword-template
Maybe the test suite has a different understanding of match-by-name than
I do: my library allows matching explicitly by filename only (useful when
a file doesn't exist yet). Shouldn't match-by-name signal an error or
return an empty string in this case, since the result is ambiguous? In
the match-by-file implementation, obviously both results from
match-by-name should be verified with magic checking, as the spec says.
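
A small sketch of that by-name behaviour, in Python purely for illustration. `match_by_name` is a hypothetical name, and its input is assumed to be the list of mimetypes left after weight and pattern-length filtering.

```python
def match_by_name(candidates):
    """candidates: mimetypes of the equally ranked glob matches for one filename."""
    distinct = list(dict.fromkeys(candidates))  # de-duplicate, keeping order
    if len(distinct) == 1:
        return distinct[0]
    return ""  # no match at all, or an ambiguous one
```

A by-file lookup would keep the full candidate list and let magic checking decide between the entries.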
> [See, a good example of why the glob-conflict resolution should choose one of 
> the matched globs, instead of just returning x-ole-storage]
>
> Such a syntax would also make the test output much clearer, rather than 110 
> expected failures from the current "oxx" solution.
>
>> Another question: why do the bug-30656-xchat.conf/menu.ini tests return
>> "application/octet-stream"? Those are text files, so if text/binary
>> guessing is used the result should be "text/plain"? Is the spec out of
>> date and binary/text guessing is obsolete?
> No, binary/text guessing was never implemented in xdgmime, and that's a bug 
> indeed. It turns out that I fixed it last week, see attached diff.
> I just committed it, but I'm attaching the patch so that Alexander or Bastien 
> can review it, this is my first commit to xdgmime [which seems to still be in 
> CVS? I thought everything had moved to git?]
Ok, thanks. I also check for UTF BOMs in my isTextFile implementation.
Do you think that's a useful addition?
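
For reference, a rough sketch of a text check with a BOM test in front of it, in Python for illustration only. The function names, the 128-byte sample size and the set of allowed control characters are assumptions, not taken from the spec or from xdgmime.

```python
import codecs

_BOMS = (
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF32_BE,
)


def has_unicode_bom(data: bytes) -> bool:
    """True if the data starts with a UTF-8, UTF-16 or UTF-32 byte order mark."""
    return data.startswith(_BOMS)


def is_text_file(data: bytes, sample_size: int = 128) -> bool:
    """Guess whether data is text: a BOM, or no control characters in the sample."""
    if has_unicode_bom(data):
        return True
    sample = data[:sample_size]
    # Tab, LF, FF and CR are allowed; any other byte below 0x20 suggests binary.
    return not any(b < 0x20 and b not in b"\t\n\f\r" for b in sample)
```

The BOM test matters mainly for UTF-16 and UTF-32 text, whose NUL bytes would otherwise make the control-character check report binary.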
> My patch also implements "a file with size zero and no known extension has 
> mimetype application/x-zerosize", as was the intent of the x-zerosize 
> mimetype, but it was never explicitly written in the spec or in that code.
>
Good to know, I now implemented this as well :-)
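
Roughly, the zero-size rule as described above (a sketch; `type_for_empty_file` is a hypothetical helper, and "no known extension" is assumed to mean that no glob matched):

```python
def type_for_empty_file(size, glob_type):
    """glob_type: mimetype from filename matching, or None if no glob matched."""
    if glob_type is not None:
        return glob_type  # a known extension wins, even for an empty file
    if size == 0:
        return "application/x-zerosize"
    return None  # caller continues with magic sniffing and text/binary guessing
```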
>> [skipped questions about byte order, would need further research]
Thanks for your answers so far, they are very useful.

I have some more questions though ;-)

Are mime types always stored lowercase in mime.cache? AFAIK mime types
are case insensitive, but it seems they're all lowercase in mime.cache,
which makes the code simpler.

When checking multiple mime.cache files: should I check them one after
another, or should I first do glob matches in all databases, then magic
matches in all databases, etc.?

And one last (off-topic) question: Windows and OS X don't have similar
APIs/databases, right? Windows seems to have a simple
extension-->mimetype mapping in the registry, but OS X doesn't have a
public mime type API at all?

-- 
Johannes Pfau