Questions regarding the shared mime spec

Tue Aug 9 09:54:34 PDT 2011

Hi,
I'm writing a MIME implementation for the D programming language. As I
want to eventually submit it for inclusion in the D standard library,
it must be Boost-licensed, and I therefore can't look at existing
implementations. This means I can only use the specification and the
MIME test database to verify the correctness of my implementation.

While reading the specification, a few questions came up:

The LiteralList:
Is it safe to assume that the list is consistent?
E.g. could there be a case-insensitive "Hello" entry and another,
conflicting, case-insensitive "hello" entry?
Can the list contain two entries that differ only in casing, one
case-sensitive and one not? E.g. can the list contain a case-sensitive
"Hello" entry and a case-insensitive "hello" entry? If so, should "Hello"
match the case-sensitive variant or the case-insensitive one?

The Reverse Suffix Tree:
Docs are very sparse here, took me some time to figure out what a suffix
tree is and how to search it ;-)

Can the RST tree contain literals, or are those guaranteed to be in the
LITERALS list?

ReverseSuffixTreeNode.CHARACTER: What encoding is used? I guess UTF32?
Can all characters be matched literally? Or could the tree contain
special glob characters like '*'? (guess no special characters)
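
To make the RST questions concrete, here is a minimal Python sketch of how
such a lookup could work. It is only an illustration under assumptions: the
tree is a plain in-memory dict rather than the binary cache layout, all
characters are matched literally, and the names build_tree and lookup as
well as the two example suffixes are hypothetical.

```python
def build_tree(suffixes):
    """Build a reverse suffix tree from (suffix, mimetype) pairs.

    Each suffix is inserted back to front, so ".gz" and ".tar.gz"
    share the path 'z' -> 'g' -> '.'.
    """
    root = {"children": {}, "types": []}
    for suffix, mimetype in suffixes:
        node = root
        for ch in reversed(suffix):
            node = node["children"].setdefault(
                ch, {"children": {}, "types": []}
            )
        node["types"].append(mimetype)
    return root


def lookup(root, filename):
    """Return the MIME types attached to the longest matching suffix."""
    node = root
    best = []
    # Walk the filename from its last character towards the first.
    for ch in reversed(filename):
        node = node["children"].get(ch)
        if node is None:
            break
        if node["types"]:
            # A deeper node means a longer suffix, so it replaces
            # whatever was found earlier.
            best = list(node["types"])
    return best


tree = build_tree([
    (".gz", "application/gzip"),
    (".tar.gz", "application/x-compressed-tar"),
])
print(lookup(tree, "backup.tar.gz"))
print(lookup(tree, "notes.gz"))
```

In this sketch every character is compared literally, which is exactly the
assumption the question above asks about.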

MagicList.MAX_EXTENT
I finally found out what that's meant to be, but one sentence to explain
the meaning of this field wouldn't hurt.

The whole Magic/Match/Matchlet stuff could use some documentation.

Regarding the recommended checking order:
"Otherwise, start by doing a glob match of the filename."
In which order should the LITERALS, the RST and the GLOBS be checked?
Should all 3 of those always be checked?
For example if a LITERAL match is found, should the RST/GLOBS still be
checked? (guess not)

"If any of the mimetypes resulting from a glob match is equal to or a
subclass of the result from the magic sniffing, use this as the result."
Should this check be done against _all_ matching GLOBS/RST entries, or
only against the list obtained in step 2 ("only biggest weight. If the
patterns are different, keep only globs with the longest pattern")?

"Otherwise use the result of the glob match that has the highest weight."
What if there are multiple, different matches with same length & weight?
Return "application/octet-stream" or the first match?

The spec assumes there's at most one MAGIC match. What if there are
multiple matches? Use the one with the highest PRIORITY? What if there
are multiple matches with the same PRIORITY?

I also had some issues with the test suite:

My "by-Name" implementation fails for the following files:
test-template.dot, aportis.pdb, sqlite2.kexi, subtitle-microdvd.sub,
simple-obj-c.m, linguist.ts, test.ogg

There's a common pattern in all those tests: my implementation finds
multiple equal matches in the tree and bails out according to the spec:
"If a matching pattern is provided by two or more MIME types,
applications SHOULD not rely on one of them. They are instead supposed
to use magic data (see below)"

Example for test-template.dot:
Tree: '[{Type: 'application/msword-template' Weight: '50' CaseSensitive:
'false' Flags: '[0,0,0]'},{Type: 'text/vnd.graphviz' Weight: '50'
CaseSensitive: 'false' Flags: '[0,0,0]'}]' Expected:
'application/msword-template'
Does the test suite always assume that the first of those results is returned?
Which implementation is correct in this case? I see no reason why
'application/msword-template' should be used here, 'text/vnd.graphviz'
has exactly the same weight, flags and matching pattern.
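
The behaviour described above can be sketched in Python as follows. This is
one possible reading of the spec, not a definitive implementation: the names
GlobMatch and resolve are hypothetical, and the pattern "*.dot" is an
assumption, since the dump above only shows type, weight and flags.

```python
from typing import List, NamedTuple, Optional


class GlobMatch(NamedTuple):
    mimetype: str
    weight: int
    pattern: str


def resolve(matches: List[GlobMatch]) -> Optional[str]:
    """Keep the highest weight, then the longest pattern.

    Returns None when two or more different MIME types are still
    tied, i.e. the case where the caller is supposed to fall back
    to magic data.
    """
    if not matches:
        return None
    top_weight = max(m.weight for m in matches)
    best = [m for m in matches if m.weight == top_weight]
    longest = max(len(m.pattern) for m in best)
    best = [m for m in best if len(m.pattern) == longest]
    types = {m.mimetype for m in best}
    return best[0].mimetype if len(types) == 1 else None


dot_matches = [
    GlobMatch("application/msword-template", 50, "*.dot"),  # assumed pattern
    GlobMatch("text/vnd.graphviz", 50, "*.dot"),            # assumed pattern
]
print(resolve(dot_matches))  # None: the two types are tied
```

Returning the first entry instead of None would give the result the test
suite expects, but nothing in the two entries distinguishes them.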

Another question: why do the bug-30656-xchat.conf/menu.ini tests return
"application/octet-stream"? Those are text files, so if text/binary
guessing is used, shouldn't the result be "text/plain"? Is the spec out of
date, and is binary/text guessing obsolete?

I also have issues understanding the test.jks MAGIC test:
The magic value in the freedesktop.org.xml is "0xfeedfeed" ==> [254,
237, 254, 237]  type host32.
My implementation reads that value from the cache file. I test on
x86-->LittleEndian. WORD_SIZE is 4, so I change the magic value as
indicated by the specs: [237, 254, 237, 254]
The check fails, however, because test.jks starts with [254, 237, 254, 237].
What's wrong here? I'm pretty sure I'm supposed to byteswap VALUE.
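
To make the problem concrete, here is a minimal Python sketch of the swap
described above. The function name swap_words is hypothetical, and the byte
values are the ones quoted in this mail, not read from the actual cache file.

```python
def swap_words(value: bytes, word_size: int) -> bytes:
    """Reverse the byte order inside each word_size-sized word."""
    if word_size <= 1:
        return value
    if len(value) % word_size:
        raise ValueError("value length is not a multiple of word_size")
    return b"".join(
        value[i:i + word_size][::-1]
        for i in range(0, len(value), word_size)
    )


cached_value = bytes([254, 237, 254, 237])  # 0xfeedfeed as read from the cache
file_start = bytes([254, 237, 254, 237])    # first four bytes of test.jks

swapped = swap_words(cached_value, 4)
print(list(swapped))               # [237, 254, 237, 254]
print(swapped == file_start)       # False: the swapped value does not match
print(cached_value == file_start)  # True: the unswapped value does
```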

I was also surprised that none of the other host32 magic tests failed.
It turns out all those tests are completely independent of byte swapping:

The application/x-java-jce-keystore magic value "0xcececece" is exactly
the same if swapped or not. (BTW: why is this then marked host32?
Doesn't this cause unnecessary byte swapping?)

The application/vnd.tcpdump.pcap value is similar:
The xml file contains these host32 values: "0xa1b2c3d4" and
"0xd4c3b2a1". Of course, one of these will match in any case. Again, why
aren't those both stored as big32/little32 to avoid the byte swapping at
runtime?
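
The observation about these values can be checked with a few lines of
Python. This is just a sketch; swap32 is a hypothetical helper name.

```python
def swap32(value: int) -> int:
    """Reverse the byte order of a 32-bit unsigned integer."""
    return int.from_bytes(value.to_bytes(4, "big"), "little")


# 0xcececece consists of four identical bytes, so swapping changes nothing.
print(swap32(0xCECECECE) == 0xCECECECE)  # True

# The two pcap values are byte swaps of each other, so one of them
# matches regardless of whether the swap is applied.
print(swap32(0xA1B2C3D4) == 0xD4C3B2A1)  # True

# 0xfeedfeed is the only value of the three that changes when swapped.
print(hex(swap32(0xFEEDFEED)))           # 0xedfeedfe
```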

-- 
Johannes Pfau
