<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html> <head> <meta http-equiv="content-type" content="text/html; charset=ISO-8859-15"> </head> <body text="#000000" bgcolor="#ffffff">
Hi,

I'm writing a MIME implementation for the D programming language. As I eventually want to submit it for inclusion in the D standard library, it must be Boost-licensed, and I therefore can't look at existing implementations. This means I can only use the specification and the MIME test database to verify the correctness of my implementation. While reading the specification, some questions came up:

The LiteralList:
Is it safe to assume that the list is consistent? I.e. could there be a case-insensitive "Hello" entry and another, conflicting, case-insensitive "hello" entry?
Can the list contain two entries which differ only in casing, one case-sensitive and one not? I.e. can the list contain a case-sensitive "Hello" entry and a case-insensitive "hello" entry? If so, should "Hello" match the case-sensitive variant or the case-insensitive one?

The Reverse Suffix Tree:
The docs are very sparse here; it took me some time to figure out what a suffix tree is and how to search it ;-)
Can the RST contain literals, or are those guaranteed to be in the LITERALS list?

ReverseSuffixTreeNode.CHARACTER:
What encoding is used? I guess UTF-32?
Can all characters be matched literally, or could the tree contain special glob characters like '*'? (I guess there are no special characters.)

MagicList.MAX_EXTENT:
I finally found out what this is meant to be, but one sentence explaining the meaning of this field wouldn't hurt. The whole Magic/Match/Matchlet part could use some documentation.

Regarding the recommended checking order:
"Otherwise, start by doing a glob match of the filename."
In which order should LITERALS, the RST and GLOBS be checked? Should all three always be checked? For example, if a LITERAL match is found, should the RST/GLOBS still be checked?
(guess not)

"If any of the mimetypes resulting from a glob match is equal to or a subclass of the result from the magic sniffing, use this as the result."
Should this check be done against _all_ matching GLOBS/RST entries, or only against the list obtained in step 2 ("only biggest weight. If the patterns are different, keep only globs with the longest pattern")?

"Otherwise use the result of the glob match that has the highest weight."
What if there are multiple, different matches with the same length and weight? Should I return "application/octet-stream" or the first match?

The spec assumes there's at most one MAGIC match. What if there are multiple matches? Use the one with the highest PRIORITY? What if there are multiple matches with the same PRIORITY?

I also had some issues with the test suite:
My "by-Name" implementation fails for the following files:
test-template.dot, aportis.pdb, sqlite2.kexi, subtitle-microdvd.sub, simple-obj-c.m, linguist.ts, test.ogg

All of those tests share a common pattern: my implementation finds multiple equal matches in the tree and bails out according to the spec:
"If a matching pattern is provided by two or more MIME types, applications SHOULD not rely on one of them. They are instead supposed to use magic data (see below)"

Example for test-template.dot:
Tree: '[{Type: 'application/msword-template' Weight: '50' CaseSensitive: 'false' Flags: '[0,0,0]'},{Type: 'text/vnd.graphviz' Weight: '50' CaseSensitive: 'false' Flags: '[0,0,0]'}]'
Expected: 'application/msword-template'

Does the test suite always assume that the first of those results is returned? Which implementation is correct in this case?
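The tie can be reproduced with a minimal sketch of the selection step as quoted above (keep the highest weight, then the longest pattern). It is written in Python rather than D, purely for illustration. The GlobMatch record, both function names and the "*.dot" pattern are assumptions made for this example; only the two MIME types and the weight of 50 come from the tree dump above.

```python
from collections import namedtuple

# Hypothetical record for one glob hit; the field names are illustrative only.
GlobMatch = namedtuple("GlobMatch", ["mime_type", "weight", "pattern"])


def best_glob_matches(matches):
    """Keep only the hits with the highest weight, then the longest pattern."""
    if not matches:
        return []
    top_weight = max(m.weight for m in matches)
    heaviest = [m for m in matches if m.weight == top_weight]
    top_length = max(len(m.pattern) for m in heaviest)
    return [m for m in heaviest if len(m.pattern) == top_length]


def resolve_by_name(matches):
    """Return the single winning MIME type, or None if the name is ambiguous."""
    types = []
    for m in best_glob_matches(matches):
        if m.mime_type not in types:
            types.append(m.mime_type)
    # More than one distinct type left: do not pick one, let the caller
    # fall back to magic data instead.
    return types[0] if len(types) == 1 else None


# The two entries found for test-template.dot; "*.dot" is an assumed pattern.
dot_matches = [
    GlobMatch("application/msword-template", 50, "*.dot"),
    GlobMatch("text/vnd.graphviz", 50, "*.dot"),
]
print(resolve_by_name(dot_matches))
```

With these two entries the sketch reports an ambiguous result instead of picking the first entry, which is the behaviour the test suite rejects.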
I see no reason why 'application/msword-template' should be used here; 'text/vnd.graphviz' has exactly the same weight, flags and matching pattern.

Another question: why do the bug-30656-xchat.conf/menu.ini tests return "application/octet-stream"? Those are text files, so if text/binary guessing is used, the result should be "text/plain"? Is the spec out of date, and is binary/text guessing obsolete?

I also have trouble understanding the test.jks MAGIC test:
The magic value in freedesktop.org.xml is "0xfeedfeed" ==> [254, 237, 254, 237], type host32. My implementation reads that value from the cache file. I test on x86, i.e. little-endian. WORD_SIZE is 4, so I byte-swap the magic value as indicated by the spec: [237, 254, 237, 254]. The check, however, fails, as test.jks starts with [254, 237, 254, 237]. What's wrong here? I'm pretty sure I'm supposed to byte-swap VALUE.

I was also surprised that none of the other host32 magic tests failed. It turns out all those tests are completely independent of byte swapping:
The application/x-java-jce-keystore magic value "0xcececece" is exactly the same whether it is swapped or not. (BTW: why is this marked host32 then? Doesn't this cause unnecessary byte swapping?)
The application/vnd.tcpdump.pcap value is similar: the XML file contains these host32 values: "0xa1b2c3d4" and "0xd4c3b2a1". Of course, one of these will match in any case. Again, why aren't both stored as big32/little32 to avoid the byte swapping at runtime?

<pre class="moz-signature" cols="72">-- Johannes Pfau</pre> </body> </html>