Questions regarding the shared mime spec

Sat Oct 1 08:17:58 PDT 2011

On 26.09.2011 10:07, Alexander Larsson wrote:
> On Sun, 2011-09-25 at 09:08 +0200, David Faure wrote:
>> Hi Johannes,
>> [skipping a few questions about mime.cache which I don't know yet]
>>> ReverseSuffixTreeNode.CHARACTER: What encoding is used? I guess UTF32?
>> UTF-8, rather? I don't think this code uses UTF-32 anywhere.
>> Not sure it was tested for non-ascii globs anyway.
> Filenames in unix are octets with any values except 0 and '/', and do
> not have a defined encoding per se. In actual use we declare some
> encoding (different depending on the user) to be used in order to be
> able to display the filenames, but that doesn't make the other filenames
> invalid or impossible to exist.
>
> It's possible to convert some such encoded files to utf8 and then do
> matching in utf8 so that you could match a non-ascii glob, but that
> would be problematic in the case where you had a file with a known ascii
> suffix (say .gnumeric) but a filename that is not convertible to UTF8,
> because in that case you wouldn't even be able to do an ascii match.
>
Never thought that could be an issue. Filenames in D are always in UTF8,
so I might have to file a bug report there.
> So, in practice I don't think it will ever work with non-ascii
> characters in mime extensions or globs. So, the
> ReverseSuffixTreeNode.CHARACTER characters are binary octets to match
> the filename octets, but in practice only contain ascii characters.
>
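To illustrate the point about octets, here is a minimal Python sketch. The helper names (suffix_matches, decodes_as_utf8) and the sample filename are made up for illustration and are not taken from xdgmime or any other implementation. It only shows that a suffix comparison on raw bytes still works for a name that cannot be decoded as UTF-8.

```python
def suffix_matches(filename: bytes, suffix: bytes) -> bool:
    """Compare raw octets; nothing is decoded, so any filename works."""
    return filename.endswith(suffix)


def decodes_as_utf8(name: bytes) -> bool:
    """Report whether the octets happen to form valid UTF-8."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical filename: a Latin-1 encoded 'e acute' (0xE9) followed by an
# ASCII suffix. The octet sequence is not valid UTF-8.
raw_name = b"caf\xe9.gnumeric"

if __name__ == "__main__":
    print(decodes_as_utf8(raw_name))               # conversion would fail
    print(suffix_matches(raw_name, b".gnumeric"))  # octet match still works
```

A matcher that insists on converting the name to UTF-8 first would have to give up on this file, while the octet comparison does not care about the encoding at all.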
>>> [...]
>>> Regarding the recommended checking order:
>>> "Otherwise, start by doing a glob match of the filename."
>>> In which order should the LITERALs, the RST and the GLOBS be checked?
>>> Should all 3 of those always be checked?
>>> For example if a LITERAL match is found, should the RST/GLOBS still be
>>> checked? (guess not)
>> Interestingly enough, implementations that don't use mime.cache don't 
>> distinguish between these types of globs. So I think they should all be 
>> checked, and then if you get more than one match, you go into the "multiple 
>> globs matched" resolution (i.e. sniffing and sorting it out).
> Yeah, distinguishing these is merely an optimization in the
> implementation really. For all intents and purposes they are globs of a
> specific form, as are the entries in the suffix tree.
>
>>> "If any of the mimetypes resulting from a glob match is equal to or a
>>> subclass of the result from the magic sniffing, use this as the result."
>>> Should this check be done against _all_ matching GLOBS/RST entries, or
>>> against the list obtained in step 2 ("only biggest weight. If the
>>> patterns are different, keep only globs with the longest pattern")
>> If this is about that "glob conflict" resolution, then the goal is to choose 
>> which of the matched globs is best. So it should be only "against the list 
>> obtained by glob matching". Otherwise you get funny results.
>> The extension is still a pretty good hint, we should use it, not just rely on 
>> fragile magic-only.
>>
>>> "Otherwise use the result of the glob match that has the highest weight."
>>> What if there are multiple, different matches with same length & weight?
>>> Return "application/octet-stream" or the first match?
>> One of the matches, e.g. the first match.
>>
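The conflict resolution discussed above could be sketched roughly as follows. This is one reading of the quoted spec text and of the answers in this thread, not a copy of any existing implementation. GlobMatch, resolve_globs and the is_subclass callback are hypothetical names, and the sketch assumes the subclass check runs against the already filtered list (biggest weight, then longest pattern).

```python
from typing import Callable, List, NamedTuple, Optional


class GlobMatch(NamedTuple):
    mime: str      # MIME type the glob maps to
    weight: int    # glob weight
    pattern: str   # the glob pattern itself, e.g. "*.tar.gz"


def resolve_globs(
    matches: List[GlobMatch],
    sniffed: Optional[str],
    is_subclass: Callable[[str, str], bool],
) -> Optional[str]:
    """Pick one MIME type from LITERAL/RST/GLOB matches, all treated alike."""
    if not matches:
        return sniffed  # no glob match: fall back to the magic result

    # Keep only the biggest weight, then only the longest pattern.
    top = max(m.weight for m in matches)
    best = [m for m in matches if m.weight == top]
    longest = max(len(m.pattern) for m in best)
    best = [m for m in best if len(m.pattern) == longest]

    # All remaining globs agree: done, no sniffing result needed.
    if len({m.mime for m in best}) == 1:
        return best[0].mime

    # Conflict: prefer a glob result equal to, or a subclass of, the
    # sniffed type (assumption: checked against the filtered list only).
    if sniffed is not None:
        for m in best:
            if m.mime == sniffed or is_subclass(m.mime, sniffed):
                return m.mime

    # Still ambiguous: pick one of the matches, here simply the first.
    return best[0].mime
```

Returning the first remaining match in the ambiguous case follows the "one of the matches, e.g. the first match" answer above; the order of the input list then decides the result.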
>>> The spec assumes there's at most one MAGIC match. What if there are
>>> multiple matches? Use the one with the highest PRIORITY? What if there
>>> are multiple matches with the same PRIORITY?
>> Good question. In practice my code stops at the first match. I rely on the 
>> fact that the magic rules are sufficiently well written. But "the one with the 
>> highest priority" seems like a safer choice indeed. And then if there are two 
>> with the same priority, we have no choice but to pick one.
> In cases like these it's pretty common to choose the longest match. I
> don't know if we do this at the moment in gnome though.
>
What does 'longest match' mean in this case? The match where the most
bytes matched?
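If 'longest match' does mean the match covering the most bytes, which is only a guess and not confirmed anywhere in this thread, the selection among several MAGIC matches could look like the sketch below. MagicMatch, its matched_bytes field and pick_magic are hypothetical names invented for this example.

```python
from typing import List, NamedTuple, Optional


class MagicMatch(NamedTuple):
    mime: str           # MIME type of the magic rule
    priority: int       # PRIORITY of the rule
    matched_bytes: int  # assumed: number of data bytes the rule compared


def pick_magic(matches: List[MagicMatch]) -> Optional[str]:
    """Highest priority wins; ties go to the match covering more bytes."""
    if not matches:
        return None
    # max() returns the first of several equal maxima, so a full tie
    # falls back to "pick one", namely the earliest match in the list.
    best = max(matches, key=lambda m: (m.priority, m.matched_bytes))
    return best.mime
```

With two matches of equal priority and equal length there is still no principled choice, so the sketch simply keeps the earliest one.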
>>> Another question: why do the bug-30656-xchat.conf/menu.ini tests return
>>> "application/octet-stream"? Those are text files, so if text/binary
>>> guessing is used the result should be "text/plain"? Is the spec out of
>>> date and binary/text guessing is obsolete?
>> No, binary/text guessing was never implemented in xdgmime, and that's a bug 
>> indeed. It turns out that I fixed it last week, see attached diff.
> For historical reasons we implement the text guess in glib when using
> xdgmime, we should probably move that down to xdgmime.
>
>> I just committed it, but I'm attaching the patch so that Alexander or Bastien 
>> can review it, this is my first commit to xdgmime [which seems to still be in 
>> CVS? I thought everything had moved to git?]
> Weird. I have an old clone here pointing at:
> ssh://alexl@git.freedesktop.org/git/mime/xdgmime.git
> But that doesn't seem to exist anymore??
>
> Anyway, some issues with the patch:
>
> The text check should be done in xdg_mime_get_mime_type_for_data, not
> just  _xdg_mime_cache_get_mime_type_for_data, in case some directory
> doesn't have a cache file and we fall back on using the "magic" file.
>
> Also, our current code in glib doesn't limit itself to the first 32
> chars, but looks at all the data we have read anyway. Is there a
> particular reason to stop early? Performance?
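For reference, the text/binary guess under discussion can be sketched like this. It is not the xdgmime patch and not the glib code; looks_like_text and fallback_type are hypothetical names, and the set of control characters tolerated in text is an assumption. The limit parameter exists only to show the difference between inspecting the first 32 bytes and inspecting everything that was read.

```python
from typing import Optional

# Assumption: tab, line feed, form feed and carriage return count as text.
ALLOWED_CONTROL = frozenset({0x09, 0x0A, 0x0C, 0x0D})


def looks_like_text(data: bytes, limit: Optional[int] = None) -> bool:
    """True if the inspected bytes contain no other ASCII control chars."""
    sample = data if limit is None else data[:limit]
    # Bytes with the high bit set pass, since they occur in UTF-8 text.
    # Empty data also passes, which is an arbitrary choice in this sketch.
    return all(b >= 0x20 or b in ALLOWED_CONTROL for b in sample)


def fallback_type(data: bytes, limit: Optional[int] = None) -> str:
    """Default type to use when neither globs nor magic gave a result."""
    if looks_like_text(data, limit):
        return "text/plain"
    return "application/octet-stream"
```

A file whose first control character sits at offset 32 is classified as text/plain with limit=32 but as application/octet-stream when all available data is inspected, so the cutoff changes results and is not purely a performance detail.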


-- 
Johannes Pfau