Case insensitive mimetype matching edge case
alexl at redhat.com
Fri Aug 21 01:29:50 PDT 2009
On Wed, 2009-08-19 at 21:53 +0200, David Faure wrote:
> On Wednesday 19 August 2009, Alexander Larsson wrote:
> > On Wed, 2009-08-19 at 10:02 +0200, David Faure wrote:
> > > On Wednesday 19 August 2009, Alexander Larsson wrote:
> > > > Ugh. Additionally we have to extend the mime.cache format more. Maybe
> > > > we can solve this with a hack. What about this:
> > > >
> > > > All case insensitive globs are converted to lower case in the globs
> > > > file. Glob lookup is done by first matching the real filename against
> > > > the globs, then (on failure) convert the name to lower case and try
> > > > again. This will result in a case insensitive match except for things
> > > > marked as case sensitive that have at least one uppercase character.
> > > >
> > > > We can't do case-sensitive matching of only-lowercase globs, but we
> > > > don't currently have any example of this in the databases.
> > >
> > > But I do want to do one of those, to solve bug 22634: I want
> > > <glob pattern="core"/> to be case-sensitive="true".
> > >
> > > How about a different hack:
> > > we generate in globs2 two lines, in case of case-sensitive:
> > > 50:text/x-c++src:*.C
> > > 50:text/x-c++src:*.C:cs
> > > Old parsers will create an entry for "*.C:cs", which will probably never
> > > match any real file, so no big deal, while new parsers will take the
> > > second line as an indication that the *.C glob (parsed one line above)
> > > should be understood to be case sensitive.
> > Hmmm. I like this one. Sounds good to me. But let's make it extensible
> > when we're doing it, i.e. have a comma-separated list of flags with
> > "cs" being one known one. Unknown flags are ignored, anything after
> > another : is ignored.
> Good idea.
> I made the changes in the spec, in the definition of the two mimetypes,
> and in update-mime-database.c (for parsing, and globs2 generation).
> Please find patch attached (I can commit if you're ok with it).
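To make the flag field discussed above concrete, here is a rough sketch of
how a parser might read one globs2 line. It assumes a single line of the
form weight:mimetype:pattern[:flags[:ignored]]; the struct and function
names (glob_entry, parse_globs2_line) and the fixed buffer sizes are made
up for illustration and are not taken from update-mime-database.c or
xdgmime.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory form of one globs2 line. */
struct glob_entry {
    int  weight;
    char mime[64];
    char pattern[64];
    bool case_sensitive;
};

/* Parse "weight:mime:pattern[:flag,flag,...[:ignored]]".
 * Unknown flags are skipped; anything after a fourth ':' is dropped.
 * Returns true on success. */
static bool parse_globs2_line(const char *line, struct glob_entry *out)
{
    char buf[256];
    char *fields[4] = { NULL, NULL, NULL, NULL };
    size_t n = 0;

    if (strlen(line) >= sizeof buf)
        return false;
    strcpy(buf, line);
    buf[strcspn(buf, "\n")] = '\0';      /* strip a trailing newline */

    /* Split on ':' into at most four fields. */
    char *p = buf;
    while (n < 4) {
        fields[n++] = p;
        char *colon = strchr(p, ':');
        if (!colon)
            break;
        *colon = '\0';
        p = colon + 1;
    }
    if (n < 3)
        return false;
    if (strlen(fields[1]) >= sizeof out->mime ||
        strlen(fields[2]) >= sizeof out->pattern)
        return false;

    out->weight = atoi(fields[0]);
    strcpy(out->mime, fields[1]);
    strcpy(out->pattern, fields[2]);
    out->case_sensitive = false;

    /* Comma-separated flags; "cs" is the only one known here. */
    if (n == 4) {
        for (char *f = strtok(fields[3], ","); f; f = strtok(NULL, ","))
            if (strcmp(f, "cs") == 0)
                out->case_sensitive = true;
    }
    return true;
}
```

With this sketch, "50:text/x-c++src:*.C:cs" yields a case-sensitive entry,
while a line without a fourth field stays case insensitive.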
We must also mention that if a case-sensitive glob matches, it takes
priority over any case-insensitive match; otherwise telling *.c apart
from *.C will not work.
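A minimal sketch of that priority rule, assuming case-insensitive patterns
are stored in lower case and ignoring glob weights entirely. The glob_rule
struct and lookup_glob function are hypothetical names for illustration,
not xdgmime API; fnmatch() is the POSIX one.

```c
#include <ctype.h>
#include <fnmatch.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical rule table entry. */
struct glob_rule {
    const char *pattern;   /* lower case unless case_sensitive is set */
    const char *mime;
    bool case_sensitive;
};

/* Return the mimetype for name, or NULL if no glob matches.
 * Case-sensitive rules are tried first against the real name, so that
 * "*.C" wins over "*.c" for "main.C". */
static const char *lookup_glob(const struct glob_rule *rules, size_t n,
                               const char *name)
{
    char lower[256];
    size_t len = strlen(name);

    if (len >= sizeof lower)
        return NULL;
    for (size_t i = 0; i <= len; i++)    /* copies the trailing NUL too */
        lower[i] = (char)tolower((unsigned char)name[i]);

    /* Pass 1: case-sensitive rules, real filename. */
    for (size_t i = 0; i < n; i++)
        if (rules[i].case_sensitive &&
            fnmatch(rules[i].pattern, name, 0) == 0)
            return rules[i].mime;

    /* Pass 2: case-insensitive rules, lower-cased filename. */
    for (size_t i = 0; i < n; i++)
        if (!rules[i].case_sensitive &&
            fnmatch(rules[i].pattern, lower, 0) == 0)
            return rules[i].mime;

    return NULL;
}
```

Note that a case-sensitive all-lowercase glob such as "core" also behaves
as wanted here: "core" matches in the first pass, while "Core" matches
nothing because the second pass only looks at case-insensitive rules.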
> I included a suggested format change for the mimeinfo.cache file, but I'll
> have to let you implement that part, I don't know all the details about the
> suffix tree etc. Same for the xdgmime implementation.
I don't think the mimeinfo.cache changes are quite right. The literals
are stored sorted and looked up with a binary search, so a per-entry
flag cannot help: we would only see the flag after finding the entry,
and a name that differs in case never finds it in the first place.
Rather we have to add an additional CaseInsensitiveLiteralList. Also,
we'd have to specify that the elements in this list are stored in lower
case, sorted, so that a case-insensitive bsearch works.
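Roughly what I have in mind for the two literal lists, as an in-memory
sketch rather than the actual cache layout. The struct and function names
(literal, find_literal, lookup_literal) are invented for this example, and
both arrays are assumed to be sorted in strcmp() order.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical literal entry: exact filename -> mimetype. */
struct literal {
    const char *name;
    const char *mime;
};

static int cmp_literal(const void *key, const void *elem)
{
    return strcmp((const char *)key, ((const struct literal *)elem)->name);
}

/* Binary search in one sorted list. */
static const char *find_literal(const struct literal *list, size_t n,
                                const char *name)
{
    const struct literal *hit =
        bsearch(name, list, n, sizeof *list, cmp_literal);
    return hit ? hit->mime : NULL;
}

/* cs: case-sensitive list, searched with the real filename.
 * ci: case-insensitive list, stored in lower case and searched with
 *     the lower-cased filename. */
static const char *lookup_literal(const struct literal *cs, size_t n_cs,
                                  const struct literal *ci, size_t n_ci,
                                  const char *filename)
{
    char lower[256];
    size_t len = strlen(filename);
    const char *mime = find_literal(cs, n_cs, filename);

    if (mime || len >= sizeof lower)
        return mime;
    for (size_t i = 0; i <= len; i++)    /* copies the trailing NUL too */
        lower[i] = (char)tolower((unsigned char)filename[i]);
    return find_literal(ci, n_ci, lower);
}
```

The key point is that the case folding happens before the search, which
is why the second list has to be stored already lower-cased and sorted.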
For the glob list a simple flag works.
The suffix tree is a search tree starting from the end of the filename.
It works like this:
Take a string "foobar.tar.gz", then start at the root node of the tree
and look for the current char (z in this case) by doing a binary search
on the children. If you find z, follow that and continue looking for a
hit on the next char. If any search fails to match the current char,
look for a child with character 0; if found, it points to the
matching mimetype. If no such hit, return back to backtrack looking for
This has the same issue as the literals, so we need two trees, one case
sensitive and one case insensitive.
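As an illustration of the two-tree idea, here is a simplified in-memory
sketch. The real cache stores the tree as offsets in a binary file; the
node struct, the function names and the small example trees (with
illustrative mimetype strings) are all assumptions made up for this
example. Backtracking is done by recursion: a deeper match is preferred,
and a 0 child is used when the deeper search fails.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical node: children are sorted by ch; ch == 0 is a terminal. */
struct node {
    unsigned char ch;
    const char *mime;              /* set only on terminal children */
    const struct node *children;
    size_t n_children;
};

/* Binary search among the children of n. */
static const struct node *find_child(const struct node *n, unsigned char ch)
{
    size_t lo = 0, hi = n->n_children;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (n->children[mid].ch == ch)
            return &n->children[mid];
        if (n->children[mid].ch < ch)
            lo = mid + 1;
        else
            hi = mid;
    }
    return NULL;
}

/* Walk name[0..len) from its last char towards the front. */
static const char *suffix_lookup(const struct node *n, const char *name,
                                 size_t len)
{
    if (len > 0) {
        const struct node *c = find_child(n, (unsigned char)name[len - 1]);
        if (c) {
            const char *m = suffix_lookup(c, name, len - 1);
            if (m)
                return m;          /* longer suffix matched */
        }
    }
    const struct node *t = find_child(n, 0);   /* backtrack to terminal */
    return t ? t->mime : NULL;
}

/* Case-insensitive example tree (lower case): *.c, *.gz, *.tar.gz */
static const struct node tgz_term[] = { { 0, "application/x-compressed-tar", NULL, 0 } };
static const struct node tgz_dot[]  = { { '.', NULL, tgz_term, 1 } };
static const struct node tgz_t[]    = { { 't', NULL, tgz_dot, 1 } };
static const struct node tgz_a[]    = { { 'a', NULL, tgz_t, 1 } };
static const struct node gz_rest[]  = { { 0, "application/x-gzip", NULL, 0 },
                                        { 'r', NULL, tgz_a, 1 } };
static const struct node gz_dot[]   = { { '.', NULL, gz_rest, 2 } };
static const struct node gz_g[]     = { { 'g', NULL, gz_dot, 1 } };
static const struct node c_term[]   = { { 0, "text/x-csrc", NULL, 0 } };
static const struct node c_dot[]    = { { '.', NULL, c_term, 1 } };
static const struct node ci_top[]   = { { 'c', NULL, c_dot, 1 },
                                        { 'z', NULL, gz_g, 1 } };
static const struct node ci_root    = { 0, NULL, ci_top, 2 };

/* Case-sensitive example tree: *.C */
static const struct node C_term[]   = { { 0, "text/x-c++src", NULL, 0 } };
static const struct node C_dot[]    = { { '.', NULL, C_term, 1 } };
static const struct node cs_top[]   = { { 'C', NULL, C_dot, 1 } };
static const struct node cs_root    = { 0, NULL, cs_top, 1 };

/* Case-sensitive tree first, then the lower-cased name in the other. */
static const char *lookup_suffix(const char *name)
{
    char lower[256];
    size_t len = strlen(name);
    const char *m = suffix_lookup(&cs_root, name, len);

    if (m || len >= sizeof lower)
        return m;
    for (size_t i = 0; i <= len; i++)
        lower[i] = (char)tolower((unsigned char)name[i]);
    return suffix_lookup(&ci_root, lower, len);
}
```

A name like "bar.gz" shows why the backtracking matters: the walk follows
the r and a of "bar" into the *.tar.gz branch, fails there, and has to
come back to the 0 child under ".gz".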
Furthermore, these changes are incompatible, so we need to bump the
minor version in the header.