simple search api (was Re: mimetype standardisation by testsets)

Fri Nov 24 16:15:39 EET 2006

Mikkel Kamstrup Erlandsen writes:
 > It seems to me that what you are suggesting is actually the RDF query
 > language (as I mentioned earlier).
 > I think it would be a good idea to have a method in an advanced interface to
 > query using RDF (or anything else we decide upon) - I just think we should
 > leave it out of org.freedesktop.search.simple.

No, I was not really suggesting RDF. I think that there is quite a lot of
room between the vast complexity of RDF and the current ultra-simple
approach. And RDF, as I understand it, is primarily oriented towards
metadata, not text content? I admit I don't know much about RDF.

 > > [phrases being case-sensitive]
 > The Spotlight way of doing this is to add a c to the end of the phrase, e.g.
 > "Hello World"c. I dislike using letters like this, I could accept using a
 > symbol to mark a phrase as case sensitive.

I agree with you that this is not pretty, but it's certainly better than
nothing.

 > >- wildcards/masking: maybe there should be some kind of option to turn this
 > >   on/off, but the current language does not make it easy. Or at the very
 > >   minimum specify \-escaping or such.
 > Do you want to turn this off in the search engine? I don't think I
 > understand what you mean...
 > We could have implicit escaping by quoting the string, e.g. "c* algebra". I'm
 > not against escaping of special chars as such, except I don't think it is
 > standard in Lucene...

Yes, I'd like to be able to turn this off in the search engine. I would like
a generic interface to be able to search for C* when that becomes the
next programming language du jour. Quoting would probably be ok, if we
remove the case-matching issue, but then we lose the ability to mix
phrase/proximity search and wild-carding. That is, I'd like to have a
query-time choice to decide that "someword* someother" will mean either
 (someword OR somewordbla OR somewordblu) PHRASE 2 someother
or exactly
 someword* PHRASE 2 someother
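
To make the two readings concrete, here is a minimal sketch in Python. All
of it is hypothetical: the function name expand_term, the wildcards flag and
the toy vocabulary are mine, not part of the draft or of any existing
indexer. It only illustrates a query-time switch between masking and
literal matching for one term of the phrase.

```python
import re

def expand_term(term, vocabulary, wildcards=True):
    """Return the index terms that one query term should match.

    wildcards=True:  '*' is a masking character and the term is expanded
                     against the vocabulary.
    wildcards=False: the term is taken literally, so 'c*' can only match
                     the exact string 'c*'.
    (Hypothetical helper, for illustration only.)
    """
    if wildcards and "*" in term:
        # Escape everything except '*', which becomes "any run of characters".
        pattern = re.compile(".*".join(re.escape(p) for p in term.split("*")))
        return sorted(w for w in vocabulary if pattern.fullmatch(w))
    return [term] if term in vocabulary else []

# Toy vocabulary standing in for the terms known to an index.
VOCABULARY = {"someword", "somewordbla", "somewordblu", "someother",
              "c", "c*", "cobol"}

expanded = expand_term("someword*", VOCABULARY)            # masking on
literal = expand_term("c*", VOCABULARY, wildcards=False)   # masking off
```

With masking on, "someword*" expands to the three someword forms, which
would then be OR-ed inside the phrase clause. With masking off, "c*" stays
a single literal term.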

Again, this is because we live in a time when the weirdest character
strings may need to be searchable, not just the good old dictionary
words. In my opinion, this interface should not be tailored strictly to
what the currently dominant indexer can handle.

 > >- There must be some provision to control stemming. Again, something that
 > >   would be easy to do in a structured language, or already provided for in
 > >   one of the existing ones.
 > 
 > Assuming we are talking about the search language. Is a "language"  switch
 > not enough? "øl the language:danish" would match posts with "øl" and "the"
 > while "øl the language:english" would match fx. "ol" and discard "the" as a
 > stop word.

What I would like is a capability to control whether a search for
[flooring] will be expanded to [floor floors floorings floored] or not (in
addition to specifying the national language).
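
As a rough illustration of what I mean, here is a minimal Python sketch.
The function expand_stems, the stemming flag and the hand-written table are
assumptions made up for this example. A real backend would use an actual
stemmer for the selected language instead of a lookup table.

```python
def expand_stems(term, stem_groups, stemming=True):
    """Return the set of word forms to search for.

    stemming=False: search for the term exactly as typed.
    stemming=True:  search for every form that shares a stem with it.
    (Hypothetical helper; stem_groups stands in for a real stemmer.)
    """
    if not stemming:
        return {term}
    for forms in stem_groups:
        if term in forms:
            return set(forms)
    return {term}  # unknown word: nothing to expand

# Hand-written stem table, for illustration only.
STEM_GROUPS = [{"floor", "floors", "flooring", "floorings", "floored"}]

with_stemming = expand_stems("flooring", STEM_GROUPS)
without_stemming = expand_stems("flooring", STEM_GROUPS, stemming=False)
```

The point is only that the choice is made per query, independently of the
language switch.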

 > All indexable objects have a unique URI, that is kind of an unspoken premise
 > of the current draft. I don't think we should have a standard way to point
 > at an attachment inside an email. Evolution (the mail client) uses one kind
 > of uris and I expect KMail to use another - I don't find it realistic that
 > we expect them to change that. The only assumption I think we should make is
 > that indexed object be it conversations, emails, notes, etc, should be
 > uniquely determined by their uri.

Ok with this, but how does the client access the file name for a
multi-document file? We may decide that this is not useful if we are
confident that it will never be needed.

 > >- Using the query string as a query identifier is certainly feasible (ie
 > >   for repeated calls to Query() with successive offsets), but it somehow
 > >   doesn't feel right. Shouldn't there be some kind of specific query
 > >   identifier ? Query strings can be quite big (ie, after expansion by some
 > >   preprocessor).
 > 
 > That is a good point, this more or less implies the need for a server side
 > Query object... This requires a fair deal of logic to be imposed on the
 > backends though (session awareness), and I'm not keen on requiring any more
 > than an interface.

Don't you think that backends will need some level of
session-awareness for good performance anyway? Either that or a query cache,
which might be even more complicated to manage.
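
For what it's worth, the amount of session state I have in mind is small.
Below is a minimal Python sketch of a query handle. The class QuerySessions
and its methods new_query, get_hits and close are hypothetical names, not
anything from the current draft. The idea is only that the client passes a
short identifier for successive offsets instead of resending the whole
query string.

```python
import itertools

class QuerySessions:
    """Toy server-side registry mapping short query ids to result lists.
    (Hypothetical; a real backend would also need expiry and limits.)"""

    def __init__(self, search_fn):
        self._search_fn = search_fn    # callable: query string -> list of URIs
        self._results = {}             # query id -> list of URIs
        self._ids = itertools.count(1)

    def new_query(self, query_string):
        """Run the search once and return a small integer identifier."""
        qid = next(self._ids)
        self._results[qid] = self._search_fn(query_string)
        return qid

    def get_hits(self, qid, offset, limit):
        """Return one page of results without re-running the search."""
        return self._results[qid][offset:offset + limit]

    def close(self, qid):
        """Forget the results held for this query."""
        self._results.pop(qid, None)

# Fake search function that records how often it is actually called.
calls = []
def fake_search(query_string):
    calls.append(query_string)
    return ["file:///doc%d" % i for i in range(5)]

sessions = QuerySessions(fake_search)
qid = sessions.new_query("some very long, preprocessor-expanded query")
first_page = sessions.get_hits(qid, 0, 2)
second_page = sessions.get_hits(qid, 2, 2)
sessions.close(qid)
```

Both pages come from a single run of the search, which is the performance
argument. The cost is that the backend has to decide when to drop queries
that are never closed.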

There are several references to Lucene in your message. I really think that
this spec should also take into account the non-Lucene ways of doing things
if it is to offer real choices to users. Otherwise, why not just be satisfied
with the current and upcoming crop of Beagle front-ends?

Regards,
J.F. Dockes