simple search api (was Re: mimetype standardisation by testsets)

Fri Nov 24 16:28:18 EET 2006

Magnus Bergman writes:
 > On Fri, 24 Nov 2006 12:25:41 +0100
 > Jean-Francois Dockes <jean-francois.dockes at wanadoo.fr> wrote:
 > >
 > > [lots of query language ramblings]
 > 
 > I agree with everything above. By the way, it might also be useful
 > to be able to add weight to the sub-queries.

Something easily done with an XML attribute, I believe, which could just as
easily be ignored by a backend that does not support it.
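
To make the idea concrete, here is a minimal sketch in Python. The element
and attribute names (query, term, weight) and the parse_terms() helper are
made up for illustration; they are not taken from the draft. It only shows
that a backend without weight support can skip the attribute and still
parse the same query.

```python
import xml.etree.ElementTree as ET

# Hypothetical query syntax: the element and attribute names below are
# assumptions for illustration, not part of the draft spec.
QUERY = """
<query op="or">
  <term weight="2.0">budget</term>
  <term>forecast</term>
</query>
"""


def parse_terms(xml_text, use_weights=True):
    """Return a list of (word, weight) pairs for the sub-queries.

    A backend without weight support passes use_weights=False and
    simply treats every term as having weight 1.0.
    """
    root = ET.fromstring(xml_text.strip())
    terms = []
    for node in root.iter("term"):
        if use_weights:
            # A missing attribute falls back to a neutral weight.
            weight = float(node.get("weight", "1.0"))
        else:
            weight = 1.0
        terms.append((node.text.strip(), weight))
    return terms


if __name__ == "__main__":
    print(parse_terms(QUERY))
    print(parse_terms(QUERY, use_weights=False))
```

The same query text is valid for both kinds of backend, which is the whole
point of carrying the weight as an optional attribute.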

 > > - Documents and files are not the same thing (think email message
 > > inside an Inbox, Knotes). Both have their uses on the client side
 > > though (document identifier to request a snippet, or a text preview,
 > > file to, well, do something with the file). I don't know of a
 > > standard way to designate a message inside an mbox file, this is a
 > > tricky issue. We can probably see the document identifier as opaque,
 > > and interpreted only in the backend. The file identifier needs to be
 > > visible. Or is there a standard way to separate the File and Subdoc
 > > parts in what the draft calls uris ?
 > 
 > Are you also thinking the problem of presenting the right (virtual)
 > document to the user? Having an opaque identifier for the document is a
 > good idea. But this requires that the backend also knows about how to
 > create something the user can view out of this identifier, otherwise
 > it's not of much use. The indexer always has some kind of filter which
 > at least turns the document into a stream of words. Are these thoughts
 > perhaps beyond the scope of the problem discussed?

Yes, in my view, we need two pieces of information about documents like
email messages:
 - An opaque handle (currently called uri in the spec), which the client
   can use to request things from the backend (such as a pure text preview,
   or snippets or whatever attributes).
 - A file name which may be of use to some clients. I'm not completely
   sure it's strictly needed, but I'm also not comfortable deciding that
   it's not. Maybe this could actually be one of the retrievable
   attributes (out of GetProperties()). This would solve the issue within
   the current interface definition, and would be ok with me.
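
A rough sketch of what I mean, in Python. Only the name GetProperties()
comes from the draft; the signature used here, the handle format, the
property names and the get_preview() call are all assumptions made up for
the example. Two messages live in the same mbox file, each with its own
opaque handle, and the file name is just another property.

```python
class ToyBackend:
    """Toy in-memory backend. Handle format and property names are
    assumptions for illustration, not part of the draft."""

    def __init__(self):
        # handle -> stored attributes. Only the backend knows how a
        # handle maps to a file plus a sub-document inside it.
        self._docs = {
            "h-0017": {
                "filename": "/home/user/Mail/Inbox",
                "title": "Re: meeting",
                "text": "See you on Monday at ten.",
            },
            "h-0018": {
                "filename": "/home/user/Mail/Inbox",
                "title": "Lunch?",
                "text": "Are you free at noon today?",
            },
        }

    def get_properties(self, handle, names):
        """Return the requested attributes that exist for this handle."""
        doc = self._docs[handle]
        return {name: doc[name] for name in names if name in doc}

    def get_preview(self, handle, max_chars=10):
        """Return the first max_chars characters of the document text."""
        return self._docs[handle]["text"][:max_chars]


if __name__ == "__main__":
    backend = ToyBackend()
    # The client never parses the handle, it only passes it back.
    print(backend.get_properties("h-0017", ["filename", "title"]))
    print(backend.get_preview("h-0018"))
```

A client that wants to do something with the file asks for the filename
property; a client that only wants previews never needs to know it.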

 > > - Using the query string as a query identifier is certainly feasible
 > > (ie for repeated calls to Query() with successive offsets), but it
 > > somehow doesn't feel right. Shouldn't there be some kind of specific
 > > query identifier ? Query strings can be quite big (ie, after
 > > expansion by some preprocessor).
 > 
 > As I wrote before I think it's a good idea to have a search object. The
 > search represents a running/finished search and is created when the
 > search is started (by submitting the query). As opposed to a query
 > object which usually refers to a compiled query that might not have been
 > submitted yet.

Yes, I think there will be query and search objects (your terms) in any
reasonable back-end implementation. The question is how visible they will
be on the client side.

On the other hand (this refers also to one of your previous remarks), a
desktop search framework does have different issues compared to a shared
one: the index will typically be smaller and the available resources much
greater, so that it may be just acceptable to run a query again and again,
and count on the system cache to have the appropriate disk blocks in
memory...
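
For comparison, a small Python sketch of the two styles. Everything here
(run_query(), query_page(), SearchRegistry, the fake index) is invented for
the example and is not the draft's interface. The first style re-runs the
query for each page and uses the query string as the only identifier; the
second hands the client a short search id and keeps the results in the
backend.

```python
import itertools

# Stand-in for a real index: ten fake document handles.
INDEX = ["doc%d" % i for i in range(10)]


def run_query(query):
    """Pretend to evaluate the query; in this toy every document matches."""
    return list(INDEX)


# Style 1: stateless. The backend re-runs the query for every page.
def query_page(query, offset, count):
    return run_query(query)[offset:offset + count]


# Style 2: a search object, identified by a short id instead of the
# (possibly very long) query string. Results are kept in the backend.
class SearchRegistry:
    def __init__(self):
        self._ids = itertools.count(1)
        self._results = {}

    def start(self, query):
        """Run the query once and return an id for the running search."""
        search_id = next(self._ids)
        self._results[search_id] = run_query(query)
        return search_id

    def page(self, search_id, offset, count):
        return self._results[search_id][offset:offset + count]

    def close(self, search_id):
        """Release the stored results; unknown ids are ignored."""
        self._results.pop(search_id, None)


if __name__ == "__main__":
    print(query_page("some query", 3, 2))
    registry = SearchRegistry()
    sid = registry.start("some query")
    print(registry.page(sid, 3, 2))
    registry.close(sid)
```

Both return the same page; the difference is who pays, the backend with
memory for stored results or the machine with repeated query runs.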

jf