simple search api (was Re: mimetype standardisation by testsets)

Wed Dec 20 11:29:27 EET 2006

Mikkel Kamstrup Erlandsen writes:
 > 2006/12/19, Jean-Francois Dockes <jean-francois.dockes at wanadoo.fr>:
 > > Especially, returning the initial result as a sequence of uris (presumably
 > > ordered by relevance), and then using them as keys to retrieve properties
 > > is very awkward on the server side (forces maintaining a map from uris to
 > > result entries).
 > >
 > > I think it would be much simpler to have a single call like:
 > >
 > > Query (in s query, in i offset, in i limit, in as properties, out aa{ss}
 > > hits)
 > >
 > > where 'properties' is the list of requested properties, and 'hits' is an
 > > array of property (name,value) maps, one for each result entry.
 > 
 > I think you are quite right. Except that maybe the output parameter of
 > simple.Query should be a{sa{sas}} - a map mapping uris to maps of
 > property-valuelist pairs. The trick is that metadata fields can have several
 > values (like the simple.GetProperties method). If I request the Email.CC and
 > Email.To fields for example I'd get something like
 > 
 > {
 >   "email://mail_indetifier1" : {
 >     "Mail.CC" : [foo at bar.xyz, emfle at birnan.xyz],
 >     "Mail.To" : ["linus.torvalds at microsoft.com"]
 >   }
 >   "email://mail_indetifier2" : {
 >     "Mail.CC" : [foo at bar.xyz],
 >     "Mail.To" : ["bill at osdl.org"]
 >   }
 > }

This is where we disagree. You are requesting a sequence of 'limit' results,
starting at 'offset'. There is no reason to have special treatment for
the URI. It's just another property. The result list should be like:

{
  "URI"     : "email://mail_indetifier1",
  "Mail.CC" : ["foo at bar.xyz", "emfle at birnan.xyz"],
  "Mail.To" : ["linus.torvalds at microsoft.com"]
}
{
  "URI"     : "email://mail_indetifier2",
  "Mail.CC" : ["foo at bar.xyz"],
  "Mail.To" : ["bill at osdl.org"]
}

Just an ordered sequence of maps. The implicit key to the sequence is the
record number, from 'offset' to 'offset+limit-1'.
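
To make the shape of that call concrete, here is a rough Python sketch of a
Query() that returns one property map per hit, with "URI" treated like any
other property. The in-memory INDEX, the substring matching and the function
name are assumptions for illustration only, not part of any proposed spec:

```python
# Hypothetical in-memory index standing in for a real backend.
# A real engine would order the matches by relevance.
INDEX = [
    {"URI": "email://mail_indetifier1",
     "Mail.CC": ["foo at bar.xyz", "emfle at birnan.xyz"],
     "Mail.To": ["linus.torvalds at microsoft.com"]},
    {"URI": "email://mail_indetifier2",
     "Mail.CC": ["foo at bar.xyz"],
     "Mail.To": ["bill at osdl.org"]},
]


def query(query_string, offset, limit, properties):
    """Return up to 'limit' hits starting at 'offset', one map per hit."""
    matches = []
    for record in INDEX:
        # Flatten single values and value lists into one list of strings.
        values = []
        for value in record.values():
            values.extend([value] if isinstance(value, str) else value)
        # Toy matching: the query string occurs in any property value.
        if any(query_string in v for v in values):
            matches.append(record)
    page = matches[offset:offset + limit]
    # Keep only the requested properties; "URI" gets no special treatment.
    return [{name: rec[name] for name in properties if name in rec}
            for rec in page]


hits = query("foo", 0, 10, ["URI", "Mail.To"])
```

The position of each map in the returned list, added to 'offset', is the
record number; nothing on the backend needs to be looked up by URI.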

I think that it is wrong to make the URI such a central element. It is not
so special for any backend I have had the opportunity to look at.

 > The GetSnippet method must have a query string to match up against -
 > GetProperties do only need an uri and a list of requested props.
 > Arguable they could both be merged into Query, but I feel it might be
 > overkill issuing a separate query to retrieve given metadata fields on a
 > given uri - that is more like a lookup in my mind (and also is for some
 > engines).

The GetSnippet method, if you need to have one, can use the same ordinal key
that Query() is implicitly using. Using the URI for this forces awkward
processing on the backend side with no benefit to the application (which
has to know the index of a result anyway).

 > You can't merge GetSnippet into you main query it is a relatively slow
 > operation on most engines, so you have to do that after you got the actual
 > hit.

Ok, so you don't request "Snippet" as a property in the initial query, and
then call Query() again with the appropriate record number, requesting only
the "Snippet" property for the record you want the snippet for. If getting a
snippet is slow and costly, the overhead of an extra dbus call for it should
not be an issue.
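
As a rough illustration of that two-step use, here is a minimal Python
sketch. The DOCS list, the make_snippet() helper and the substring matching
are all hypothetical stand-ins for a real engine; the only point is that the
expensive "Snippet" property is computed only when it is asked for:

```python
# Hypothetical document store: (uri, text) pairs in relevance order.
DOCS = [
    ("file:///a.txt", "the quick brown fox jumps over the lazy dog"),
    ("file:///b.txt", "a fox is a small omnivorous mammal"),
    ("file:///c.txt", "nothing relevant here"),
]


def make_snippet(text, term, width=10):
    """Return the text around the first occurrence of 'term'."""
    pos = text.find(term)
    return text[max(0, pos - width):pos + len(term) + width]


def query(query_string, offset, limit, properties):
    """Return one property map per hit for records offset..offset+limit-1."""
    matches = [(uri, text) for uri, text in DOCS if query_string in text]
    hits = []
    for uri, text in matches[offset:offset + limit]:
        hit = {}
        if "URI" in properties:
            hit["URI"] = uri
        # The costly part only runs when "Snippet" was requested.
        if "Snippet" in properties:
            hit["Snippet"] = make_snippet(text, query_string)
        hits.append(hit)
    return hits


# First call: cheap properties only, for the whole result page.
page = query("fox", 0, 10, ["URI"])
# Second call: the snippet for record number 1 only.
detail = query("fox", 1, 1, ["Snippet"])
```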

Or if you really want to, you could define a call requesting snippets for a
list of result numbers. All I'm saying is that 'URI' is not a good result
identifier. 
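
Such a batch call could look roughly like the Python sketch below. The name
get_snippets(), the TEXTS list and the matching rule are assumptions made up
for the example, not an existing interface:

```python
# Hypothetical indexed texts, in relevance order.
TEXTS = [
    "the quick brown fox jumps over the lazy dog",
    "a fox is a small omnivorous mammal",
    "nothing relevant here",
]


def get_snippets(query_string, result_numbers, width=10):
    """Map each valid result number to a snippet for that result."""
    matches = [text for text in TEXTS if query_string in text]
    snippets = {}
    for number in result_numbers:
        # Silently skip numbers that fall outside the result list.
        if 0 <= number < len(matches):
            text = matches[number]
            pos = text.find(query_string)
            end = pos + len(query_string) + width
            snippets[number] = text[max(0, pos - width):end]
    return snippets


snippets = get_snippets("fox", [0, 1, 5])
```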

 > These was the reasons why I split the methods like I did and I still think
 > they hold...

My central point is that 'URI' is not a good result identifier. Results are
not organized by URI on either the application or the backend side. The
result list is an ordered sequence, and the natural accessor is the position
in the sequence.

Regards,
J.F. Dockes