simple search api (was Re: mimetype standardisation by testsets)

Thu Dec 21 20:07:26 EET 2006

Mikkel Kamstrup Erlandsen writes:
 > Having a unique handle for each document is a really handy thing. A unique
 > handle could be many other things than an uri, fx a unique integer or
 > anything as well. The way you describe this handle is just a number
 > specifying an entry in an array and is in this way not unique (except in the
 > context of a query).

That is right, the result sequence entry number is only usable in the
context of a given query.

 > The uri is just a more convenient handle than fx an integer - for
 > applications at least (the toolkit and platform libs often handle uris
 > directly).

It depends. This is only true for documents with a well-known URI-based
access method, which is not the general case (e.g. not true for email
messages).

 > I respect your disagreement, and would really like to hear what the other
 > guys think...

Me too ... I am going to restate the problem as I see it, in a way which is
probably somewhat xapian-specific. Hopefully someone will be able to concur
or contradict this for other engines.

After it has received and processed a query, the backend has some kind of
structure that represents the results. This is an ordered sequence of
documents, where the order depends on the kind of sorting that was
requested. Typically, and by default, this will be by order of relevance.

The result set is mostly organized to be accessed by sequence number, and
things are set up so that you can retrieve auxiliary data for a given
result: a result entry will have some kind of document number which can be
used in turn to access document data efficiently.

Access by URI is not part of the scheme.
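
To make this concrete, here is a minimal sketch of such a result structure.
It is a toy model and not actual Xapian code: the Hit class, the DOCS store
and document_at() are hypothetical names made up for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    docid: int    # backend-internal document number (hypothetical)
    score: float  # relevance, used here as the sort key


# Hypothetical document store, keyed by document number.
DOCS = {
    3: {"uri": "file:///home/me/draft.txt", "title": "Draft"},
    7: {"uri": "file:///home/me/notes.txt", "title": "Notes"},
}

# Result set for one query: an ordered sequence, best hit first.
results = sorted([Hit(3, 0.41), Hit(7, 0.87)],
                 key=lambda hit: hit.score, reverse=True)


def document_at(seq: int) -> dict:
    """Sequence number -> document number -> document data."""
    return DOCS[results[seq].docid]
```

Nothing in this structure is indexed by URI: the URI is just one more field
of the document data.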

This means that a backend for the current WasabiSimple api would have to:

 - In response to Query(), run a query, then walk the result set to request
   document data, select the URIs, and return only these (while it has
   access to all the document fields at the same time at no additional cost).

 - In response to GetProperties() or GetSnippets() either run another,
   different query to get the corresponding document data (this is an
   actual db query using the URI as a unique term, not the simpler and more
   efficient access to data by document number), or use some kind of cache
   that it would hopefully have built in response to the first Query()
   call, hoping that the requested URI->document association is
   actually cached.
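
The two steps above can be sketched as follows. This is a toy backend
written only to count the extra lookups; ToyBackend, query(),
get_properties() and the uri_queries counter are hypothetical stand-ins for
Query() and GetProperties(), not the WasabiSimple definitions, and no cache
is modelled.

```python
class ToyBackend:
    """Hypothetical backend; counts lookups that need a query by URI."""

    def __init__(self, docs):
        self.docs = docs      # docid -> {"uri": ..., "title": ...}
        self.uri_queries = 0  # number of "find document by URI" lookups

    def _match(self, text):
        # Stand-in for the real query: docids whose title contains text.
        return [d for d, doc in sorted(self.docs.items())
                if text in doc["title"]]

    def query(self, text):
        # All fields are at hand here, but only the URIs are returned.
        return [self.docs[d]["uri"] for d in self._match(text)]

    def get_properties(self, uri, names):
        # No document number available: find the document again by URI.
        self.uri_queries += 1
        doc = next(doc for doc in self.docs.values() if doc["uri"] == uri)
        return {name: doc[name] for name in names}


backend = ToyBackend({
    1: {"uri": "file:///a.txt", "title": "search api notes"},
    2: {"uri": "file:///b.txt", "title": "search api draft"},
    3: {"uri": "file:///c.txt", "title": "unrelated"},
})
uris = backend.query("search api")
titles = [backend.get_properties(u, ["title"])["title"] for u in uris]
```

Displaying two titles costs one query plus two lookups by URI, although the
titles were already available while the first query was being answered.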

This is very awkward, and offers no real benefit to the application, which
could just as well decide up front what kind of metadata it wants back, and
get it all in response to the Query() call.

 > Pasting in Jean-Francois'  follow up mail:
 >  > As an afterthought to my previous message (sorry), the result list could
 >  > change if the query has to be re-run. This is a good reason for keeping
 >  > the uris as document identifiers for getSnippets().
 > 
 > It would feel akward if you had to request a specific property (the uri) to
 > be able to obtain a snippet IMHO.

Sorry, but I can't see why. It would be vastly less awkward than the current
proposal, where you initially get a set of data for which you have no use
(try displaying a list of bare URIs to the user and see how they like it),
and then immediately make other calls to request data for each and every one
of the initial results (because there is no way to select among them).

 > Ok, my central point is: We need a unique handle for each document/object in
 > store - this should be used to identify the returned hits from Query().
 > Whether or not an opaque handle of some undefined sort or it is defined to
 > be the uri is another matter.

The nice thing with the URI is that it is independent of the query. The fact
that queries are only identified by the query string sort of implies that
the query result list might not be fully stable across calls, so that using
result sequence numbers as identifiers would be inappropriate.

So either we need a unique *query* identifier which would at least
enable detecting that the result set is stale, or we use URIs as document
identifiers for calls subsequent to the initial Query(). If we really want
to keep a separate GetSnippets() call, my proposal would be to:

 - Return all desired metadata with the initial Query() call as an
   appropriately ordered sequence (as determined by the user query sort
   parameters).
 - Request snippets by URI (which may imply running a different query for
   each snippet on the backend side, but hopefully not for all initial
   results).
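
A rough sketch of what these two calls could look like from the application
side, under my own assumptions: the function names, the "fields" and
"sort_by" parameters and the toy DOCS store are all hypothetical, and the
substring match only stands in for a real query.

```python
# Hypothetical document store with a few metadata fields and the body text.
DOCS = {
    1: {"uri": "file:///a.txt", "title": "search api notes",
        "date": "2006-12-20",
        "text": "notes on a simple search api for desktop search"},
    2: {"uri": "file:///b.txt", "title": "search api draft",
        "date": "2006-12-21",
        "text": "draft of the search api proposal"},
}


def query(text, fields, sort_by="date"):
    """Return the requested metadata for every hit, already in sort order."""
    hits = [doc for doc in DOCS.values() if text in doc["text"]]
    hits.sort(key=lambda doc: doc[sort_by], reverse=True)
    return [{f: doc[f] for f in fields} for doc in hits]


def get_snippets(uris, text, width=20):
    """Return a short extract around the first match, keyed by URI."""
    by_uri = {doc["uri"]: doc for doc in DOCS.values()}
    snippets = {}
    for uri in uris:
        body = by_uri[uri]["text"]
        pos = body.find(text)
        if pos < 0:
            snippets[uri] = ""  # no match in this document
        else:
            start = max(0, pos - width)
            snippets[uri] = body[start:pos + len(text) + width]
    return snippets


hits = query("search api", ["uri", "title"])        # enough to display a list
snips = get_snippets([hits[0]["uri"]], "search api")  # only where needed
```

The application can display the list right after the first call, and only
asks for snippets for the entries it actually shows.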

Note that both in this approach and in the current WasabiSimple proposal,
it may happen under some circumstances that the result sequence obtained
by successive Query() calls is imperfect, with duplicates or holes (if the
database has changed and the backend had to run the query again for some
reason). This forces requesting snippets by URI.
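
A small illustration of how a duplicate can appear, assuming that Query()
is called with some kind of offset and count (an assumption of mine, as are
the toy index and the query() helper below):

```python
def query(index, offset, count):
    """Re-run the query and return one page of URIs, best score first."""
    ranked = sorted(index, key=lambda uri: index[uri], reverse=True)
    return ranked[offset:offset + count]


# Hypothetical index: URI -> relevance score for some fixed query string.
index = {"uri:a": 0.9, "uri:b": 0.8, "uri:c": 0.7, "uri:d": 0.6}

page1 = query(index, 0, 2)        # first two hits
seq1_before = query(index, 1, 1)[0]

# The database changes between calls: a new, highly relevant document.
index["uri:new"] = 0.95

page2 = query(index, 2, 2)        # the backend has to run the query again
seq1_after = query(index, 1, 1)[0]

duplicates = set(page1) & set(page2)
```

Here "uri:b" shows up in both pages, and sequence number 1 no longer
designates the same document after the update, while each URI still
designates the same document.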

 > To the sorting problem I see two solutions. 
 > 1) Always return a score property as part of the response properties as
 >    defined in my proposal.  
 > 2) Always include the UniqueHandle property as part of the response as
 > defined by your proposal.

1) would have to be extended to some sort of ordinal value (not always
representing a score, as the results might have been requested ordered by,
e.g., date), and it burdens the application with sorting the results again.
Why make things so complicated?

About 2), URI *is* an appropriate handle, and probably the best as long as
we can't guarantee the stability of the result set (that is: *if* we need
separate Query() and GetSnippets() calls, *then* the URI is probably the
best identifier to ensure consistency).

Regards,
JF