simple search api (was Re: mimetype standardisation by testsets)

Fri Dec 22 14:29:24 EET 2006

2006/12/21, Jean-Francois Dockes <jean-francois.dockes at wanadoo.fr>:
>
> Mikkel Kamstrup Erlandsen writes:
> > Having a unique handle for each document is a really handy thing. A unique
> > handle could be many other things than a uri, e.g. a unique integer or
> > anything else. The way you describe this handle, it is just a number
> > specifying an entry in an array and is in this way not unique (except in
> > the context of a query).
>
> That is right, the result sequence entry number is only usable in the
> context of a given query.
>
> > The uri is just a more convenient handle than e.g. an integer - for
> > applications at least (the toolkit and platform libs often handle uris
> > directly).
>
> It depends, this is only true in the case of documents with a well-known
> URI-based access method, which is not the general case (i.e. not true for
> email messages).

Well, URLs and file URIs are readily usable by the apps, and they account for
a very large portion of the results in many cases. Also, if a standard URI
scheme for pointing at emails ever comes along, apps could start using it
right away.

> > I respect your disagreement, and would really like to hear what the other
> > guys think...
>
> Me too ... I am going to restate the problem as I see it, in a way which is
> probably somewhat xapian-specific. Hopefully someone will be able to concur
> or contradict this for other engines.
>
> After it has received and processed a query, the backend has some kind of
> structure that represents the results. This is an ordered sequence of
> documents, the order depends on the kind of sorting that was
> requested. Typically and by default this will be by order of relevance.
>
> The result set is mostly organized to be accessed by sequence number, and
> things are set up so that you can retrieve auxiliary data for a given
> result: a result entry will have some kind of document number which can be
> used in turn to access document data efficiently.
>
> Access by URI is not part of the scheme.
>
> This means that a backend for the current WasabiSimple api would have to:
>
> - In response to Query(), run a query, then walk the result set to request
>    document data, select the URIs, and return only these (while it has
>    access to all the document fields at the same time at no additional
>    cost).
>
> - In response to GetProperties() or GetSnippets(), either run another,
>    different query to get the corresponding document data (this is an
>    actual db query using the URI as a unique term, not the simpler and
>    more efficient access to data by document number), or use some kind of
>    cache that it would hopefully have built in response to the first
>    Query() call, hoping that the requested URI->document association is
>    actually cached.
>
> This is very awkward, and offers no real benefit to the application, which
> could just as well decide initially what kind of metadata it wants back,
> and get it all in response to the Query() call.

I think much the same applies to Lucene. I'll have to ask some of the
Lucene experts in the office in the new year.

Anyway, it is not a herculean task for the search engine to keep a
ring-buffer-like hashmap of the last 100 hits, uri->doc_id. Easier said than
done of course, since hashmaps have no obvious way of acting as a ring
buffer. There are also synchronization issues when the underlying index
changes...
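
To make the idea a bit more concrete, here is a rough Python sketch of such a
bounded uri->doc_id map. The class and method names (RecentHitCache,
remember, lookup, clear) are made up for illustration and are not part of any
proposal; the limit of 100 is just the number mentioned above. It leans on an
insertion-ordered map to get the "forget the oldest entry" behaviour, and it
sidesteps the synchronization problem by simply dropping everything when the
index changes:

```python
from collections import OrderedDict


class RecentHitCache:
    """Bounded uri -> doc_id map that forgets the oldest entry first.

    Hypothetical helper, not part of the WasabiSimple proposal.
    """

    def __init__(self, capacity=100):
        self.capacity = capacity
        self._entries = OrderedDict()

    def remember(self, uri, doc_id):
        # Re-inserting a known uri moves it to the "newest" end.
        if uri in self._entries:
            del self._entries[uri]
        self._entries[uri] = doc_id
        # Evict from the "oldest" end once the cache is over capacity.
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)

    def lookup(self, uri):
        # Returns None on a miss; the backend would then have to fall
        # back to a real index query using the uri as a unique term.
        return self._entries.get(uri)

    def clear(self):
        # Crude answer to index changes: doc_ids may no longer be valid,
        # so forget all of them.
        self._entries.clear()
```

A miss is always possible with this scheme, so the slower lookup by uri still
has to exist in the backend.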

> > Pasting in Jean-Francois' follow up mail:
> >  > As an afterthought to my previous message (sorry), the result list
> >  > could change if the query has to be re-run. This is a good reason for
> >  > keeping the uris as document identifiers for getSnippets().
> >
> > It would feel awkward if you had to request a specific property (the uri)
> > to be able to obtain a snippet IMHO.
>
> Sorry, but I can't see why. It would be vastly less awkward than the
> current proposal, where you initially get a set of data for which you have
> no use (try displaying a list of bare URIs to the user and see how they
> like it), and then immediately make other calls to request data for each
> and every one of the initial results (because there is no way to select
> among them).
>
> > Ok, my central point is: We need a unique handle for each document/object
> > in store - this should be used to identify the returned hits from Query().
> > Whether it is an opaque handle of some undefined sort or is defined to be
> > the uri is another matter.
>
> The nice thing with the URI is that it is independent of the query. The
> fact that queries are only identified by the query string sort of implies
> that the query result list might not be fully stable across calls, so that
> using result sequence numbers as identifiers would be inappropriate.
>
> So either we need a unique *query* identifier which would at least
> enable detecting that the result set is stale, or we use URIs as document
> identifiers for calls subsequent to the initial Query(). If we really want
> to keep a separate GetSnippets() call, my proposal would be to:
>
> - Return all desired metadata with the initial Query() call as an
>    appropriately ordered sequence (as determined by the user query sort
>    parameters).
> - Request snippets by URI (which may imply running a different query for
>    each snippet on the backend side, but hopefully not for all initial
>    results).

Ok. In this case I suggest always including an Object.URI property in the
return values. This way platform bindings can have a Hit.GetUri() method
that is guaranteed to make sense. With platform-binding-niceness in mind,
there could even be a *small* set of obligatory properties to return
(including uri) - I can't think of any others right now.
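
As a rough illustration of what a binding could look like under that rule,
here is a small Python sketch. Only the Object.URI property name and the idea
of a Hit.GetUri() method come from this thread; the Hit class layout, the
OBLIGATORY_PROPERTIES tuple and the requested_properties() helper are
assumptions of mine:

```python
# Assumption: the obligatory set contains only the uri for now.
OBLIGATORY_PROPERTIES = ("Object.URI",)


class Hit:
    """One result entry, carrying whatever properties Query() returned."""

    def __init__(self, properties):
        missing = [name for name in OBLIGATORY_PROPERTIES
                   if name not in properties]
        if missing:
            raise ValueError(
                "hit lacks obligatory properties: %s" % ", ".join(missing))
        self._properties = dict(properties)

    def get_uri(self):
        # Safe by construction: the constructor rejected hits without it.
        return self._properties["Object.URI"]

    def get_property(self, name, default=None):
        return self._properties.get(name, default)


def requested_properties(wanted):
    """Property names to ask the backend for: obligatory ones first,
    then whatever the application wants, without repeats."""
    names = list(OBLIGATORY_PROPERTIES)
    for name in wanted:
        if name not in names:
            names.append(name)
    return names
```

The point of the check in the constructor is that applications never have to
ask for the uri explicitly, yet get_uri() can never fail.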

> Note that both in this approach and in the current WasabiSimple proposal,
> it may happen under some circumstances that the result sequence obtained
> by successive Query() calls would be imperfect, with duplicates or holes
> (if the database has changed and the backend had to run the query again
> for some reason). This forces requesting snippets by URI.

Yeah, right. The simple api has inevitable shortcomings when it comes to
query integrity. I suspect most apps using the simple api will only call
Query() once per actual user query (i.e. they don't do paging); in this case
query integrity is less compromised.
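
For apps that do page, a client-side workaround could look roughly like the
Python sketch below. It assumes each hit is a plain dict of properties that
always contains Object.URI; merge_pages() is a hypothetical helper, not part
of the api. It can only drop duplicates, it has no way of noticing holes:

```python
def merge_pages(pages):
    """Concatenate pages of hits, skipping any uri that was already seen.

    Assumes every hit is a dict with an "Object.URI" key.
    """
    seen = set()
    merged = []
    for page in pages:
        for hit in page:
            uri = hit["Object.URI"]
            if uri in seen:
                # The index changed between calls and this hit slid
                # into a later page; keep only its first occurrence.
                continue
            seen.add(uri)
            merged.append(hit)
    return merged
```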

> > To the sorting problem I see two solutions.
> > 1) Always return a score property as part of the response properties as
> >    defined in my proposal.
> > 2) Always include the UniqueHandle property as part of the response as
> >    defined by your proposal.
>
> 1) would have to be extended to some sort of ordinal value (not always
> representing a score, the results might have been requested ordered by,
> e.g., date), and it burdens the application with sorting the results again.
> Why make things so complicated?
>
> About 2), URI *is* an appropriate handle, and probably the best as long as
> we can't guarantee the stability of the result set (that is: *if* we need
> separate Query() and GetSnippets() calls, *then* the URI is probably the
> best identifier to ensure consistency).

The thing is that I think we should keep the Simple and Live interfaces as
consistent as possible, returning and using similar data structures and
such. In the Live interface you simply have to include some sorting data in
the hits because they can be added and removed dynamically.

I suggested making URI an obligatory property to return. Maybe we have to
have some sorting info available too... In the simple api the sorting
property(ies) would be redundant though.
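
To show why the Live case needs sorting data, here is a rough Python sketch
of a client-side list that keeps hits ordered as they arrive and disappear.
Everything in it is an assumption for illustration (the LiveHitList name, a
numeric sort_value where higher comes first, hits identified by uri); the
Live interface itself is not specified in this thread:

```python
import bisect


class LiveHitList:
    """Keeps uris ordered by a numeric sort value, highest value first."""

    def __init__(self):
        self._keys = []  # negated sort values, kept in ascending order
        self._uris = []  # uris, parallel to self._keys

    def add(self, uri, sort_value):
        # Negate so that the highest sort value ends up at index 0.
        key = -sort_value
        index = bisect.bisect_right(self._keys, key)
        self._keys.insert(index, key)
        self._uris.insert(index, uri)
        return index  # position at which the UI should insert the hit

    def remove(self, uri):
        # Raises ValueError if the uri is not in the list.
        index = self._uris.index(uri)
        del self._keys[index]
        del self._uris[index]

    def uris(self):
        return list(self._uris)
```

Without a sort value in each hit, add() would have no way of knowing where a
newly arrived hit belongs.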

Cheers,
Mikkel