simple search api (was Re: mimetype standardisation by testsets)

Wed Dec 13 23:29:45 EET 2006

Sorry for the late reply; I've been totally bogged down by illness and work!
Here goes :-)

Because this mail is long, I summarize some important questions
at the bottom of the mail...

2006/11/30, Magnus Bergman <magnus.bergman at observer.net>:
>
>
> After reading everything in this thread and considering all the concerns
> mentioned, I think it's time I come up with something concrete myself,
> instead of just criticizing others. From the requirements and suggestions
> mentioned on this list I tried to come up with a proposal with these
> things in mind:
> * The possibility for application authors to do searches very easily.
> * The possibility to do both synchronous and asynchronous searches
>   without having two different APIs.
> * Not ruling out the possibility to use a dbus interface directly.
> * Not ruling out the possibility to have a library.
> * Not causing the search engine to do unnecessary work (like repeating
>   searches if the hits need to be retrieved again).
>
> One problem that I chose to leave out (for now) is the need to be able
> to stream documents which have a URL the applications cannot understand
> (documents which are not files). This includes files inside other files
> and virtual documents that are constructed on demand. But at least I
> have had this in mind so it's not impossible to add it later.

Well, I don't think this belongs in a search api as such. This functionality
sounds more like a metadata storage to me... which is planned for
standardization later - so, yeah, let's keep the options open, but punt the
issue for now.

> Disclaimer: The names of the functions are not part of the proposal,
> they are just chosen to illustrate what the functions do. And this
> proposal does not suggest a library API over a dbus interface, the
> exact same idea applies to both cases. (It is also agnostic to whatever
> query language is used.)
>
> First a set of three basic functions that alone do most things and
> are probably sufficient for everybody who wants a simple API:
>
>   session_handle = session_new()
>
>     Creates a new session and returns a new session handle. Creating a
>     new session might involve finding an appropriate search engine and
>     getting it ready (exactly what happens here is not important). This
>     might just be to open a dbus connection. I think it's OK if this
>     call is blocking(?). Applications would probably want to call it
>     during startup and it should not take that long to do whatever
>     needs to be done here (which of course depends on the search
>     engine).
>
>   search_handle = search_new(session_handle,query_string)
>
>     Starts a new search and returns a new search handle. By default
>     this function blocks until the search has been performed and the
>     number of hits is known (see below).
>
>   hits = search_get_hits(search_handle,max_number_of_hits)
>
>     Fetches a number of hits from the search. Each hit is a set of
>     attributes for the hit (by default it might be URL, score and
>     perhaps something else important). It can be called several times
>     to retrieve more hits (much like read(2)). The hits are sorted,
>     by default by their score.

Ok, more minimalistic than the current simple interface, but I guess it
could work.
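
To make the three-call flow concrete, here is a minimal sketch in Python. It
is only an illustration of the blocking case: the in-memory DOCUMENTS table,
the word-count scoring and the dict-based handles are all assumptions made up
for the example, not part of the proposal.

```python
# Toy in-memory stand-in for a search engine. Everything here is hypothetical.
DOCUMENTS = {
    "file:///home/me/notes.txt": "wasabi search api notes",
    "file:///home/me/todo.txt": "buy wasabi",
    "file:///home/me/letter.txt": "dear magnus",
}


def session_new():
    # A real implementation might open a dbus connection here.
    return {"searches": []}


def search_new(session_handle, query_string):
    # Blocking variant: all hits are known when this returns.
    words = query_string.split()
    hits = []
    for url, text in DOCUMENTS.items():
        score = sum(text.split().count(word) for word in words)
        if score > 0:
            hits.append({"url": url, "score": score})
    # Sort by score (highest first), then by URL to keep the order stable.
    hits.sort(key=lambda hit: (-hit["score"], hit["url"]))
    search_handle = {"hits": hits, "cursor": 0}
    session_handle["searches"].append(search_handle)
    return search_handle


def search_get_hits(search_handle, max_number_of_hits):
    # Like read(2): return the next batch and advance the cursor.
    start = search_handle["cursor"]
    batch = search_handle["hits"][start:start + max_number_of_hits]
    search_handle["cursor"] = start + len(batch)
    return batch


session = session_new()
search = search_new(session, "wasabi")
first = search_get_hits(search, 1)   # first hit only
rest = search_get_hits(search, 10)   # whatever is left
done = search_get_hits(search, 10)   # empty list: nothing more to read
```

An empty list plays the role that a zero return value plays for read(2).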

When I compared this interface to the current live one proposed at
http://wiki.freedesktop.org/wiki/WasabiSearchLive, my first thought was that
your session object was equivalent to the dbus connection made by the
application. In my proposal the app then uses the connection/session to
obtain a Query object with the NewQuery() method.

From an application developer's point of view this might be correct. I just
forgot to look at this through my search engine developer's glasses :-) From
the search engine's perspective it might actually be nice to have a parent
session for each query. This is actually something we have to ask the search
engine developers about. See the bottom of this mail.

> For slightly less simple use there are some more functions:
>
>   session_free(session_handle)
>
>     Frees all resources related to the session. This includes all
>     searches created from the session.

I think this should be available in a simple api if we use Session objects.
I do realise that you only want one api though :-)

>   session_set_search_finished_signal(session_handle,signal_handler)
>
>     Sets a signal handler which is invoked when a search has been
>     finished. The signal handler gets the search handle back so
>     different searches can be held apart. If this is set the function
>     search_new() will not block.

It seems simpler to me that the application simply asks "are you done?" each
time it receives a new batch of hits. A bit more dbus traffic, but not much.
Thus we would have a search_is_finished(search_handle) instead (which you
actually define below).
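
A rough sketch of that polling loop, in Python. ToySearch is a made-up
stand-in whose "engine" finds one more batch per call, and in this sketch
"finished" is assumed to mean that the engine is done and every hit has been
fetched. Both are assumptions for the example only.

```python
class ToySearch:
    """Hypothetical search whose engine finds one more batch per poll."""

    def __init__(self, batches):
        self._pending = list(batches)  # batches the engine has not found yet
        self._ready = []               # hits found but not yet fetched

    def _engine_step(self):
        # Pretend the engine made some progress since the last call.
        if self._pending:
            self._ready.extend(self._pending.pop(0))

    def search_get_hits(self, max_number_of_hits):
        self._engine_step()
        batch = self._ready[:max_number_of_hits]
        del self._ready[:max_number_of_hits]
        return batch

    def search_is_finished(self):
        # Assumption: finished = engine done and nothing left to fetch.
        return not self._pending and not self._ready


def collect_all(search, batch_size):
    """Fetch a batch, then ask "are you done?", until the answer is yes."""
    hits = []
    polls = 0
    while True:
        hits.extend(search.search_get_hits(batch_size))
        polls += 1
        if search.search_is_finished():
            break
    return hits, polls


all_hits, polls = collect_all(ToySearch([["a", "b"], ["c"]]), batch_size=1)
```

The extra traffic is one search_is_finished() call per batch, so a larger
batch size keeps it small.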

>   session_set_search_progress_signal(session_handle,signal_handler)
>
>     Sets a signal handler which is invoked when there are new hits
>     available (hits which haven't been retrieved with
>     search_get_hits()). The signal handler gets the search handle,
>     maybe some approximation about percentage of the progress and maybe
>     the number of new hits for convenience.

Why not return the hits with the signal? I see something cool in not
returning the results until the application specifically requests them
though. It is somewhat reminiscent of the way spotlight does it, and it is
also closer to what libbeagle does. This is an important point - I added it
to the bottom of this mail.
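
The two styles can be put side by side in a small Python sketch. ToyQuery,
connect_hits_added(), engine_found() and get_results() are hypothetical names
invented for this example. The only thing it is meant to show is what travels
with the signal in each case.

```python
class ToyQuery:
    """Hypothetical query object supporting both notification styles."""

    def __init__(self, push_hits):
        self.push_hits = push_hits  # True: the hits travel with the signal
        self._unfetched = []        # used only in the pull style
        self._handlers = []

    def connect_hits_added(self, handler):
        self._handlers.append(handler)

    def engine_found(self, hits):
        # Called by the (toy) engine when new hits show up.
        if self.push_hits:
            payload = list(hits)
        else:
            self._unfetched.extend(hits)
            payload = len(hits)     # only a count goes with the signal
        for handler in self._handlers:
            handler(self, payload)

    def get_results(self, max_hits):
        batch = self._unfetched[:max_hits]
        del self._unfetched[:max_hits]
        return batch


# Style A: the hits arrive with the signal.
pushed = []
query_a = ToyQuery(push_hits=True)
query_a.connect_hits_added(lambda query, hits: pushed.extend(hits))
query_a.engine_found(["a.txt", "b.txt"])

# Style B: the signal only announces hits, the application pulls them later.
counts = []
query_b = ToyQuery(push_hits=False)
query_b.connect_hits_added(lambda query, count: counts.append(count))
query_b.engine_found(["a.txt", "b.txt"])
pulled = query_b.get_results(1)  # the app decides how many it wants, and when
```

Style B costs an extra method call per batch, but the application never
receives hits it did not ask for.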

>   session_set_property(session_handle,property_name,value)
>
>     Sets a property of the session. This might include default sort
>     order, maximum number of hits (mostly as a hint to the search
>     engine), minimum score, default set of attributes for hits in new
>     searches, whether searches should live on (never considered finished
>     but continue to generate new hits if new matching documents show up)
>     and probably some other stuff.

The properties you mention sound more like properties of the query if you
ask me...

>   value = session_get_property(session_handle,property_name)
>
>     Does the expected.
>
>   search_free(search_handle)
>
>     Frees all resources related to the search. The search handle
>     becomes invalid afterwards.
>
>   search_is_finished(search_handle)
>
>     Checks if the search is finished yet.

Check, check, and check on those methods.

>   search_get_number_of_total_hits_so_far(search_handle)
>
>     Gets the total number of hits this search resulted in (minus the
>     ones discarded because of too low score of course). If the search
>     finished signal handler has been set the search might not yet be
>     finished and the number of hits so far is returned.
>
>   search_get_number_of_new_hits_so_far(search_handle)
>
>     Identical to the one above, but minus the number of hits already
>     retrieved using search_get_hits() (or skipped using
>     search_seek(), see below).
>
>   search_tell()
>
>     Tells how many hits have been retrieved so far.

The above three methods don't feel right... There seems to be some
bookkeeping that could be done on the client side just as well.
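
For search_tell() at least, the client-side bookkeeping is easy to sketch in
Python. CountingSearch and make_fake_backend() are hypothetical helpers made
up for the example. The total number of hits would still have to come from
the engine somehow, so this covers only the "how many have I retrieved" part.

```python
def make_fake_backend(all_hits):
    """Hypothetical stand-in for the engine's search_get_hits()."""
    remaining = list(all_hits)

    def get_hits(max_number_of_hits):
        batch = remaining[:max_number_of_hits]
        del remaining[:max_number_of_hits]
        return batch

    return get_hits


class CountingSearch:
    """Client-side wrapper that counts what it has fetched."""

    def __init__(self, get_hits):
        self._get_hits = get_hits  # callable(max_number_of_hits) -> list
        self.retrieved = 0         # what search_tell() would report

    def get_hits(self, max_number_of_hits):
        batch = self._get_hits(max_number_of_hits)
        self.retrieved += len(batch)
        return batch

    def new_hits(self, total_hits_so_far):
        # Needs the total from the engine, e.g. delivered with a signal.
        return total_hits_so_far - self.retrieved


search = CountingSearch(make_fake_backend(["a", "b", "c"]))
search.get_hits(2)
search.get_hits(2)  # only one hit is left, so only one is counted
```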

>   search_seek()
>
>     Moves the cursor in the search to either skip hits or go back
>     to read them again (much like lseek(2)). Yes, the name is bad, I
>     know (see disclaimer above). (Perhaps search_tell() and
>     search_seek() can be replaced by a property.)

Is this method actually useful? I think it needs real good justification
since it will introduce quite some work on the search engine side to support
(correct me if I'm wrong).
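
One way around it, at least for going backwards, would be to let the client
keep the hits it has already fetched. A minimal Python sketch follows. The
ReplayableSearch wrapper and the fake backend are hypothetical, and seeking
past what has been fetched is simply rejected here.

```python
def make_fake_backend(all_hits):
    """Hypothetical stand-in for the engine's search_get_hits()."""
    remaining = list(all_hits)

    def get_hits(max_number_of_hits):
        batch = remaining[:max_number_of_hits]
        del remaining[:max_number_of_hits]
        return batch

    return get_hits


class ReplayableSearch:
    """Client-side cache so re-reading hits never touches the engine."""

    def __init__(self, get_hits):
        self._get_hits = get_hits
        self._cache = []   # every hit fetched so far, in order
        self._cursor = 0   # position of the next hit to hand out

    def seek(self, position):
        # Only positions inside the cache are supported in this sketch.
        if not 0 <= position <= len(self._cache):
            raise ValueError("can only seek within hits already fetched")
        self._cursor = position

    def get_hits(self, max_number_of_hits):
        wanted_end = self._cursor + max_number_of_hits
        missing = wanted_end - len(self._cache)
        if missing > 0:
            # Ask the engine only for what the cache cannot serve.
            self._cache.extend(self._get_hits(missing))
        batch = self._cache[self._cursor:wanted_end]
        self._cursor += len(batch)
        return batch


search = ReplayableSearch(make_fake_backend(["a", "b", "c"]))
first_read = search.get_hits(2)
search.seek(0)                   # go back to the start
second_read = search.get_hits(3)
```

The price is memory on the client side, which may or may not be acceptable
for very large result sets.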

>   search_set_property(search_handle,property_name,value)
>
>     Sets a property of the search. This might include sort order (for
>     remaining hits if some have already been retrieved), set of
>     attributes for hits and probably some other stuff.
>
>   value = search_get_property(search_handle,property_name)
>
>     Does the expected.
>

Question 1 : Will it benefit the search engine to have a Session object for
each connection? Query objects would then be spawned by a call like the one
Magnus suggests: Query = NewQuery(Session, query_string). Or is it correct
that applications don't need to care about sessions - just gimme the goddam
query! ? :-)

Question 2 : Should the results be returned with the HitsAdded signal? Or
should the signal only announce new hits, with the Query object having a
Query.GetResults method to retrieve them? The latter is closer to libbeagle
and spotlight, and the application only spends time retrieving hits when it
really wants to. It does introduce some extra method calls though...

In the http://wiki.freedesktop.org/wiki/WasabiSearchLive proposal the
session and the query object are somewhat merged (since you can change a
running query (restarting it)). I personally think it is rather elegant, but
perhaps it is really just a mess.

Cheers, let's get this ball rolling again. For the end users!
Mikkel