simple search api (was Re: mimetype standardisation by testsets)

Magnus Bergman magnus.bergman at observer.net
Thu Nov 30 17:40:34 EET 2006


On Sun, 19 Nov 2006 12:19:45 +0100
"Jos van den Oever" <jvdoever at gmail.com> wrote:

> Hi Mikkel,
> 
> Yes, the common dbus api is still something we need. I wanted to start
> on the metadata standardization first, but we can do the searching api
> in parallel. You make a good start in listing the available engines.
> There might even be more. To coordinate we need a process that lists
> the available search engines over dbus. An application should be able
> to say: I want to search using a particular interface with the
> available search engines.
> 
> The attached archive contains an effort to do two things:
> - propose a very simple, common api for search engines
> - implement such a coordinating daemon
>   The code contains the daemon, a demo search application and a python
> client to access it by finding the search engine over the
> searchmanager.

After reading everything in this thread and considering all the concerns
mentioned, I think it's time I come up with something concrete myself
instead of just criticizing others. From the requirements and suggestions
mentioned on this list I tried to come up with a proposal with these
things in mind:
* The possibility for application authors to do searches very easily.
* The possibility to do both synchronous and asynchronous searches
  without having two different APIs.
* Not ruling out the possibility to use a dbus interface directly.
* Not ruling out the possibility to have a library.
* Not causing the search engine to do unnecessary work (like repeating
  searches if the hits need to be retrieved again).

One problem that I choose to leave out (for now) is the need to be able
to stream documents which have a URL the application cannot understand
(documents which are not files). This includes files inside other files
and virtual documents that are constructed on demand. But at least I
have had this in mind, so it's not impossible to add it later.

Disclaimer: The names of the functions are not part of the proposal,
they are just chosen to illustrate what the functions do. And this
proposal does not favor a library API over a dbus interface; the
exact same idea applies to both cases. (It is also agnostic to whatever
query language is used.)

First a set of three basic functions that alone do most things and
are probably sufficient for everybody who wants a simple API:

  session_handle = session_new()

    Creates a new session and returns a new session handle. Creating a
    new session might involve finding an appropriate search engine and
    getting it ready (exactly what happens here is not important). This
    might just mean opening a dbus connection. I think it's OK if this
    call is blocking(?). Applications would probably want to call it
    during startup and it should not take that long to do whatever
    needs to be done here (which of course depends on the search
    engine).

  search_handle = search_new(session_handle,query_string)

    Starts a new search and returns a new search handle. By default
    this function blocks until the search has been performed and the
    number of hits is known (see below).

  hits = search_get_hits(search_handle,max_number_of_hits)

    Fetches a number of hits from the search. Each hit is a set of
    attributes for the hit (by default it might be URL, score and
    perhaps something else important). It can be called several times
    to retrieve more hits (much like read(2)). The hits are sorted,
    by default by their score.
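
To illustrate how the three basic calls fit together, here is a minimal
in-memory sketch in Python. Only the three function names come from the
description above. The integer handles, the toy corpus and the
word-count scoring are assumptions made up for the illustration and are
not part of the proposal.

```python
import itertools

# Hypothetical toy corpus standing in for a real search engine's index.
_DOCUMENTS = {
    "file:///home/user/notes.txt": "dbus search api notes",
    "file:///home/user/todo.txt": "buy milk",
    "file:///home/user/search.txt": "search search search",
}

_handle_ids = itertools.count(1)
_sessions = {}
_searches = {}


def session_new():
    """Create a session and return its handle (here just an integer)."""
    handle = next(_handle_ids)
    _sessions[handle] = {"searches": []}
    return handle


def search_new(session_handle, query_string):
    """Run the query synchronously and return a search handle."""
    hits = []
    for url, text in _DOCUMENTS.items():
        # Assumed scoring: how often the query word occurs in the text.
        score = text.split().count(query_string)
        if score > 0:
            hits.append({"url": url, "score": score})
    # Hits are sorted by score, best first.
    hits.sort(key=lambda hit: hit["score"], reverse=True)
    handle = next(_handle_ids)
    _searches[handle] = {"hits": hits, "cursor": 0}
    _sessions[session_handle]["searches"].append(handle)
    return handle


def search_get_hits(search_handle, max_number_of_hits):
    """Return up to max_number_of_hits further hits, much like read(2)."""
    search = _searches[search_handle]
    start = search["cursor"]
    chunk = search["hits"][start:start + max_number_of_hits]
    search["cursor"] = start + len(chunk)
    return chunk


session = session_new()
search = search_new(session, "search")
first = search_get_hits(search, 1)   # the best hit
rest = search_get_hits(search, 10)   # whatever is left
done = search_get_hits(search, 10)   # nothing left: an empty list
```

Because the hits are kept with the search handle, fetching more of them
never makes the engine repeat the query.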

For slightly less simple use there are some more functions:

  session_free(session_handle)

    Frees all resources related to the session. This includes all
    searches created from the session.

  session_set_search_finished_signal(session_handle,signal_handler)

    Sets a signal handler which is invoked when a search has
    finished. The signal handler gets the search handle back so
    different searches can be told apart. If this is set the function
    search_new() will not block.

  session_set_search_progress_signal(session_handle,signal_handler)

    Sets a signal handler which is invoked when there are new hits
    available (hits which haven't been retrieved with
    search_get_hits()). The signal handler gets the search handle,
    maybe some approximation of the percentage of progress and maybe
    the number of new hits for convenience.

  session_set_property(session_handle,property_name,value)

    Sets a property of the session. This might include default sort
    order, maximum number of hits (mostly as a hint to the search
    engine), minimum score, default set of attributes for hits in new
    searches, whether searches should live on (never considered finished
    but continue to generate new hits if new matching documents show up)
    and probably some other stuff.

  value = session_get_property(session_handle,property_name)

    Does the expected.
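
As a small illustration, session properties could be little more than a
dictionary with defaults. The property names and default values below
are invented for the example (the proposal only lists what kinds of
properties might exist), and rejecting unknown names is just one
possible choice.

```python
# Hypothetical property names and defaults; none of them are specified above.
_DEFAULT_PROPERTIES = {
    "sort_order": "score",
    "max_hits": 100,        # mostly a hint to the search engine
    "minimum_score": 0.0,
    "live": False,          # whether searches keep producing hits forever
}

_sessions = {}


def session_new():
    handle = len(_sessions) + 1
    _sessions[handle] = dict(_DEFAULT_PROPERTIES)
    return handle


def session_set_property(session_handle, property_name, value):
    properties = _sessions[session_handle]
    if property_name not in properties:
        raise KeyError(f"unknown session property: {property_name}")
    properties[property_name] = value


def session_get_property(session_handle, property_name):
    return _sessions[session_handle][property_name]


session = session_new()
session_set_property(session, "minimum_score", 0.5)
minimum_score = session_get_property(session, "minimum_score")
live = session_get_property(session, "live")
```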

  search_free(search_handle)

    Frees all resources related to the search. The search handle
    becomes invalid afterwards.

  search_is_finished(search_handle)

    Checks if the search is finished yet.

  search_get_number_of_total_hits_so_far(search_handle)

    Gets the total number of hits this search resulted in (minus the
    ones discarded because of too low score of course). If the search
    finished signal handler has been set the search might not yet be
    finished and the number of hits so far is returned.

  search_get_number_of_new_hits_so_far(search_handle)

    Identical to the one above, but minus the number of hits already
    retrieved using search_get_hits() (or skipped using
    search_seek(), see below).

  search_tell()

    Tells how many hits have been retrieved so far.

  search_seek()

    Moves the cursor in the search to either skip hits or go back
    to read them again (much like lseek(2)). Yes, the name is bad, I
    know (see disclaimer above). (Perhaps search_tell() and
    search_seek() can be replaced by a property.)
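
The cursor behaviour and the two hit counters could be sketched like
this in Python. The arguments to search_tell() and search_seek() are
not given above, so the (offset, whence) pair borrowed from lseek(2) is
an assumption, as are the helper search_from_hits() and the use of a
plain dict in place of a search handle.

```python
import os  # only for the SEEK_SET / SEEK_CUR constants


def search_from_hits(hits):
    """Hypothetical helper: a finished search holding the given hits."""
    return {"hits": list(hits), "cursor": 0}


def search_get_hits(search, max_number_of_hits):
    start = search["cursor"]
    chunk = search["hits"][start:start + max_number_of_hits]
    search["cursor"] = start + len(chunk)
    return chunk


def search_get_number_of_total_hits_so_far(search):
    return len(search["hits"])


def search_get_number_of_new_hits_so_far(search):
    # Total hits minus those already retrieved or skipped.
    return len(search["hits"]) - search["cursor"]


def search_tell(search):
    return search["cursor"]


def search_seek(search, offset, whence=os.SEEK_SET):
    if whence == os.SEEK_SET:
        target = offset
    elif whence == os.SEEK_CUR:
        target = search["cursor"] + offset
    else:
        raise ValueError("unsupported whence value")
    # Keep the cursor within the hits found so far.
    search["cursor"] = max(0, min(target, len(search["hits"])))
    return search["cursor"]


search = search_from_hits(["a", "b", "c", "d"])
first_two = search_get_hits(search, 2)          # reads "a" and "b"
position = search_tell(search)                  # two hits retrieved
new_hits = search_get_number_of_new_hits_so_far(search)
search_seek(search, 1, os.SEEK_CUR)             # skip "c"
tail = search_get_hits(search, 10)              # reads "d"
search_seek(search, 0)                          # rewind to read again
new_after_rewind = search_get_number_of_new_hits_so_far(search)
```

Rewinding only moves the cursor over hits the search already holds, so
the engine does not have to run the query again.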

  search_set_property(search_handle,property_name,value)

    Sets a property of the search. This might include sort order (for
    remaining hits if some have already been retrieved), set of
    attributes for hits and probably some other stuff.

  value = search_get_property(search_handle,property_name)

    Does the expected.


