simple search api (was Re: mimetype standardisation by testsets)

Wed Nov 22 17:00:18 EET 2006

On Sun, 19 Nov 2006 12:19:45 +0100
"Jos van den Oever" <jvdoever at gmail.com> wrote:

> Hi Mikkel,
> 
> Yes, the common dbus api is still something we need. I wanted to start
> on the metadata standardization first, but we can do the searching api
> in parallel. You make a good start in listing the available engines.
> There might even be more. To coordinate we need a process that lists
> the available search engines over dbus. An application should be able
> to say: I want to search using a particular interface with the
> available search engines.
> 
> The attached archive contains an effort to do two things:
> - propose a very simple, common api for search engines
> - implement such a coordinating daemon
>   The code contains the daemon, a demo search application and a python
> client to access it by finding the search engine over the
> searchmanager.
> 
> The proposal for the search api is _very_ simple and I call for
> application developers to see if the function calls in there are
> sufficient.
> Here I paste them for convenience:
> 
> 
>  interface org.freedesktop.search.simple
> 
> method startConfiguration ( )
> Open a graphical interface for configuring the search tool.
> 
> method countHits ( in s query , out i count )
> Count the number of instances of a file that match a particular query.
> Input:
> query
>     The query being performed.
> Output:
> count
>     The number of documents that match the query.
> 
> method query ( in s query, in i offset, in i limit , out as hits )
> Perform a query and return a list of files that match the query.
> Input:
> query
>     The query being performed.
> offset
>     The offset in the result list for the first returned result.
> limit
>     The maximum number of results that should be returned.
> 
> Output:
> hits
>     A list of filenames that are the result of the query.
> 
> method getProperties ( in as files,in a(sa(sas)) properties )
> Get properties for the given files.
> Input:
> files
>     A list of files for which properties should be returned.
> properties
>     The properties belonging to each file. Each property is a name
> associated with a list of string values. The index of each property
> map in the list corresponds to the index of the filename in the list
> of files.
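
To make the quoted calls concrete, here is a toy in-memory stand-in
for the proposed interface. It is only a sketch: the class name, the
whitespace word matching and the made-up "size" property are my own
assumptions, and I treat the properties argument of getProperties as
the returned value, shaped like the a(sa(sas)) signature.

```python
from typing import Dict, List, Tuple

# One entry per file: (filename, [(property name, [values])]),
# mirroring the a(sa(sas)) signature in the proposal.
FileProperties = Tuple[str, List[Tuple[str, List[str]]]]


class SimpleSearch:
    """Toy stand-in for org.freedesktop.search.simple (no dbus involved)."""

    def __init__(self, documents: Dict[str, str]) -> None:
        self._documents = documents  # filename -> plain text content

    def _matches(self, query: str) -> List[str]:
        # Assumption: a file matches when it contains every query word.
        words = query.lower().split()
        return sorted(
            name
            for name, text in self._documents.items()
            if all(word in text.lower().split() for word in words)
        )

    def countHits(self, query: str) -> int:
        return len(self._matches(query))

    def query(self, query: str, offset: int, limit: int) -> List[str]:
        return self._matches(query)[offset:offset + limit]

    def getProperties(self, files: List[str]) -> List[FileProperties]:
        # "size" is an illustrative property, not part of the proposal.
        return [
            (name, [("size", [str(len(self._documents[name]))])])
            for name in files
        ]
```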

I have built an in-house application which does pretty much
exactly what you describe (it doesn't yet speak dbus, only corba and
soap). Sadly I'm not allowed to release the source of this application,
but at least I can share some of my experience. (I haven't yet looked
closely at your source, so I might have misunderstood some things.)

If several search engines are available, the search manager lets the
client know of each search engine according to your proposal (right?).
I think it would be a better idea to present a list of indexes (of which
each search engine might provide several) to search in, but by default
search in all of them (if appropriate). Instead of registering the
search engine I think it's better to think in terms of creating a
session (which might still do exactly the same thing), because this
should affect all appropriate search engines transparently, and because
it might be desirable to alter some options for the session (language,
fuzziness, search contexts and such).
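
Roughly what I mean by a session, as a small Python sketch. All names
here (Index, Session, selected, language) are made up for illustration
and are not taken from your code or from my application; the matching
is a trivial word lookup.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Index:
    """One searchable index; a single engine may provide several."""
    name: str
    engine: str
    documents: Dict[str, str]  # filename -> plain text content

    def search(self, query: str) -> List[str]:
        words = query.lower().split()
        return sorted(
            name
            for name, text in self.documents.items()
            if all(word in text.lower().split() for word in words)
        )


@dataclass
class Session:
    """Holds per-client options and hides which engine owns which index."""
    indexes: List[Index]
    language: str = "en"                  # example of a session option
    selected: Optional[List[str]] = None  # None means: search every index

    def index_names(self) -> List[str]:
        return [index.name for index in self.indexes]

    def search(self, query: str) -> List[str]:
        hits: List[str] = []
        for index in self.indexes:
            if self.selected is None or index.name in self.selected:
                hits.extend(index.search(query))
        return hits
```

The client only ever sees index names and session options; which
engine sits behind an index stays an implementation detail.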

In addition to this session object I have found it suitable to also
have a search object (created from a query) because applications might
construct very complicated queries. This object can then be passed
to countHits, and used for getting the hits and also for getting the
attributes of each hit (matching document, score, language and such).
(Note that a hit is not equivalent to a document.)
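
A sketch of such a search object in Python, under my own assumptions
about the names (Hit, Search, count_hits, documents). It takes the
hits ready-made instead of running a real query, just to show that
one document can give several hits.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hit:
    """One match; several hits may point at the same document."""
    document: str
    score: float
    language: str


class Search:
    """Result of running one (possibly complicated) query."""

    def __init__(self, query: str, hits: List[Hit]) -> None:
        self.query = query
        # Best score first.
        self._hits = sorted(hits, key=lambda hit: hit.score, reverse=True)

    def count_hits(self) -> int:
        return len(self._hits)

    def hits(self, offset: int = 0, limit: int = 10) -> List[Hit]:
        return self._hits[offset:offset + limit]

    def documents(self) -> List[str]:
        # Distinct documents, in order of their best hit.
        seen: List[str] = []
        for hit in self._hits:
            if hit.document not in seen:
                seen.append(hit.document)
        return seen
```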

Daemon or no daemon, that is the question. This is a question that
without doubt will arise (it always does). First we need to clarify that
there is a difference between a daemon doing the indexing of documents
(or rather detecting new documents that need to be indexed) and a daemon
performing the search (and possibly merging several searches). Most
search engines I use don't have a daemon for doing the searches
(instead they only provide a library), because that is seldom considered
necessary. Indexes are read-only when searching, so the common problems
daemons are used to solve are not present.

My solution (which took me quite a while to develop) might seem overly
complicated at first, but I think it really isn't. It was to implement
all functionality (including caching and merging of searches) in a
library. That library can be used by an application to do everything.
Or the application can use it just to contact a daemon (which of course
also uses the very same library for everything it does). This also has
the nice side effect that daemons can be chained, so searches can span
several computers (if the library supports at least one
network-transparent communication mechanism). I think it would also be
a good idea for the library to support plugins for different search
engines/communication mechanisms. One of the plugins would be the one
using the dbus search interface. Other plugins could be made for
existing search engines like Lucene, Swish(++|E), mnoGoSearch, Xapian,
ht://Dig, Datapark, (hyper)estraier, Glimpse, Namazu, Sherlock Holmes
and all the others, which would surely be a lot easier than convincing
each of them to implement a daemon which provides a dbus interface.
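
The library/plugin/daemon layering could look something like the
Python sketch below. Everything in it is hypothetical (Backend,
SearchLibrary, DaemonBackend and so on); in particular DaemonBackend
just calls another library object directly where a real plugin would
talk dbus, corba or soap to a daemon.

```python
from typing import Dict, List, Protocol


class Backend(Protocol):
    """What every plugin has to provide."""
    def search(self, query: str) -> List[str]: ...


class InMemoryBackend:
    """Stands in for a plugin wrapping a real engine library."""

    def __init__(self, documents: Dict[str, str]) -> None:
        self._documents = documents  # filename -> plain text content

    def search(self, query: str) -> List[str]:
        words = query.lower().split()
        return sorted(
            name
            for name, text in self._documents.items()
            if all(word in text.lower().split() for word in words)
        )


class SearchLibrary:
    """Merges and caches results from all registered plugins."""

    def __init__(self) -> None:
        self._backends: List[Backend] = []
        self._cache: Dict[str, List[str]] = {}

    def register(self, backend: Backend) -> None:
        self._backends.append(backend)
        self._cache.clear()  # cached results no longer cover all backends

    def search(self, query: str) -> List[str]:
        if query not in self._cache:
            merged: List[str] = []
            for backend in self._backends:
                for name in backend.search(query):
                    if name not in merged:
                        merged.append(name)
            self._cache[query] = merged
        return list(self._cache[query])


class DaemonBackend:
    """Plugin that forwards to a daemon, which itself uses SearchLibrary.

    Assumption: the transport is replaced by a direct method call.
    """

    def __init__(self, daemon_library: SearchLibrary) -> None:
        self._daemon = daemon_library

    def search(self, query: str) -> List[str]:
        return self._daemon.search(query)
```

Since the daemon side is just another SearchLibrary, registering a
DaemonBackend in it chains the search on to the next machine.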

One thing that English users seldom consider is the use of several
languages. Which language is being used is important to know in order
to decide what stemming rules and which stop-words to use (in
English "the" is a stop-word, while in Swedish it means tea and is
something that is reasonable to search for). People using other
languages are very often multilingual (using English as well).
Therefore it is useful to know which language the query is in (search
engines might also be able to translate queries to search in documents
written in different languages).
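
A tiny Python illustration of the stop-word point. The word lists are
short examples I made up for this mail, not real stop-word lists, and
query_terms is a hypothetical helper.

```python
from typing import Dict, List, Set

# Illustrative, deliberately tiny stop-word lists keyed by language code.
STOP_WORDS: Dict[str, Set[str]] = {
    "en": {"the", "a", "of", "and"},
    "sv": {"och", "en", "att", "det"},
}


def query_terms(query: str, language: str) -> List[str]:
    """Drop the stop-words of the given language from a query."""
    stop_words = STOP_WORDS.get(language, set())
    return [word for word in query.lower().split() if word not in stop_words]
```

With these lists, query_terms("the", "en") gives an empty list while
query_terms("the", "sv") keeps the word, which is exactly why the
language needs to travel with the query.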