simple search api (was Re: mimetype standardisation by testsets)

Thu Nov 23 17:26:31 EET 2006

2006/11/22, Magnus Bergman <magnus.bergman at observer.net>:

> I have constructed an in-house application which does pretty much
> exactly what you describe (it doesn't yet speak dbus, but corba and
> soap). Sadly I'm not allowed to release the source of this application,
> but at least I can share some of my experience. (I haven't yet looked
> closely at your source, so I might have misunderstood some things)

Great! To paraphrase Linus: "Given enough eyeballs, all <strike>bugs</strike>
specs are shallow"  :-)

> If several search engines are available, the search manager lets the
> client know of each search engine according to your proposal (right?).
> I think it would be a better idea to present a list of indexes (of which
> each search engine might provide several) to search in, but by default
> search in all of them (if appropriate). I

Well, the search engines are not obliged to use a particular index format.
The indexes themselves can be of any format.

> Instead of registering the
> search engine I think it's better to think in terms of creating a
> session (which might still do exactly the same thing). Because this
> should affect all appropriate search engines transparently. And because
> it might be desired to alter some options for the session (language,
> fuzziness, search contexts and such).

So you have a search-manager-daemon or something that holds a session object
with user info; do I understand correctly?

> In addition to this session object I have found it suitable to also
> have a search object (created from a query) because applications might
> construct very complicated queries. This object can then be passed
> to countHits, and used for getting the hits. And also for getting
> attributes of the hit (matching document, score, language and such).
> (Note that a hit is not equivalent to a document.)

The problem with creating query objects like this is that we are creating a
dbus api. Essentially you only have simple data types at hand. No objects,
and especially no objects with methods on them :-) It would be possible
to create a helper lib in <insert favorite language + toolkit> to construct
queries conforming to the wasabi spec, but this would require separate libs
for gobject and qt. While this is by no means ruled out, I think we had
better focus on the "bare" dbus api for now.
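
As a rough illustration of that constraint, here is a minimal sketch of how a
query could be flattened into the kind of simple types dbus can carry (a
string plus a string-keyed dictionary of basic values). The function name
build_query and the option keys are made up for this example; they are not
part of any wasabi draft.

```python
# Hypothetical sketch: flatten a query into simple, dbus-friendly types.
# build_query and the option keys used below are illustrative assumptions.

ALLOWED_TYPES = (str, int, float, bool)


def build_query(text, **options):
    """Return (query_string, options_dict) built only from basic types."""
    if not isinstance(text, str) or not text.strip():
        raise ValueError("query text must be a non-empty string")
    flat = {}
    for key, value in options.items():
        # Reject anything that is not a plain value (no objects, no methods).
        if not isinstance(value, ALLOWED_TYPES):
            raise TypeError(f"option {key!r} is not a simple type")
        flat[key] = value
    return text.strip(), flat


query, opts = build_query("  hello world ", language="en", max_hits=10)
```

A helper lib for a given toolkit could wrap something like this, but the
values that cross the bus would still be the plain string and dictionary.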

> Daemon or no daemon, that is the question. This is a question that
> without doubt will arise (it always does). First we need to clarify that
> there is a difference between a daemon doing the indexing of documents
> (or rather detecting new documents needing to be indexed) and a daemon
> performing the search (and possibly merging several searches). Most
> search engines I use don't have a daemon for doing the searches
> (instead they only provide a library), because that is seldom considered
> required. Indexes are read only (when searching) so the common problems
> daemons are used to solve are not present.

The situation at hand is that we have a handful of desktop search engines,
all implemented as daemons, handling both searches and indexing. Having an
extra daemon on top of that, handling the query one extra time before passing
it to the search subsystem, seems overkill... Ideally I see the daemon/lib
(or even executable) being used only as a means of obtaining a dbus object
path given a dbus interface name ("org.freedesktop.search.simple").
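
To make that lookup idea concrete, here is a minimal sketch of a resolver
that maps an interface name to a (bus name, object path) pair. Only the
interface name comes from this thread; the registry functions, the bus name
and the object path are invented placeholders, and a real implementation
would talk to the bus rather than to an in-memory dictionary.

```python
# Hypothetical sketch: resolve an interface name to (bus_name, object_path).
# The registry, bus name and object path are placeholders for illustration.

SIMPLE_SEARCH_IFACE = "org.freedesktop.search.simple"

_providers = {}


def register_provider(interface, bus_name, object_path):
    """Record that bus_name exposes interface at object_path."""
    _providers.setdefault(interface, []).append((bus_name, object_path))


def lookup_provider(interface):
    """Return the first registered (bus_name, object_path), or None."""
    candidates = _providers.get(interface)
    return candidates[0] if candidates else None


register_provider(
    SIMPLE_SEARCH_IFACE, "org.example.EngineA", "/org/example/EngineA/Search"
)
found = lookup_provider(SIMPLE_SEARCH_IFACE)
```

The point is that the helper does nothing but the lookup; the query itself
goes straight to the search engine that owns the object path.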

As you point out, having a separate daemon other than the indexer is not
exactly standard (at least not to my knowledge). Also, a managing daemon is
likely to re-invent functionality dbus already provides IMHO.

> My solution (which took me quite a while to develop) might seem overly
> complicated at first, but I think it really isn't. It was to implement
> all functionality (including caching and merging of searches) in a
> library. That library can be used by an application to do everything.
> Or the application can use it just to contact a daemon (which of course
> also uses the very same library for everything it does). This also has
> the nice side effect that daemons can be chained, so searches can span
> over several computers (if it supports at least one network transparent
> communication mechanism). I think it would also be a good idea for the
> library to support plugins for different search engines/communication
> mechanisms.

This is exactly what Wasabi aims to fix: standardize apis across search
engine implementations. What functionality should go on top of this, in the
form of helper libraries/daemons, should probably be punted for now (until
the api spec is set in stone at least).

> One of the plugins is the one using the dbus search
> interface. Other plugins could be made for existing search engines like
> Lucene, Swish(++|E), mnoGoSearch, Xapian, ht://Dig, Datapark,
> (hyper)estraier, Glimpse, Namazu, Sherlock Holmes and all the others.
> Which would surely be a lot easier than convincing each of them to
> implement a daemon which provides a dbus interface.

Well, what you are suggesting sounds like the opposite of the current goal.
If I understand correctly, you suggest creating a wrapper lib for each
possible search backend, as opposed to the current idea, which is to promote
a shared dbus interface. I see it this way: Dbus is the de facto standard for
desktop ipc. It is actually really easy (and portable between toolkits) to
expose a dbus api. Implementing a backend for each (custom) communications
api sounds like a great deal more work, with the possibility of more bugs...

It's only guesswork, but I would bet that it is hard work maintaining a
cross-platform library doing all this. If we restricted ourselves to one
platform it would be a different matter.

> One thing that English users seldom consider is the usage of several
> languages. Which language is being used is important to know in order
> to decide what stemming rules to use, and which stop-words to use (in
> English "the" is a stop-word while in Swedish it means tea and is
> something that is adequate to search for). People using other languages
> are very often multilingual (using English as well). Therefore it is
> interesting to know which language the query is in (search engines
> might also be able to translate queries to search in documents written
> in different languages).
>

This is a good point. However, I suggest leaving this up to the actual
implementations. After all, which stemmer to use is an index-time question,
decided when a document is indexed...
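
A small sketch of what I mean by index time: the indexer picks the stop-word
list from the language of the document, so the api itself does not need to
know about it. The function name index_terms and the tiny stop-word sets are
invented for illustration and are not real linguistic resources; the "the"
case follows Magnus' example above.

```python
# Hypothetical sketch: the indexer chooses stop-words per document language.
# The stop-word sets are deliberately tiny and purely illustrative.

STOP_WORDS = {
    "en": {"the", "a", "of"},
    "sv": {"och", "en", "att"},
}


def index_terms(text, language):
    """Return the lowercased terms kept for indexing in the given language."""
    stop = STOP_WORDS.get(language, set())
    return [word for word in text.lower().split() if word not in stop]


english = index_terms("The price of the tea", "en")
swedish = index_terms("en kopp the", "sv")
```

With these lists "the" is dropped from the English document but kept as a
searchable term in the Swedish one.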

Sorry for the late reply. I have been rather busy lately.

Cheers,
Mikkel