simple search api (was Re: mimetype standardisation by testsets)

Fri Nov 24 14:55:25 EET 2006

On Thu, 23 Nov 2006 16:26:31 +0100
"Mikkel Kamstrup Erlandsen" <mikkel.kamstrup at gmail.com> wrote:

> 2006/11/22, Magnus Bergman <magnus.bergman at observer.net>:
> 
> > I have constructed an in-house application which does pretty much
> > exactly what you describe (it doesn't yet speak dbus, but corba and
> > soap). Sadly I'm not allowed to release the source of this
> > application, but at least I can share some of my experience. (I
> > haven't yet looked closely at your source, so I might have
> > misunderstood some things)
> 
> 
> Great! To paraphrase Linus "Given enough eyeballs all
> <strike>bugs</strike> specs are shallow"  :-)
> 
> 
> > If several search engines are available, the search manager lets the
> > client know of each search engine according to your proposal
> > (right?). I think it would be a better idea to present a list of
> > indexes (of which each search engine might provide several) to
> > search in, but by default search in all of them (if appropriate). I
> 
> Well, the search engines are not obliged to use a particular index
> format. The indexes themselves can be of any format.

By "index" I mean an abstract reference to something considered an
index by the backend. The consequence is that the user (or rather the
application) only sees the indexes, not the engines that provide them
(because that is not very important).
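
A minimal sketch of that idea, in Python only for brevity. Every name
here (Index, SearchManager, register, resolve) is invented for
illustration and is not taken from any existing API; the point is only
that the application lists and selects indexes while the providing
engine stays an internal detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Index:
    """Abstract reference to something a backend considers an index."""
    name: str    # what the application sees
    engine: str  # internal detail: which engine provides the index


class SearchManager:
    """Hypothetical manager that exposes indexes rather than engines."""

    def __init__(self):
        self._indexes = {}

    def register(self, engine, index_names):
        # One engine may provide several indexes.
        for name in index_names:
            self._indexes[name] = Index(name, engine)

    def list_indexes(self):
        # Only index names are handed to the application.
        return sorted(self._indexes)

    def resolve(self, names=None):
        # By default a search covers every known index.
        chosen = self.list_indexes() if names is None else names
        return [self._indexes[name] for name in chosen]


manager = SearchManager()
manager.register("engine_a", ["mail", "documents"])
manager.register("engine_b", ["web_history"])
print(manager.list_indexes())
```

The application never has to ask which engine is behind "mail"; it
either names the indexes it wants or takes the default of all of them.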

> > Instead of registering the
> > search engine I think it's better to think in terms of creating a
> > session (which might still do exactly the same thing). Because this
> > should affect all appropriate search engines transparently. And
> > because it might be desired to alter some options for the session
> > (language, fuzziness, search contexts and such).
> 
> So you have a search-manager-daemon or something that holds a session
> object with user info; do I understand correctly?

It is handled by a library, which can be used by a daemon. So you can
choose whether to use the library directly or talk to the daemon.

> > In addition to this session object I have found it suitable to also
> > have a search object (created from a query) because applications
> > might construct very complicated queries. This object can then be
> > passed to countHits, and used for getting the hits. And also for
> > getting attributes of the hit (matching document, score, language
> > and such). (Note that a hit is not equivalent to a document.)
> 
> The problem with creating query objects like this, is that we are
> creating a dbus api. Essentially you only have simple data types at
> your hand. No objects - especially objects with methods on them :-)
> It would be possible to create a helper lib in <insert favorite
> language + toolkit> to construct queries conforming to the wasabi
> spec, but this would require separate libs for gobject and qt. While
> this is by no means ruled out, I think we better focus on the "bare"
> dbus api for now.

I must admit that I have no experience of the dbus API. But it has to
be possible to get some type of session handle (perhaps a unique number)
back, right? Perhaps I'm causing confusion by using the word "object", but
for me it's more a way of thinking than a matter of using object-oriented
languages (I'm using plain C when dealing with my "objects" anyway).

What's this wasabi spec? Could you please direct me to it?

> > Daemon or no daemon, that is the question. This is a question that
> > without doubt will arise (it always does). First we need to clarify
> > that there is a difference between a daemon doing the indexing of
> > documents (or rather detecting new documents that need to be indexed)
> > and a daemon performing the search (and possibly merging several
> > searches). Most search engines I use don't have a daemon for doing
> > the searches (instead they only provide a library), because that is
> > seldom considered required. Indexes are read only (when searching)
> > so the common problems daemons are used to solve are not present.
> 
> The situation at hand is that we have a handful of desktop search
> engines, all implemented as daemons, both handling searches and
> indexing. Having an extra daemon on top of that handling the query
> one extra time before passing it to the search subsystem seems
> overkill... Ideally I see the daemon/lib (or even executable) to only
> be used as a means of obtaining a dbus object path given a dbus
> interface name ("org.freedesktop.search.simple").

I have some experience of search engines in general (and I have no idea
in what way a "desktop search engine" is different). And to my
knowledge the majority do not have a daemon performing the searches,
but rather a library. They might have a daemon doing the indexing (and
detecting new documents), but that's not the same thing.

Has anyone thought about having a general purpose naming service based
on dbus and avahi (like CORBA's CosNaming)? Or is there already something
like that, that I have missed?

> As you point out, having a separate daemon other than the indexer, is
> not exactly standard (at least not to my knowledge). Also a managing
> daemon is likely to re-invent functionality dbus already provides
> IMHO.

That might very well be the case. As I mentioned, I have nevertheless
implemented an (in-house) daemon doing this. But its main use is to
cache searches made through a web interface (which has to be stateless
since there are several web servers sharing the load).

But is your idea not to use dbus at all then (except for finding the
search services), but a library instead?

> > My solution (which took me quite a while to develop) might seem 
> > overly complicated at first, but I think it really isn't. It was to
> > implement all functionality (including caching and merging of
> > searches) in a library. That library can be used by an application
> > to do everything. Or the application can use it just to contact a
> > daemon (which of course also uses the very same library for
> > everything it does). This also has the nice side effect that
> > daemons can be chained, so searches can span over several computers
> > (if it supports at least one network transparent communication
> > mechanism). I think it would also be a good idea for the library to
> > support plugins for different search engines/communication
> > mechanisms.
> 
> This is exactly what Wasabi aims to fix. Standardize apis across
> search engine implementations. What functionality should be on top of
> this - in form of helper libraries/daemons should probably be punted
> for now (until the api spec is set in stone at least).

I very much doubt that many search engine implementations will adapt to
such a standard. Or is the idea to only use the search engines that
do comply? I think that would be a waste, and possibly a problem in
the long run. For example, I'm sure that the Russian search engines
are superior when it comes to Russian stemming, and that Japanese
search engines are better suited for Japanese users. Implementing
support for those might not be at the top of the list right now, but I
think it's important to keep an open design. The spell checker enchant
is a good example of such a design.

But sure, specifying an API is a good thing. And I'm sure the same spec
can easily be translated into a dbus interface (and perhaps a plugin
API) as well.

> > One of the plugins is the one using the dbus search
> > interface. Other plugins could be made for existing search engines
> > like Lucene, Swish(++|E), mnoGoSearch, Xapian, ht://Dig, Datapark,
> > (hyper)estraier, Glimpse, Namazu, Sherlock Holmes and all the
> > others. Which would surely be a lot easier than convincing each of
> > them to implement a daemon which provides a dbus interface.
> 
> Well, what you are suggesting sounds like the opposite of the current
> goal. If I understand right you suggest creating a wrapper lib for
> each possible search backend, as opposed to the current idea - to
> promote a shared dbus interface. I see it this way: Dbus is the de
> facto standard for desktop ipc. It is actually really easy (and
> portable between toolkits) to expose a dbus api. Implementing a
> backend for each (custom) communications api sounds like a great deal
> more work, with possibility for more bugs...

Yes, I'm suggesting a design that makes it possible to create a wrapper
lib for each possible search backend. But certainly not *opposed* to
the current idea, rather *in addition* to it. If there is a possibility
to load plugins, the dbus communication can be done with one of them.
Other plugins doing things differently can exist in parallel (and
don't have to be implemented right now either).
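
A small sketch of that layout, again in Python for brevity. The Backend
interface and both plugin classes are hypothetical; the dbus plugin is a
placeholder that performs no real IPC, and the in-memory plugin only
stands in for a wrapper around some existing engine.

```python
class Backend:
    """Hypothetical plugin interface: one method, simple data types."""
    name = "abstract"

    def search(self, query):
        """Return a list of (uri, score) tuples."""
        raise NotImplementedError


class DBusBackend(Backend):
    """Placeholder for the plugin that would speak the shared dbus API."""
    name = "dbus"

    def search(self, query):
        # A real plugin would forward the query over dbus; this stub
        # returns no hits so the sketch stays self-contained.
        return []


class InMemoryBackend(Backend):
    """Stand-in for a wrapper around a library-only search engine."""
    name = "memory"

    def __init__(self, documents):
        self._documents = documents  # uri -> text

    def search(self, query):
        word = query.lower()
        hits = []
        for uri, text in self._documents.items():
            count = text.lower().split().count(word)
            if count:
                hits.append((uri, float(count)))
        return hits


def search_all(backends, query):
    """Query every loaded plugin and merge the hits by score."""
    hits = []
    for backend in backends:
        hits.extend(backend.search(query))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


plugins = [DBusBackend(), InMemoryBackend({"note.txt": "tea and more tea"})]
print(search_all(plugins, "tea"))
```

The caller of search_all does not know or care which plugin produced a
hit, so further plugins can be added later without touching it.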

> It's only guesswork, but I will bet that it is hard work maintaining a
> cross platform library doing all this. If we restricted ourselves to
> one platform it would be another deal.

You mean a cross-platform library doing plugin loading? (Other than that,
I don't believe my suggestion is more complicated in any way.) I think
you're right that plugin loading is tricky to port to some platforms.
But I guess (or at least hope) that those issues are taken care of by
gmodule. I only have experience maintaining applications for Linux,
IRIX and Solaris, and those systems are very similar.

> > One thing that English users seldom consider is the usage of several
> > languages. Which language is being used is important to know in
> > order to decide what stemming rules to use, and which stop-words
> > to use (in English "the" is a stop-word while in Swedish it means
> > tea and is something that is adequate to search for). People using
> > other languages are very often multilingual (using English as
> > well). Therefore it is interesting to know which language the query
> > is in (search engines might also be able to translate queries to
> > search in documents written in different languages).
> 
> This is a good point. However I suggest leaving this up to the actual
> implementations. After all it is an indexing time question what
> stemmer to use when indexing a document...

Yes, absolutely. But for them to do their job they might need to know
what language the query is supposed to be in. This might be supported in
the query language (or perhaps as an argument to the function creating
a search from the query).
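
To illustrate the second option, here is a minimal sketch in which the
language is an argument to the function that turns a query into search
terms. The function name parse_query and the tiny stop-word lists are
assumptions made up for this example; real engines use much larger
lists and also apply stemming. The example reuses the "the" case from
the quoted text above.

```python
# Tiny illustrative stop-word lists, not taken from any real engine.
STOP_WORDS = {
    "en": {"the", "a", "and", "of"},
    "sv": {"och", "en", "ett", "att"},
}


def parse_query(query, language):
    """Split a query into terms, dropping stop-words for `language`."""
    stop_words = STOP_WORDS.get(language, set())
    return [word for word in query.lower().split() if word not in stop_words]


# Same word, different outcome depending on the declared language.
print(parse_query("the coffee", "en"))
print(parse_query("the och kaffe", "sv"))
```

Without the language argument the backend would have to guess, and a
guess of English would silently discard the word the Swedish user was
actually searching for.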