simple search api (was Re: mimetype standardisation by testsets)

Sat Nov 25 22:49:22 EET 2006

2006/11/24, Magnus Bergman <magnus.bergman at observer.net>:
>
> On Thu, 23 Nov 2006 16:26:31 +0100
> "Mikkel Kamstrup Erlandsen" <mikkel.kamstrup at gmail.com> wrote:
>
> > 2006/11/22, Magnus Bergman <magnus.bergman at observer.net>:
> >
> > > I have constructed an in-house application which does pretty much
> > > exactly what you describe (it doesn't yet speak dbus, but corba and
> > > soap). Sadly I'm not allowed to release the source of this
> > > application, but at least I can share some of my experience. (I
> > > haven't yet looked closely at your source, so I might have
> > > misunderstood some things)
> >
> >
> > Great! To paraphrase Linus "Given enough eyeballs all
> > <strike>bugs</strike> specs are shallow"  :-)
> >
> >
> > > If several search engines are available, the search manager lets the
> > > client know of each search engine according to your proposal
> > > (right?). I think it would be a better idea to present a list of
> > > indexes (of which each search engine might provide several) to
> > > search in, but by default search in all of them (if appropriate). I
> >
> > Well, the search engines are not obliged to use a particular index
> > format. The indexes themselves can be of any format.
>
> With "index" I mean an abstract reference to something considered an
> index by the backend. With the consequence that the user (or rather the
> application) only sees the indexes, not the engines that provide them
> (because that is not very important).

Ok, I'm with you now :-) It is the same as the "group" switch of the
current draft on the wiki. For example, searching for
"magnus group:contacts" searches only through the contacts "index". I'm
very strongly in favor of this, although some have spoken for putting this
"grouping" functionality on the client side. An example of client side
grouping could be a music application, where searching for "foo fighters"
would add "mime:audio/*" to the query before sending it. As I said, I'm not
for client side grouping, but server side grouping could still facilitate
client side grouping anyway.
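
To make the two variants concrete, here is a minimal sketch in Python. The
helper names are made up for illustration; only the "group:" and "mime:"
switches come from the examples above, and the exact query syntax is
whatever the wiki draft ends up saying.

```python
def with_group(query: str, group: str) -> str:
    """Server-side grouping: ask the engine to search a single group/index."""
    return f"{query} group:{group}"


def with_mime_filter(query: str, mime_pattern: str) -> str:
    """Client-side grouping: the app appends a mime filter before sending."""
    return f"{query} mime:{mime_pattern}"


# Hypothetical helpers; they only build the query strings used as examples.
print(with_group("magnus", "contacts"))
print(with_mime_filter("foo fighters", "audio/*"))
```

Either way the engine just receives a query string, which is why the two
approaches do not exclude each other.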

> > > Instead of registering the
> > > search engine I think it's better to think in terms of creating a
> > > session (which might still do exactly the same thing). Because this
> > > should affect all appropriate search engines transparently. And
> > > because it might be desired to alter some options for the session
> > > (language, fuzziness, search contexts and such).
> >
> > So you have a search-manager-daemon or something that holds a session
> > object with user info; do I understand correctly?
>
> It is handled by a library, which can be used by a daemon. So you have
> a choice if you want to use the library directly or talk to the daemon.
>
> > > In addition to this session object I have found it suitable to also
> > > have a search object (created from a query) because applications
> > > might construct very complicated queries. This object can then be
> > > passed to countHits, and used for getting the hits. And also for
> > > getting attributes of the hit (matching document, score, language
> > > and such). (Note that a hit is not equivalent to a document.)
> >
> > The problem with creating query objects like this, is that we are
> > creating a dbus api. Essentially you only have simple data types at
> > your hand. No objects - especially objects with methods on them :-)
> > It would be possible to create a helper lib in <insert favorite
> > language + toolkit> to construct queries conforming to the wasabi
> > spec, but this would require separate libs for gobject and qt. While
> > this is by no means ruled out, I think we better focus on the "bare"
> > dbus api for now.
>
> I must admit that I have no experience of the dbus API. But it has to
> be possible to get some type of session handle (perhaps a unique number)
> back, right? Perhaps I'm being confusing when using the word "object", but
> for me it's more a way of thinking, not about using object oriented
> languages (I'm using plain C when dealing with my "objects" anyway).
>
> What's this wasabi spec? Could you please direct me to it?
>
> > > Daemon or no daemon, that is the question. This is a question that
> > > without doubt will arise (it always does). First we need to clarify
> > > that there is a difference between a daemon doing the indexing of
> > > documents (or rather detecting new documents needing to be indexed)
> > > and a daemon performing the search (and possibly merging several
> > > searches). Most search engines I use don't have a daemon for doing
> > > the searches (instead they only provide a library), because that is
> > > seldom considered required. Indexes are read only (when searching)
> > > so the common problems daemons are used to solve are not present.
> >
> > The situation at hand is that we have a  handful of desktop search
> > engines, all implemented as daemons, both handling searches and
> > indexing. Having an extra daemon on top of that handling the query
> > one extra time before passing it to the search subsystem seems
> > overkill... Ideally I see the daemon/lib (or even executable) to only
> > be used as a means of obtaining a dbus object path given a dbus
> > interface name ("org.freedesktop.search.simple").
>
> I have some experience of search engines in general (and I have no idea
> in what way a "desktop search engine" is different). And to my
> knowledge the majority does not have a daemon performing the searches,
> rather a library. They might have a daemon doing the indexing (and
> detecting new documents), but that's not the same thing.

Well, having a lib with no associated daemon for searching is possible, but
there's still response time to consider. I don't know about other indexers,
but I would hate to create a new Lucene IndexSearcher for each app that wants
to do searches; this is a costly affair timewise and memorywise. A daemon
holding a singleton IndexSearcher (or a managed pool) can be more resource
friendly here.
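
A rough sketch of the idea in Python, not tied to Lucene: ExpensiveSearcher
and SearchDaemon are hypothetical stand-ins, and the counter only exists to
show that the costly object is opened once no matter how many queries
arrive.

```python
import threading


class ExpensiveSearcher:
    """Stand-in for something costly to open, e.g. a Lucene IndexSearcher."""

    instances_created = 0

    def __init__(self, index_path: str) -> None:
        ExpensiveSearcher.instances_created += 1
        self.index_path = index_path

    def search(self, query: str) -> list:
        # A real searcher would consult the index; this one just echoes.
        return [f"hit for {query!r} in {self.index_path}"]


class SearchDaemon:
    """Holds one shared searcher and uses it for every client request."""

    def __init__(self, index_path: str) -> None:
        self._index_path = index_path
        self._searcher = None
        self._lock = threading.Lock()

    def _get_searcher(self) -> ExpensiveSearcher:
        # Open the searcher lazily, and only once, even with many clients.
        with self._lock:
            if self._searcher is None:
                self._searcher = ExpensiveSearcher(self._index_path)
            return self._searcher

    def search(self, query: str) -> list:
        return self._get_searcher().search(query)


# The index path is an arbitrary example value.
daemon = SearchDaemon("/tmp/example-index")
for app_query in ("magnus", "foo fighters", "wasabi"):
    daemon.search(app_query)
print(ExpensiveSearcher.instances_created)
```

With a library-only design each application would construct its own
searcher instead, so the cost is paid once per app rather than once per
machine.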

> Has anyone thought about having a general purpose naming service based
> on dbus and avahi (like CORBA's CosNaming)? Or is there already something
> like that, that I have missed?

I believe you are asking about dbus activation?
http://raphael.slinckx.net/blog/documents/dbus-tutorial/
I don't know what CosNaming is about...

> > As you point out, having a separate daemon other than the indexer, is
> > not exactly standard (at least not to my knowledge). Also a managing
> > daemon is likely to re-invent functionality dbus already provides
> > IMHO.
>
> That might very well be the case. As I mentioned, I have though
> implemented an (in-house) daemon doing this. But its use is mainly to
> cache searches made through a web-interface (which has to be stateless
> since there are several web-servers sharing the load).
>
> But is your idea not to use dbus at all then (except for finding the
> search services), but a library instead?

The idea is to have the indexer/search engine expose the wasabi api over
dbus.

> > > My solution (which took me quite a while to develop) might seem
> > > overly complicated at first, but I think it really isn't. It was to
> > > implement all functionality (including caching and merging of
> > > searches) in a library. That library can be used by an application
> > > to do everything. Or the application can use it just to contact a
> > > daemon (which of course also uses the very same library for
> > > everything it does). This also has the nice side effect that
> > > daemons can be chained, so searches can span over several computers
> > > (if it supports at least one network transparent communication
> > > mechanism). I think it would also be a good idea for the library to
> > > support plugins for different search engines/communication
> > > mechanisms.
> >
> > This is exactly what Wasabi aims to fix. Standardize apis across
> > search engine implementations. What functionality should be on top of
> > this - in form of helper libraries/daemons should probably be punted
> > for now (until the api spec is set in stone at least).
>
> I very much doubt that many search engine implementations will adapt to
> such a standard. Or is the idea to only use the search engines that
> do comply? I think that would be a waste, and possibly a problem in
> the long run. For example, I'm sure that the Russian search engines
> are superior when it comes to Russian stemming, and that Japanese
> search engines are better suited for Japanese users. Implementing
> support for those might not be at the top of the list right now, but I
> think it's important to keep an open design. The spell checker enchant
> is a good example.
>
> But sure, specifying an API is a good thing. And I'm sure the same spec
> can easily be translated into a dbus interface (and perhaps a plugin
> API) as well.
>
> > > One of the plugins is the one using the dbus search
> > > interface. Other plugins could be made for existing search engines
> > > like Lucene, Swish(++|E), mnoGoSearch, Xapian, ht://Dig, Datapark,
> > > (hyper)estraier, Glimpse, Namazu, Sherlock Holmes and all the
> > > others. Which would surely be a lot easier than convincing each of
> > > them to implement a daemon which provides a dbus interface.
> >
> > Well, what you are suggesting sounds like the opposite of the current
> > goal. If I understand right you suggest creating a wrapper lib for
> > each possible search backend, as opposed to the current idea - to
> > promote a shared dbus interface. I see it this way: Dbus is the de
> > facto standard for desktop ipc. It is actually really easy (and
> > portable between toolkits) to expose a dbus api. Implementing a
> > backend for each (custom) communications api sounds like a great deal
> > more work, with possibility for more bugs...
>
> Yes, I'm suggesting a design that makes it possible to create a wrapper
> lib for each possible search backend. But certainly not *opposed* to
> the current idea, rather *in addition* to it. If there is a possibility
> to load plugins, the dbus communication can be done with one of them.
> Other plugins doing things differently can exist in parallel (and
> don't have to be implemented right now either).

I'm slowly beginning to see the light. This will require a lot more code for
Wasabi than I was hoping for though - however I do like the idea very much.
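
Just to check that I understand the plugin idea, here is a minimal sketch in
Python. Everything in it is hypothetical (register_backend, search_all and
the two placeholder backends are names I made up); it only shows the dbus
interface living next to other backends behind one entry point.

```python
from typing import Callable, Dict, List

# A backend plugin is modelled as a function from a query string to hits.
Backend = Callable[[str], List[str]]

_backends: Dict[str, Backend] = {}


def register_backend(name: str, backend: Backend) -> None:
    """Make a backend available under the given name."""
    _backends[name] = backend


def search_all(query: str) -> List[str]:
    """Send the query to every registered backend and concatenate the hits."""
    hits: List[str] = []
    for name in sorted(_backends):
        hits.extend(f"{name}: {hit}" for hit in _backends[name](query))
    return hits


def dbus_backend(query: str) -> List[str]:
    # Placeholder: a real plugin would forward the query over dbus.
    return [f"dbus result for {query}"]


def library_backend(query: str) -> List[str]:
    # Placeholder: a real plugin would call a search library directly.
    return [f"library result for {query}"]


register_backend("dbus", dbus_backend)
register_backend("library", library_backend)
print(search_all("magnus"))
```

Real merging would of course have to deal with scores and duplicates; the
sketch simply concatenates the results.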

> > It's only guesswork, but I will bet that it is hard work maintaining a
> > cross platform library doing all this. If we restricted ourselves to one
> > platform it would be another deal.
>
> You mean a cross-platform library doing plugin-loading (other than that
> I don't believe my suggestion is more complicated in any way). I think
> you're right that plugin loading is tricky to port to some platforms.
> But I guess (or at least hope) that those issues are taken care of by
> gmodule. I only have experience maintaining applications for linux,
> irix and solaris. And those systems are very similar.

gmodule would make us depend on glib, which might very well pose a problem
to the KDE/Qt guys among us... We could use libltdl, which is a GNU portable
dlopen wrapper that is pretty standard. See for example:
http://www.delorie.com/gnu/docs/libtool/libtool_46.html

> > > One thing that English users seldom consider is the usage of several
> > > languages. Which language is being used is important to know in
> > > order to decide what stemming rules to use, and which stop-words to
> > > use (in English "the" is a stop-word while in Swedish it means tea
> > > and is something that is adequate to search for). People using
> > > other languages are very often multi lingual (using English as
> > > well). Therefore it is interesting to know which language the query
> > > is in (search engines might also be able to translate queries to
> > > search in documents written in different languages).
> >
> > This is a good point. However I suggest leaving this up to the actual
> > implementations. After all it is an indexing time question what
> > stemmer to use when indexing a document...
>
> Yes, absolutely. But for them to do their job they might need to know
> what language the query is supposed to be in. This might be supported in
> the query language (or perhaps as an argument to the function creating
> a search from the query).
>

Well, I think the search engine could assume that the query was in the
current locale unless explicitly told otherwise through some query option?
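
Something along these lines, sketched in Python. The "lang:" switch is only
an assumption for the sake of the example (it is not in the current draft),
and the caller passes in the locale language so the sketch does not depend
on the machine it runs on.

```python
from typing import Tuple


def split_language(query: str, locale_language: str) -> Tuple[str, str]:
    """Return (language, remaining query).

    A hypothetical "lang:xx" token in the query wins; otherwise fall back
    to the language of the current locale, supplied by the caller.
    """
    language = locale_language
    terms = []
    for token in query.split():
        if token.startswith("lang:") and len(token) > len("lang:"):
            language = token[len("lang:"):]
        else:
            terms.append(token)
    return language, " ".join(terms)


print(split_language("the", "en"))
print(split_language("the lang:sv", "en"))
```

The engine could then pick stemming rules and stop-words based on the
returned language, so "the" is only dropped as a stop-word for English.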

Cheers,
Mikkel