simple search api (was Re: mimetype standardisation by testsets)

Mon Nov 27 17:07:38 EET 2006

On Sat, 25 Nov 2006 21:49:22 +0100
"Mikkel Kamstrup Erlandsen" <mikkel.kamstrup at gmail.com> wrote:

> 2006/11/24, Magnus Bergman <magnus.bergman at observer.net>:
> >
> > On Thu, 23 Nov 2006 16:26:31 +0100
> > "Mikkel Kamstrup Erlandsen" <mikkel.kamstrup at gmail.com> wrote:
> >
> > > 2006/11/22, Magnus Bergman <magnus.bergman at observer.net>:
> > >
> > > > I have constructed an in-house application which does pretty
> > > > much exactly what you describe (it doesn't yet speak dbus, but
> > > > corba and soap). Sadly I'm not allowed to release the source of
> > > > this application, but at least I can share some of my
> > > > experience. (I haven't yet looked closely at your source, so I
> > > > might have misunderstood some things)
> > >
> > >
> > > Great! To paraphrase Linus "Given enough eyeballs all
> > > <strike>bugs</strike> specs are shallow"  :-)
> > >
> > >
> > > > If several search engines are available, the search manager lets
> > > > the client know of each search engine according to your proposal
> > > > (right?). I think it would be a better idea to present a list of
> > > > indexes (of which each search engine might provide several) to
> > > > search in, but by default search in all of them (if
> > > > appropriate). I
> > >
> > > Well, the search engines are not obliged to use a particular index
> > > format. The indexes themselves can be of any format.
> >
> > By "index" I mean an abstract reference to something considered an
> > index by the backend. The consequence is that the user (or rather
> > the application) only sees the indexes, not the engines that
> > provide them (because that is not very important).
> 
> Ok, I'm with you now :-) It is the same as the "group" switch of the
> current draft on the wiki. E.g. searching for "magnus group:contacts"
> searches only through the contacts "index". I'm very strongly in
> favor of this, although some have spoken for putting this "grouping"
> functionality on the client side. An example of client side grouping
> could be a music application, where searching for "foo fighters"
> would add "mime:audio/*" to the query before sending it. As I said,
> I'm not for client side grouping; a server side grouping could still
> facilitate a client side grouping anyway.

My idea of "index" was a more abstract alternative to "search engine"
or "backend" (since I assume several of those can run and have their
search results merged). If one single search engine/backend has several
indexes, I thought it could be for reasons such as the indexes being
created by different users (one for each user and one for system files
like man-pages perhaps) or residing on different computers. But this is
probably beyond the scope of the simple interface, which should just
trust that the appropriate indexes are searched (and that the
appropriate search engines/backends are used).

I'm not sure I understand exactly what "group" means in the draft. Is
it rather some predefined or user-defined categories that files are
sorted under, automatically or manually by the user? Some kind of tags
to categorize data?

> > > > Daemon or no daemon, that is the question. This is a question
> > > > that without doubt will arise (it always does). First we need
> > > > to clarify that there is a difference between a daemon doing
> > > > the indexing of documents (or rather detecting new documents
> > > > that need to be indexed) and a daemon performing the search (and
> > > > possibly merging several searches). Most search engines I use
> > > > don't have a daemon for doing the searches (instead they only
> > > > provide a library), because that is seldom considered required.
> > > > Indexes are read only (when searching) so the common problems
> > > > daemons are used to solve are not present.
> > >
> > > The situation at hand is that we have a  handful of desktop search
> > > engines, all implemented as daemons, both handling searches and
> > > indexing. Having an extra daemon on top of that handling the query
> > > one extra time before passing it to the search subsystem seems
> > > overkill... Ideally I see the daemon/lib (or even executable) to
> > > only be used as a means of obtaining a dbus object path given a
> > > dbus interface name ("org.freedesktop.search.simple").
> >
> > I have some experience of search engines in general (and I have no
> > idea in what way a "desktop search engine" is different). And to my
> > knowledge the majority does not have a daemon performing the
> > searches, rather a library. They might have a daemon doing the
> > indexing (and detecting new documents), but that's not the same
> > thing.
> 
> Well, having a lib with no daemon associated used for searching is
> possible, but there's still response time to consider. I don't know
> about other indexers but I would hate to create a new Lucene
> IndexSearcher for each app that wants to do searches; this is a costly
> affair timewise and memorywise. A daemon holding a singleton
> IndexSearcher (or managed pool) can be more resource friendly here.

I assume the situation is pretty much the same as with (SQL) databases.
For example, MySQL works fine as a daemon and embedded as a library too.
But in this case it is less complicated since the library only reads
from the index(es). I don't have much experience with Lucene (it has
been years since I even looked at it), so I'm sorry, I don't know what
creating a new Lucene IndexSearcher involves. But I assume it means
initializing the engine. So it's obviously faster to do that only once
(which is the same with MySQL, yet people still insist on embedding
it). So my suggestion is this:

1. Applications can talk with the daemon using the protocol directly,
like this:

 ,-------------,
 | Application |
 `----.--------'
      |
   protocol
      |
  ,---^-----.
  | Daemon  |
  >---------<
  | library |
  >---------<
  | backend |
  | plugin  |
  `---------'

2. Applications can use the library either to communicate with the
daemon or to load the backend plugin directly, like this:

  ,---------------------.
  |     Application     |
  >---------------------<
  |       library       |
  >--------.-.----------<
  | engine | | protocol |
  | plugin | | plugin   |
  >--------< `----.-----'
  | search |      |
  | engine |   protocol
  `--------'      |
              ,---^----,
              | daemon |
              `--------'

If the library is used, then the library decides which "path" to take
to each index (if using the daemon to reach a certain backend is more
efficient, then the daemon will be used if it's available). And since
the daemon uses the very same library to do the very same thing, it must
of course be smart enough not to create an infinite loop by contacting
itself (I'll explain that in more detail if required).
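
A rough Python sketch of the routing decision described above, under
stated assumptions: the names `Route`, `choose_route` and the
`inside_daemon` flag are hypothetical and not part of any existing
library, and the integer `cost` is just a stand-in for whatever measure
of "more efficient" the library would really use. The `inside_daemon`
flag is one simple way the daemon could avoid contacting itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    index: str  # abstract index name, e.g. "docs"
    via: str    # "daemon" (through the protocol) or "plugin" (loaded directly)
    cost: int   # hypothetical efficiency measure; lower is better


def choose_route(routes, index, daemon_available, inside_daemon=False):
    """Return the cheapest usable route to `index`, or None if there is none."""
    usable = []
    for route in routes:
        if route.index != index:
            continue
        if route.via == "daemon":
            # Skip the daemon if it is not running, or if this code is
            # running inside the daemon itself (which would loop forever).
            if not daemon_available or inside_daemon:
                continue
        usable.append(route)
    return min(usable, key=lambda r: r.cost, default=None)
```

An application would call `choose_route(routes, "docs", daemon_available=True)`,
while the daemon, using the same library, would pass `inside_daemon=True`
and therefore always end up at a directly loaded plugin.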

> > Has anyone thought about having a general purpose naming service based
> > on dbus and avahi (like CORBAs CosNaming)? Or is there already
> > something like that, that I have missed?
> 
> I believe you are asking about dbus activation?
> http://raphael.slinckx.net/blog/documents/dbus-tutorial/
> I don't know what CosNaming is about...

No, not really. I was thinking the other way around. When services
become available they register themselves with a naming service,
telling what service it is they provide and how to find them. In
other words, what avahi (dns-sd) does, but without requiring the service
to have an IANA registered TCP port, and without the text length
limitation of dns-sd.
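
A minimal in-memory sketch of such a naming service in Python, only to
make the idea concrete. Everything here is an assumption for
illustration: the class `NamingService`, its methods, and the example
address are hypothetical, and a real service would of course live in
its own process and be reachable over dbus or similar.

```python
class NamingService:
    """Maps an interface name to the addresses of services providing it."""

    def __init__(self):
        self._entries = {}  # interface name -> list of (address, description)

    def register(self, interface, address, description=""):
        # The address is a free-form string, so no registered TCP port is
        # needed, and the description has no length limit.
        self._entries.setdefault(interface, []).append((address, description))

    def unregister(self, interface, address):
        remaining = [
            entry for entry in self._entries.get(interface, [])
            if entry[0] != address
        ]
        if remaining:
            self._entries[interface] = remaining
        else:
            self._entries.pop(interface, None)

    def lookup(self, interface):
        """Return the addresses of all services providing `interface`."""
        return [address for address, _ in self._entries.get(interface, [])]
```

A search engine would call `register("org.freedesktop.search.simple",
some_address)` when it starts, and clients would call `lookup()` with the
same interface name to find it.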

> > > As you point out, having a separate daemon other than the indexer,
> > > is not exactly standard (at least not to my knowledge). Also a
> > > managing daemon is likely to re-invent functionality dbus already
> > > provides IMHO.
> >
> > That might very well be the case. As I mentioned, I have though
> > implemented an (in-house) daemon doing this. But its use is
> > mainly to cache searches made through a web-interface (which has to
> > be stateless since there are several web-servers sharing the load).
> >
> > But is your idea not to use dbus at all then (except for finding the
> > search services), but a library instead?
> 
> The idea is to have the indexer/search engine expose the wasabi api
> over dbus.

But the wasabi api itself does not use dbus, right? It is rather a
library wrapping some sort of ipc mechanism provided by the search
engine?

> > > > One of the plugins is the one using the dbus search
> > > > interface. Other plugins could be made for existing search
> > > > engines like Lucene, Swish(++|E), mnoGoSearch, Xapian,
> > > > ht://Dig, Datapark, (hyper)estraier, Glimpse, Namazu, Sherlock
> > > > Holmes and all the others. Which would surely be a lot easier
> > > > than convincing each of them to implement a daemon which
> > > > provides a dbus interface.
> > >
> > > Well, what you are suggesting sounds like the opposite of the
> > > current goal. If I understand right you suggest creating a
> > > wrapper lib for each possible search backend, as opposed to the
> > > current idea - to promote a shared dbus interface. I see it this
> > > way: Dbus is the de facto standard for desktop ipc. It is
> > > actually really easy (and portable between toolkits) to expose a
> > > dbus api. Implementing a backend for each (custom) communications
> > > api sounds like a great deal more work, with possibility for more
> > > bugs...
> >
> > Yes, I'm suggesting a design that makes it possible to create a
> > wrapper lib for each possible search backend. But certainly not
> > *opposed* to the current idea, rather *in addition* to it. If there
> > is a possibility to load plugins, the dbus communication can be
> > done with one of them. Other plugins doing things differently can
> > exist in parallel (and don't have to be implemented right now
> > either).
> 
> I'm slowly begging to see the light. This will require a lot more
> code for Wasabi than I was hoping for though - however I do like the
> idea very much.

Hmm... I guess a typo made that sentence quite ambiguous. Are you
beginning to see the light or begging me to see the light?

> > > It's only guesswork, but I will bet that it is hard work maintaining a
> > > cross platform library doing all this. If we restricted to one
> > > platform it would be another deal.
> >
> > You mean a cross-platform library doing plugin-loading (other than
> > that I don't believe my suggestion is more complicated in any way).
> > I think you're right that plugin loading is tricky to port to some
> > platforms. But I guess (or at least hope) that those issues are
> > taken care of by gmodule. I only have experience maintaining
> > applications for Linux, IRIX and Solaris. And those systems are
> > very similar.
> 
> gmodule would make us depend on glib, which might very well pose a
> problem to the KDE/Qt guys among us... We could use libltdl which is
> a GNU portable dlopen wrapper that is pretty standard. See e.g.
> http://www.delorie.com/gnu/docs/libtool/libtool_46.html

No problem then. =)

> > > > One thing that English users seldom consider is the use of
> > > > several languages. Which language is being used is important to
> > > > know in order to decide what stemming rules to use, and which
> > > > stop-words to use (in English "the" is a stop-word while in
> > > > Swedish it means tea and is something that is adequate to
> > > > search for). People using other languages are very often
> > > > multilingual (using English as well). Therefore it is
> > > > interesting to know which language the query is in (search
> > > > engines might also be able to translate queries to search in
> > > > documents written in different languages).
> > >
> > > This is a good point. However I suggest leaving this up to the
> > > actual implementations. After all it is an indexing time question
> > > what stemmer to use when indexing a document...
> >
> > Yes, absolutely. But for them to do their job they might need to
> > know what language the query is supposed to be in. This might be
> > supported in the query language (or perhaps as an argument to the
> > function creating a search from the query).
> 
> Well, I think the search engine could assume that the query was in the
> current locale unless explicitly told otherwise through some query
> option?

Yes.
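
A small Python sketch of that fallback, assuming a query option of the
form `lang:xx`. That option name is hypothetical (modelled on the
`group:` switch mentioned earlier) and is not something the draft
defines; `query_language` and `current_locale_language` are likewise
made-up names. The locale lookup uses the standard `locale.getlocale()`,
whose result format varies between platforms, so the parsing of it here
is only an approximation.

```python
import locale


def current_locale_language():
    """Best-effort language code from the current locale, or None if unknown."""
    code = locale.getlocale()[0]  # e.g. "sv_SE"; None for the "C" locale
    return code.split("_")[0].lower() if code else None


def query_language(query, default=None):
    """Split off a hypothetical `lang:xx` option from the query.

    Returns (language, remaining_query). If the query carries no language
    option, fall back to `default`, and then to the current locale.
    """
    language = None
    words = []
    for word in query.split():
        if word.startswith("lang:") and len(word) > len("lang:"):
            language = word[len("lang:"):]
        else:
            words.append(word)
    if language is None:
        language = default or current_locale_language()
    return language, " ".join(words)
```

With this, a query like `the lang:sv` would be treated as Swedish (so
"the" is not dropped as a stop-word), while a plain `the` would be
interpreted in whatever language the user's locale says.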