simple search api (was Re: mimetype standardisation by testsets)
magnus.bergman at observer.net
Fri Nov 24 03:42:24 PST 2006
On Thu, 23 Nov 2006 08:33:29 +0100
"Jos van den Oever" <jvdoever at gmail.com> wrote:
> Hi Magnus,
> Great to have your opinion here.
> > If several search engines are available, the search manager lets the
> > client know of each search engine according to your proposal
> > (right?).
> That's the idea. Mikkel doesn't like the idea of a daemon for keeping
> track of the search engines. We'll either have a daemon or a library.
If you consider my idea, you'll see it's a way to avoid this issue:
There will be a library (to keep track of engines and such), which
applications can use directly if they want to (though this may add
extra responsibilities for the application). And also a daemon, which
is a quite tiny application on top of the library. You get both things
without extra cost. (And you also get the possibility to chain
daemons, but I guess that's not a topic right now.)
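To make the layering concrete, here is a minimal sketch. All names
(SearchLibrary, SearchDaemon, register_engine) are hypothetical, not
part of any proposed API; the point is only that the daemon is a thin
shell over the same library an application could embed directly:

```python
class SearchLibrary:
    """Hypothetical: keeps track of the available engines; an
    application may use this class directly if it wants to."""
    def __init__(self):
        self.engines = {}

    def register_engine(self, name, search_fn):
        self.engines[name] = search_fn

    def search(self, query):
        # Merge the hits from every registered engine.
        hits = []
        for name, search_fn in self.engines.items():
            hits.extend(search_fn(query))
        return hits

class SearchDaemon:
    """A quite tiny application on top of the library (it would be
    exported over dbus); chaining daemons would just mean registering
    a remote daemon as one more engine."""
    def __init__(self, library):
        self.library = library

    def handle_request(self, query):
        return self.library.search(query)

lib = SearchLibrary()
lib.register_engine("dummy", lambda q: [q + "-hit"])
daemon = SearchDaemon(lib)
```

Either path (direct call or daemon call) runs the same code, which is
why both come at virtually no extra cost.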
> > I think it would be a better idea to present a list of indexes (of
> > which each search engine might provide several) to search in, but
> > by default search in all of them (if appropriate). Instead of
> > registering the search engine I think it's better to think in
> > terms of creating a session (which might still do exactly the same
> > thing). Because this should affect all appropriate search engines
> > transparently. And because it might be desired to alter some
> > options for the session (language, fuzziness, search contexts and
> > such).
> Yes, I see your point. One application where search is important, is
> searching in a folder. If there is no index to search quickly, the
> search will have to be done by scanning the files. (Strigi has good
> programs to do this fallback efficiently, which are currently being
> sped up.) The user should not have to care where the indexes are. For
> each search domain it should be clear what is included in the search
> and what isn't.
Yes, that could be one way to use it. Could you please clarify what a
search domain is?
> Creating a session, where some settings about the search are stored
> sounds like a nice idea. It is however not much different from having
> function calls that include these parameters.
This would be the same thing as having only one session (per
dbus-connection?) for which you set these parameters, right?
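The two variants being compared can be sketched side by side. The
Session class, its option names, and the search function here are all
hypothetical, chosen only to illustrate the trade-off:

```python
class Session:
    """Hypothetical session object: stores per-client search options
    (one such session could exist per dbus-connection) so they need
    not be repeated on every call."""
    def __init__(self, language="en", fuzziness=0.0):
        self.language = language
        self.fuzziness = fuzziness

# The alternative discussed above: plain function calls that carry
# the same parameters each time.
def search(query, language="en", fuzziness=0.0):
    # A real implementation would consult the indexes; this stub just
    # echoes the effective settings.
    return {"query": query, "language": language, "fuzziness": fuzziness}

session = Session(language="sv", fuzziness=0.2)
result = search("tomte", session.language, session.fuzziness)
```

Functionally equivalent, as noted; the session merely factors the
parameters out of every call.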
> > In addition to this session object I have found it suitable to also
> > have a search object (created from a query) because applications
> > might construct very complicated queries. This object can then be
> > passed to countHits, and used for getting the hits. And also for
> > getting attributes of the hit (matching document, score, language
> > and such).
> Functionally this is equivalent to the current API, except for the
> fact that the query is parsed by the client instead of the daemon.
> This has two disadvantages:
> - all clients must be able to parse queries or at least be able to
> construct them
> - the object will be complex and hard to describe over DBus
I mean you create a search from a query (and possibly a session),
something like this:
search_handle = search_new(session_handle, query)
This function could perhaps return before the search is fully carried
out. The point is that the (possibly big and complex) query doesn't
need to be sent more than once. Then there is the issue of creating a
standard query language.
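A minimal sketch of that handle-based flow, with countHits and the
hit-fetching call spelled count_hits/get_hits, and an in-memory
stand-in for an index (everything here is hypothetical):

```python
# Hypothetical stand-in index; a real engine would search on disk.
INDEX = ["alpha.txt", "beta.txt", "alphabet.txt"]
_searches = {}

def search_new(session_handle, query):
    """Send the (possibly big and complex) query once and return a
    cheap handle. A real implementation could return this handle
    before the search is fully carried out."""
    handle = len(_searches) + 1
    _searches[handle] = [doc for doc in INDEX if query in doc]
    return handle

def count_hits(handle):
    return len(_searches[handle])

def get_hits(handle, offset, limit):
    return _searches[handle][offset:offset + limit]

h = search_new(0, "alpha")
```

Every call after search_new passes only the small handle, never the
query itself.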
> > (Note that a hit is not equivalent to a document.)
> In the API a hit is equivalent to 'something' which can be accessed
> over a URL.
Well, that is more what I call a document. A hit also has a score, some
info about why the document matches (and perhaps information about how
to view the document).
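The hit/document distinction can be written down as a small record.
The field set is a hypothetical illustration of the attributes
mentioned above, not a proposed wire format:

```python
from dataclasses import dataclass

@dataclass
class Hit:
    """Hypothetical: a hit is more than the matching document."""
    uri: str        # the document itself, accessible over a URL
    score: float    # how well it matched
    snippet: str    # some info about why the document matches
    language: str   # could help a client decide how to view it

hit = Hit(uri="file:///home/me/notes.txt", score=0.87,
          snippet="... simple search *api* ...", language="en")
```

An API that returns only the uri collapses the hit into the document;
returning the whole record keeps them distinct.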
> > Daemon or no daemon, that is the question. This is a question that
> > without doubt will arise (it always does). First we need to clarify
> > that there is a difference between a daemon doing the indexing of
> > documents (or rather detecting new documents that need to be indexed)
> > and a daemon performing the search (and possibly merging several
> > searches). Most search engines I use don't have a daemon for doing
> > the searches (instead the only provide a library), because that is
> > seldom considered required. Indexes are read only (then searching)
> > so the common problems daemons are used to solve are not present.
> Yes, both are good reasons for a daemon. Instant indexing, caching,
> merging and live queries all require a daemon (or temp files). You
> could shift caching, merging and live queries to the clients. Then
> you'd need to have a good library with all the required language
> bindings.
Instant indexing can be done by a separate daemon which is not at all
involved in handling queries. This is how almost every implementation
works, in my experience. A daemon sure helps for merging searches, but
it is not required. It is quite possible to put this inside each
application (using a library) too. For the record, I would prefer a
daemon. But it seems embedding is very popular. Look at MySQL, for
example.
Or do you perhaps by "live queries" refer to queries that are stored
and will result in a hit as soon as some matching document turns up
some time in the future? (Something that is often called search
> > My solution (which took me quite a while to develop) might seem
> > overly complicated at first, but I think it really isn't. It was to
> > implement all functionality (including caching and merging of
> > searches) in a library. That library can be used by an application
> > to do everything.
> This would mean you have many different caches with a resulting memory
> overhead. Also the language binding problem is still there. The nice
> thing about DBus is the automatic language binding it entails. Of
> course for the live queries, as long as the simple API does not have
> them, you can let the client poll. With good caching on the client
> side, this should not cause a large overhead.
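The polling-with-client-side-caching fallback described in the quoted
paragraph can be sketched like this (the cache shape, TTL, and
function names are all hypothetical):

```python
import time

_cache = {}       # query -> (timestamp, hits)
CACHE_TTL = 5.0   # seconds; assumed freshness window

def poll(query, run_query):
    """Re-run the query only when the cached result has gone stale;
    otherwise answer from the client-side cache, so polling in place
    of live queries does not cause a large overhead."""
    now = time.time()
    if query in _cache and now - _cache[query][0] < CACHE_TTL:
        return _cache[query][1]
    hits = run_query(query)
    _cache[query] = (now, hits)
    return hits

calls = []
def fake_engine(q):
    calls.append(q)
    return [q + "-hit"]

first = poll("x", fake_engine)
second = poll("x", fake_engine)   # served from the cache
```

Within the TTL the engine is queried once no matter how often the
client polls.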
Yes, it sure can be implemented in an inefficient, memory-consuming
way. But it can be implemented in a sensible way too. The way I
implemented it, it works like this:
Each plugin tells the library what it's capable of (including caching),
and then the application tells the library what it can handle by itself
(by default nothing). And then the library does everything that no-one
else does.
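That negotiation is essentially set subtraction. A tiny sketch, with
an assumed task set of caching, merging, and live queries (the names
and the helper are hypothetical):

```python
# Hypothetical capability negotiation: each plugin declares what it
# can do, the application declares what it handles by itself (default:
# nothing), and the library implements whatever is left over.
ALL_TASKS = {"caching", "merging", "live-queries"}

def library_tasks(plugin_caps, app_caps=frozenset()):
    """Tasks the library must take care of itself."""
    return ALL_TASKS - set(plugin_caps) - set(app_caps)

# A plugin that caches on its own and an app that merges on its own
# leave only live queries for the library:
leftover = library_tasks({"caching"}, {"merging"})
```

Nothing is done twice, which is why this avoids the duplicated-cache
memory overhead.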
Please note that my idea of a possibility to use a library directly by
no means hinders the use of a daemon. The whole point is that you get
both. And best of all, at virtually no additional cost since everything
is implemented in the same library. All this becomes possible:
| Application |
|   Daemon    |    | Application |
|            library             |
|   engine    |    |  protocol   |
|   plugin    |    |  plugin     |
|   search    |          |
|   engine    |       protocol
                         |
                   |   search    |
                   |   daemon    |
> > Or the application can use it just to contact a daemon (which of
> > course also uses the very same library for everything it does).
> > This also has the nice side effect that daemons can be chained, so
> > searches can span over several computers (if it supports at least
> > one network transparent communication mechanism). I think it would
> > also be a good idea for the library to support plugins for
> > different search engines/communication mechanisms. One of the
> > plugins is the one using the dbus search interface. Other plugins
> > could be made for existing search engines like Lucene, Swish(++|E),
> > mnoGoSearch, Xapian, ht://Dig, Datapark, (hyper)estraier, Glimpse,
> > Namazu, Sherlock Holmes and all the others. Which would surely be a
> > lot easier than convincing each of them to implement a daemon which
> > provides a dbus interface.
> Yes, this is all very nice and can be implemented by writing a DBus /
> search engine interface for these programs.
I personally think it's slightly easier to create a thin wrapper in
the form of a plugin than to create a search server exporting a dbus
interface. This question is probably mostly about which project has to
do the job.
By the way, I'm not at all against specifying a dbus interface (if it
looks that way). I think it's a great idea. But I also think it's
important to look at the big picture. And even more important not to
make any unnecessary assumptions, which might limit the usefulness in
the future.
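To show what such a thin wrapper plugin might amount to, here is a
sketch. The plugin interface and the adapter are hypothetical, and the
lambda stands in for a real engine binding (Xapian, Lucene, ...); no
real engine API is being described:

```python
class EnginePlugin:
    """Hypothetical common plugin interface the library would load."""
    def search(self, query):
        raise NotImplementedError

class WrapperPlugin(EnginePlugin):
    """Thin adapter around an existing engine's own query function;
    the engine itself never has to know about dbus."""
    def __init__(self, engine_query):
        self.engine_query = engine_query

    def search(self, query):
        return self.engine_query(query)

# Stand-in for a real engine binding:
plugin = WrapperPlugin(lambda q: [q.upper()])
```

The wrapper is a few lines per engine, versus a full dbus-exporting
server inside each engine project.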