simple search api (was Re: mimetype standardisation by testsets)

Thu Nov 23 21:44:28 EET 2006

2006/11/23, Fabrice Colin <fabrice.colin at gmail.com>:
>
> Hello all,
>
> On 11/23/06, Mikkel Kamstrup Erlandsen <mikkel.kamstrup at gmail.com> wrote:
> > 2006/11/22, Magnus Bergman <magnus.bergman at observer.net>:
> > > If several search engines are available, the search manager lets the
> > > client know of each search engine according to your proposal (right?).
> > > I think it would be a better idea to present a list of indexes (of which
> > > each search engine might provide several) to search in, but by default
> > > search in all of them (if appropriate). I
> >
> > Well, the search engines are not obliged to use a particular index format.
> > The indexes themselves can be of any format.
> >
> What Magnus suggests may be useful for document 'sources' or 'groups' (for
> lack of a better name), e.g. "Documents", "Applications", "Contacts",
> "Conversations" etc... -as offered by some existing personal search systems-
> which may or may not map to individual indexes (that mapping being irrelevant).

That was exactly what I meant to cover with the "group" switch. For example,
the query "fabrice group:contacts" would return you. Searching without a
specified group would return matches from all groups. Perhaps the wiki is a
bit unclear here...
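
To make the switch concrete, here is a minimal Python sketch of how a query
string could be split into search terms and an optional group. The function
name parse_query and the convention that None means "all groups" are
assumptions made for this illustration; neither is defined on the wiki.

```python
def parse_query(query):
    """Split a query into free-text terms and an optional group name.

    Hypothetical helper: a token of the form "group:<name>" selects a
    group; a group of None is taken to mean "search all groups".
    """
    prefix = "group:"
    terms = []
    group = None
    for token in query.split():
        if token.lower().startswith(prefix) and len(token) > len(prefix):
            # Group names are compared case-insensitively in this sketch.
            group = token[len(prefix):].lower()
        else:
            terms.append(token)
    return terms, group


print(parse_query("fabrice group:contacts"))  # (['fabrice'], 'contacts')
print(parse_query("fabrice"))                 # (['fabrice'], None)
```

If several group switches are given, this sketch simply keeps the last one;
a real implementation would have to decide what that case should mean.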

> > > In addition to this session object I have found it suitable to also
> > > have a search object (created from a query) because applications might
> > > construct very complicated queries. This object can then be passed
> > > to countHits, and used for getting the hits. And also for getting
> > > attributes of the hit (matching document, score, language and such).
> > > (Note that a hit is not equivalent to a document.)
> >
> > The problem with creating query objects like this, is that we are creating a
> > dbus api. Essentially you only have simple data types at your hand. No
> > objects - especially objects with methods on them :-) It would be possible
> > to create a helper lib in <insert favorite language + toolkit> to construct
> > queries conforming to the wasabi spec, but this would require separate libs
> > for gobject and qt. While this is by no means ruled out, I think we better
> > focus on the "bare" dbus api for now.
> >

> Well, AFAIK, dbus allows complex structures like arrays or dictionaries.

Yeah, but those really only count as collections of simple data types in
my book. What I meant was just that you can't have a Query object, like e.g.
Lucene does, and pass that over the wire. Not in a desktop-neutral way at
least - or please correct me if I'm wrong! :-)
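
A rough Python sketch of the distinction, without touching any real dbus
binding: an object with methods does not qualify for the wire, but a
dictionary of strings and lists flattened from it does. The Query class, its
to_wire method and the is_wire_friendly check are all hypothetical, and the
list of accepted types only approximates what dbus can actually marshal.

```python
# Rough stand-ins for the basic types a message bus can carry.
SIMPLE_TYPES = (str, int, float, bool)


def is_wire_friendly(value):
    """Return True if value is built only from simple types and collections."""
    if isinstance(value, SIMPLE_TYPES):
        return True
    if isinstance(value, (list, tuple)):
        return all(is_wire_friendly(item) for item in value)
    if isinstance(value, dict):
        return all(
            isinstance(key, str) and is_wire_friendly(item)
            for key, item in value.items()
        )
    return False


class Query:
    """Hypothetical client-side query object (not part of any spec)."""

    def __init__(self, terms, group=""):
        self.terms = list(terms)
        self.group = group

    def to_wire(self):
        # Flatten the object into plain data that could cross the bus.
        return {"terms": self.terms, "group": self.group}


query = Query(["fabrice"], group="contacts")
print(is_wire_friendly(query))            # False
print(is_wire_friendly(query.to_wire()))  # True
```

A helper lib per toolkit would essentially be the Query class above; the
"bare" api only ever sees what to_wire produces.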

> > The situation at hand is that we have a handful of desktop search engines,
> > all implemented as daemons, both handling searches and indexing. Having an
> > extra daemon on top of that handling the query one extra time before passing
> > it to the search subsystem seems overkill... Ideally I see the daemon/lib
> > (or even executable) to only be used as a means of obtaining a dbus object
> > path given a dbus interface name ("org.freedesktop.search.simple").
> >
> Agreed. The daemon's role would probably also include filtering out search
> services based on user preferences, wouldn't it ?

Yeah, that was my idea at least. Perform a selection based on some sane
criteria (read: user configuration). My idea was that the api consumer only
needs to call getInterfaceProvider("org.freedesktop.search.simple") and
then gets one object path back to use for the dbus connection.
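
As a sketch of that selection step, assuming the manager keeps a plain table
of interface names to candidate object paths: the object paths, the
"preferred" and "disabled" settings and the Python spelling of the function
are all invented for this example, and no actual dbus call is made.

```python
def get_interface_provider(interface, providers, preferred=None, disabled=()):
    """Pick one object path for an interface, honouring user preferences.

    providers maps an interface name to a list of object paths.
    Returns None when no enabled provider is registered.
    """
    candidates = [
        path for path in providers.get(interface, []) if path not in disabled
    ]
    if not candidates:
        return None
    if preferred in candidates:
        return preferred
    # Fall back to the first registered provider that is still enabled.
    return candidates[0]


# Hypothetical registry; these object paths do not belong to real engines.
PROVIDERS = {
    "org.freedesktop.search.simple": [
        "/org/example/EngineA",
        "/org/example/EngineB",
    ],
}

print(get_interface_provider("org.freedesktop.search.simple", PROVIDERS))
print(get_interface_provider(
    "org.freedesktop.search.simple", PROVIDERS, preferred="/org/example/EngineB"
))
```

The consumer gets exactly one path back and talks to that engine directly
afterwards, so the manager never sees the queries themselves.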

> > > One thing that English users seldom consider is the usage of several
> > > languages. Which language is being used is important to know in order
> > > to decide what stemming rules to use, and which stop-words to use (in
> > > English "the" is a stop-word while in Swedish it means tea and is
> > > something that is adequate to search for). People using other languages
> > > are very often multilingual (using English as well). Therefore it is
> > > interesting to know which language the query is in (search engines
> > > might also be able to translate queries to search in documents written
> > > in different languages).
> > >
> >
> > This is a good point. However I suggest leaving this up to the actual
> > implementations. After all it is an indexing time question what stemmer to
> > use when indexing a document...
> >
> The language is also useful at query time for the query to be parsed & tokenized
> in a way that's consistent with how documents text was at indexing time.
> For instance, if the query is in English -as Magnus points out- you may want to
> remove English stopwords, run an English stemmer on terms, or even limit the
> search to documents that were detected as being in English at indexing time.

Right you are. I was a bit wasted last night when I answered Magnus (sorry)
- I just thought he deserved an answer sooner rather than later.

The question is then whether this info should be stored in the manager daemon
or the search engine. As I consider it more or less a design goal that the
daemon (or lib or whatever we end up with) should be expendable, I don't
think such info should lie with the managing object. Also, if this info
resided with the managing object, each query would have to go
through the managing interface, and I don't think I'm totally hooked on that
idea.

To avoid code duplication we could develop a small lib or other dbus service
to *optionally* handle these issues. I'm reluctant to impose any dependency
on the implementing engines.
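
Such an optional helper could look roughly like the Python sketch below. It
only does language-dependent stop-word removal, using Magnus' "the" example;
the tiny stop-word lists, the language codes and the name normalise_query are
made up for the illustration, and no stemming is attempted.

```python
# Deliberately tiny, illustrative stop-word lists keyed by language code.
STOP_WORDS = {
    "en": {"the", "a", "an", "and", "of"},
    "sv": {"och", "att", "det", "en"},
}


def normalise_query(query, language):
    """Lowercase and tokenize a query, dropping that language's stop-words.

    Unknown languages get no stop-word filtering at all.
    """
    stop_words = STOP_WORDS.get(language, set())
    return [token for token in query.lower().split() if token not in stop_words]


print(normalise_query("The history of tea", "en"))  # ['history', 'tea']
print(normalise_query("the", "sv"))                 # ['the']
```

An engine that already handles this at query time would simply not call the
helper, which keeps it optional.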

Cheers,
Mikkel