simple search api (was Re: mimetype standardisation by testsets)
fabrice.colin at gmail.com
Thu Nov 23 18:05:53 EET 2006
On 11/23/06, Mikkel Kamstrup Erlandsen <mikkel.kamstrup at gmail.com> wrote:
> 2006/11/22, Magnus Bergman <magnus.bergman at observer.net>:
> > If several search engines are available, the search manager lets the
> > client know of each search engine according to your proposal (right?).
> > I think it would be a better idea to present a list of indexes (of which
> > each search engine might provide several) to search in, but by default
> > search in all of them (if appropriate).
> Well, the search engines are not obliged to use a particular index format.
> The indexes themselves can be of any format.
What Magnus suggests may be useful for document 'sources' or 'groups' (for
lack of a better name), e.g. "Documents", "Applications", "Contacts",
"Conversations" etc., as offered by some existing personal search systems.
These may or may not map to individual indexes (that mapping being irrelevant
to the client).
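
To make the idea concrete, here is a rough Python sketch. It is only an
illustration: the source names come from the list above, but the index
identifiers and the indexes_for() helper are made up and not part of any
proposal. The client picks sources by name and the mapping to indexes
stays hidden behind the helper; with no selection, everything is searched.

```python
# Hypothetical mapping from user-visible sources to backend indexes.
# The index identifiers are invented for this example.
SOURCES = {
    "Documents": ["engine-a/files"],
    "Contacts": ["engine-a/addressbook", "engine-b/contacts"],
    "Conversations": ["engine-b/chat"],
}


def indexes_for(selected=None):
    """Return the indexes to query; no selection means search everywhere."""
    names = list(SOURCES) if not selected else selected
    seen, result = set(), []
    for name in names:
        for index in SOURCES[name]:
            # Keep first-seen order and avoid querying an index twice.
            if index not in seen:
                seen.add(index)
                result.append(index)
    return result


if __name__ == "__main__":
    print(indexes_for())              # default: all sources
    print(indexes_for(["Contacts"]))  # restricted to one source
```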
> > In addition to this session object I have found it suitable to also
> > have a search object (created from a query) because applications might
> > construct very complicated queries. This object can then be passed
> > to countHits, and used for getting the hits. And also for getting
> > attributes of the hit (matching document, score, language and such).
> > (Note that a hit is not equivalent to a document.)
> The problem with creating query objects like this is that we are creating a
> dbus api. Essentially you only have simple data types at your hand. No
> objects - especially objects with methods on them :-) It would be possible
> to create a helper lib in <insert favorite language + toolkit> to construct
> queries conforming to the wasabi spec, but this would require separate libs
> for gobject and qt. While this is by no means ruled out, I think we had
> better focus on the "bare" dbus api for now.
Well, AFAIK, dbus allows complex structures like arrays or dictionaries.
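
As a rough illustration, a structured query could be carried as a dictionary
of plain values. The Python sketch below does not use any dbus binding; it
only derives the D-Bus type signature such a value would have. The query
keys ("text", "sources", "max_hits") and the dbus_signature() helper are
assumptions made up for this example, and the type mapping is simplified
(every integer is treated as a 32-bit "i", dictionary keys must be strings).

```python
def dbus_signature(value):
    """Derive a (simplified) D-Bus type signature for a plain Python value."""
    if isinstance(value, bool):   # test bool first: bool is a subclass of int
        return "b"
    if isinstance(value, int):
        return "i"
    if isinstance(value, float):
        return "d"
    if isinstance(value, str):
        return "s"
    if isinstance(value, list):
        kinds = {dbus_signature(v) for v in value}
        # Homogeneous arrays keep their element type, others fall back to variants.
        return "a" + (kinds.pop() if len(kinds) == 1 else "v")
    if isinstance(value, dict):
        if not all(isinstance(k, str) for k in value):
            raise TypeError("this sketch only supports string keys")
        kinds = {dbus_signature(v) for v in value.values()}
        return "a{s" + (kinds.pop() if len(kinds) == 1 else "v") + "}"
    raise TypeError("unsupported type: %r" % type(value))


# Hypothetical query layout, not taken from any spec.
query = {
    "text": "hello world",
    "sources": ["Documents", "Contacts"],
    "max_hits": 20,
}

if __name__ == "__main__":
    print(dbus_signature(query["sources"]))
    print(dbus_signature(query))
```

A dictionary with mixed value types ends up as a string-to-variant map,
which is the usual way of passing loosely structured data over dbus.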
> The situation at hand is that we have a handful of desktop search engines,
> all implemented as daemons, both handling searches and indexing. Having an
> extra daemon on top of that handling the query one extra time before passing
> it to the search subsystem seems overkill... Ideally I see the daemon/lib
> (or even executable) to only be used as a means of obtaining a dbus object
> path given a dbus interface name ("org.freedesktop.search.simple").
Agreed. The daemon's role would probably also include filtering out search
services based on user preferences, wouldn't it?
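
Something like the following Python sketch is what I have in mind. It is
purely illustrative: the service names, object paths and the object_paths()
helper are invented, and only the interface name comes from the discussion
above. The lookup maps an interface name to object paths and skips services
the user has disabled.

```python
SIMPLE_SEARCH = "org.freedesktop.search.simple"

# Hypothetical registry of (service name, interface, object path) entries.
SERVICES = [
    ("engine-a", SIMPLE_SEARCH, "/org/example/EngineA/Search"),
    ("engine-b", SIMPLE_SEARCH, "/org/example/EngineB/Search"),
    ("engine-c", "org.example.other", "/org/example/EngineC/Other"),
]


def object_paths(interface, disabled=frozenset()):
    """Return object paths implementing `interface`, minus disabled services."""
    return [
        path
        for name, iface, path in SERVICES
        if iface == interface and name not in disabled
    ]


if __name__ == "__main__":
    # The user preference here is simply a set of service names to ignore.
    print(object_paths(SIMPLE_SEARCH, disabled={"engine-b"}))
```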
> > One thing that English users seldom consider is the use of several
> > languages. Which language is being used is important to know in order
> > to decide what stemming rules to use, and which stop-words to use (in
> > English "the" is a stop-word, while in Swedish it means tea and is
> > something that is reasonable to search for). People using other languages
> > are very often multilingual (using English as well). Therefore it is
> > interesting to know which language the query is in (search engines
> > might also be able to translate queries to search in documents written
> > in different languages).
> This is a good point. However I suggest leaving this up to the actual
> implementations. After all it is an indexing time question what stemmer to
> use when indexing a document...
The language is also useful at query time, so that the query is parsed and
tokenized in a way that's consistent with how document text was processed at
indexing time. For instance, if the query is in English, as Magnus points out,
you may want to remove English stopwords, run an English stemmer on terms, or
even limit the search to documents that were detected as being in English at
indexing time.
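
A minimal Python sketch of language-aware query parsing follows. It is a
toy under stated assumptions: the stopword lists are tiny and incomplete,
toy_english_stem() is a crude suffix stripper standing in for a real
stemmer, and parse_query() is a made-up name. It reuses Magnus's example of
"the" being a stopword in English but a searchable term in Swedish.

```python
import re

# Tiny, incomplete stopword lists, for illustration only.
STOPWORDS = {
    "en": {"the", "a", "an", "of", "and"},
    "sv": {"och", "en", "ett", "att"},
}


def toy_english_stem(term):
    """Crude suffix stripping; a stand-in for a real English stemmer."""
    for suffix in ("ing", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


# Languages without an entry are left unstemmed.
STEMMERS = {"en": toy_english_stem}


def parse_query(text, language):
    """Tokenize, drop stopwords and stem according to the query language."""
    tokens = re.findall(r"\w+", text.lower())
    stop = STOPWORDS.get(language, set())
    stem = STEMMERS.get(language, lambda term: term)
    return [stem(token) for token in tokens if token not in stop]


if __name__ == "__main__":
    print(parse_query("Searching the documents", "en"))
    print(parse_query("the", "sv"))
```

The same processing would have to be applied to document text at indexing
time, otherwise query terms and indexed terms would not match.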