simple search api (was Re: mimetype standardisation by testsets)

Mon Nov 27 18:25:16 EET 2006

On Fri, 24 Nov 2006 15:28:18 +0100
Jean-Francois Dockes <jean-francois.dockes at wanadoo.fr> wrote:

> Magnus Bergman writes:
>  > On Fri, 24 Nov 2006 12:25:41 +0100
>  > Jean-Francois Dockes <jean-francois.dockes at wanadoo.fr> wrote:
>  > >
>  > > [lots of query language ramblings]
>  > 
>  > I agree with everything above. By the way, it might also be useful
>  > to be able to add weight to the sub-queries.
> 
> Something easily done with an xml attribute I believe, which could
> just as easily be ignored by a backend which would not support it.

Yes.
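
A minimal sketch of that idea, assuming a made-up XML query layout
(the element names "query", "and", "term" and the attribute "weight"
are illustrative and not taken from the draft). One backend reads the
weight, the other never looks at it:

```python
import xml.etree.ElementTree as ET

# Hypothetical query document; element and attribute names are
# assumptions made for this example only.
QUERY_XML = """
<query>
  <and>
    <term weight="2.0">gstreamer</term>
    <term>indexer</term>
  </and>
</query>
"""


def terms_with_weights(xml_text, default_weight=1.0):
    """Backend that supports weights: return (term, weight) pairs."""
    root = ET.fromstring(xml_text)
    return [
        (term.text, float(term.get("weight", default_weight)))
        for term in root.iter("term")
    ]


def terms_only(xml_text):
    """Backend without weight support: the attribute is never read."""
    root = ET.fromstring(xml_text)
    return [term.text for term in root.iter("term")]
```

Both functions accept the same query text, so a client would not need
to know whether the backend honours weights.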

>  > > - Documents and files are not the same thing (think email message
>  > > inside an Inbox, Knotes). Both have their uses on the client side
>  > > though (document identifier to request a snippet, or a text
>  > > preview, file to, well, do something with the file). I don't
>  > > know of a standard way to designate a message inside an mbox
>  > > file, this is a tricky issue. We can probably see the document
>  > > identifier as opaque, and interpreted only in the backend. The
>  > > file identifier needs to be visible. Or is there a standard way
>  > > to separate the File and Subdoc parts in what the draft calls
>  > > uris ?
>  > 
>  > Are you also thinking the problem of presenting the right (virtual)
>  > document to the user? Having an opaque identifier for the document
>  > is a good idea. But this requires that the backend also knows
>  > about how to create something the user can view out of this
>  > identifier, otherwise it's not of much use. The indexer always has
>  > some kind of filter which at least turns the document into a
>  > stream of words. Are these thoughts perhaps beyond the scope of
>  > problem discussed?
> 
> Yes, in my view, we need two pieces of information about documents
> like email messages:
>  - An opaque handle (currently called uri in the spec), which the
> client can use to request things from the backend (such as a pure
> text preview, or snippets or whatever attributes).

If this is going to be supported, it needs some kind of streaming
framework alongside the search engine (except in the cases where the
search engine has this feature integrated). I'm experimenting with
using GStreamer for this. Let's say the opaque handle is
"man-page:man(1)" and the user (application) wants to view it as, say,
HTML. GStreamer can then handle this, provided you install plugins
for fetching man pages and for converting troff to HTML. This feature is
also required for the indexer in order to get the documents and to
extract the text from them (for example, to index the lyrics in
MIDI files). I suspect that what I just wrote will appear very scary
to most people reading it. =)

(A streaming engine can also be used to highlight the words causing the
hit, but this is much more complicated than it seems at first.)
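
To make the handle-to-view idea less scary, here is a rough sketch in
plain Python. It does not use GStreamer at all; the two dictionaries
stand in for installed plugins, and every name in it (FETCHERS,
CONVERTERS, resolve, the canned man page, the naive troff "converter")
is an assumption made for illustration:

```python
import html

# Hypothetical plugin registries (stand-ins for installed plugins).
FETCHERS = {}    # handle scheme -> function(rest) -> (type, data)
CONVERTERS = {}  # (source type, target type) -> function(data) -> data


def fetch_man_page(name):
    # Stand-in fetcher: returns canned troff source instead of
    # reading a real man page from disk.
    return "text/troff", ".TH MAN 1\n.SH NAME\nman \\- display manual pages\n"


def troff_to_html(data):
    # Deliberately naive "conversion": escape the source and wrap it.
    return "<pre>" + html.escape(data) + "</pre>"


FETCHERS["man-page"] = fetch_man_page
CONVERTERS[("text/troff", "text/html")] = troff_to_html


def resolve(handle, wanted_type):
    """Turn an opaque handle into data of the requested type."""
    scheme, _, rest = handle.partition(":")
    source_type, data = FETCHERS[scheme](rest)
    if source_type != wanted_type:
        data = CONVERTERS[(source_type, wanted_type)](data)
    return data


page = resolve("man-page:man(1)", "text/html")
```

The indexer could call the same resolve() with a plain-text target
type, which is why the feature is shared between viewing and indexing.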

>  - A file name which may be of use to some clients. I'm not so
> completely sure it's strictly needed, but I'm also not comfortable
> deciding that it's not. Maybe this could be one of the retrievable
> attributes actually (out of GetProperties()). This would solve the
> issue inside the current interface definition, and ok with me.

My guess is that this is what will get the most votes, because it's
much easier to implement.
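
A sketch of why this is the easy option, assuming the file name is
just one more retrievable property. The function name get_properties,
the handle "mbox-msg:0001" and the property names are hypothetical
and are not the draft's actual GetProperties() signature:

```python
# Hypothetical backend-side table: opaque handle -> known properties.
_PROPERTIES = {
    "mbox-msg:0001": {
        "file": "/home/user/Mail/Inbox",
        "mimetype": "message/rfc822",
        "title": "simple search api",
    },
}


def get_properties(handle, names):
    """Return the requested properties the backend knows for a handle.

    Unknown handles or property names are silently left out, so a
    backend that cannot supply a file name just omits it.
    """
    known = _PROPERTIES.get(handle, {})
    return {name: known[name] for name in names if name in known}
```

The client never parses the handle; it only asks for "file" when it
wants to do something with the containing file.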

>  > > - Using the query string as a query identifier is certainly
>  > > feasible (ie for repeated calls to Query() with successive
>  > > offsets), but it somehow doesn't feel right. Shouldn't there be
>  > > some kind of specific query identifier ? Query strings can be
>  > > quite big (ie, after expansion by some preprocessor).
>  > 
>  > As I wrote before I think it's a good idea to have a search
>  > object. The search represents a running/finished search and is
>  > created when the search is started (by submitting the query). As
>  > opposed to a query object which usually refers to a compiled query
>  > that might not have been submitted yet.
> 
> Yes, I think there will be query and search objects (your terms) in
> any reasonable back-end implementation. The question is how visible
> they will be on the client side.

If there is a search object (which might just be a unique number) on
the client side, there is no need for the client to keep the query
around. The search object can then be used to get information about
the search (hits found so far) and anything related to the hits. The
other option (as I see it) would be to use the query string for this,
which I think is a little awkward.
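
A minimal sketch of the search-object approach, assuming the search
object is just an integer handed out by the backend. The class and
method names (Backend, start_search, hit_count, get_hits,
close_search) are made up for this example, and the matching is a
trivial all-words substring test:

```python
import itertools


class Backend:
    """Toy in-memory backend; not any real search engine's API."""

    def __init__(self, documents):
        self._documents = documents      # opaque handle -> text
        self._searches = {}              # search id -> list of handles
        self._ids = itertools.count(1)   # source of unique search ids

    def start_search(self, query):
        # Submit the query once; the client keeps only the returned id.
        words = query.lower().split()
        hits = [
            handle
            for handle, text in self._documents.items()
            if all(word in text.lower() for word in words)
        ]
        search_id = next(self._ids)
        self._searches[search_id] = hits
        return search_id

    def hit_count(self, search_id):
        # Hits found so far for this search.
        return len(self._searches[search_id])

    def get_hits(self, search_id, offset, count):
        # Page through results without resending the query string.
        return self._searches[search_id][offset:offset + count]

    def close_search(self, search_id):
        # Let the backend free whatever it kept for this search.
        del self._searches[search_id]


backend = Backend({
    "doc:1": "Simple search API draft",
    "doc:2": "Search engines and indexers",
    "doc:3": "Unrelated notes",
})
search = backend.start_search("search")
first_page = backend.get_hits(search, 0, 1)
second_page = backend.get_hits(search, 1, 1)
```

Successive calls with different offsets only carry the small id, no
matter how large the expanded query was.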

> On the other hand (this refers also to one of your previous remarks),
> a desktop search framework does have different issues compared to a
> shared one: the index will typically be smaller and the available
> resources much greater, so that it may be just acceptable to run a
> query again and again, and count on the system cache to have the
> appropriate disk blocks in memory...

Yes, you are probably right about that. But on the other hand, there
seem to be concerns about speed anyway. And I think there is no
reason to create slow designs just because (most) people have really
fast computers. Things should be as fast as possible, as long as the
design is clean and doesn't degenerate into hackish API horrors (and I
have seen the worst of that in a few commercial products).

But I think the most important thing is that when a standard is
created, it is good enough not to hinder any particular approach. If a
library API is defined, it can be used as a wrapper around some search
engine, or it could be used to communicate with a daemon (perhaps
using D-Bus), and everybody can have it their way. (My secret hope is
that this will result in something that supersedes my implementation,
so I can ditch it and work on this one instead.)