simple search api (was Re: mimetype standardisation by testsets)

Mon Nov 27 22:27:40 EET 2006

Hi,

On Fri, 2006-11-24 at 12:25 +0100, Jean-Francois Dockes wrote:
> If we don't find an appropriate established language, I see at least two
> options for a more structured approach:
>  - No query language: use a data structure representing the parsed query tree.
>  - Use an xml-based approach for more structure and extensibility.

I strongly agree with the principle behind both of these.

We use data structures (which happen to be serialized to XML when sent
from client to daemon) to represent our queries at the lowest level.  By
default all query parts are ANDed together, although you can then nest
OR blocks.
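
As a rough illustration only (this is not Beagle's actual wire format),
a query tree of that shape could be sketched in Python like this.  The
class names Term, Or and Query and the XML element names are made up
for the example.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Term:
    """A single query part, such as a word (hypothetical type)."""
    text: str


@dataclass
class Or:
    """A nested block whose children are ORed together (hypothetical type)."""
    parts: List["Part"] = field(default_factory=list)


Part = Union[Term, Or]


@dataclass
class Query:
    """Top-level query: all parts are implicitly ANDed together."""
    parts: List[Part] = field(default_factory=list)


def part_to_xml(part):
    """Convert one query part into an XML element, recursing into OR blocks."""
    if isinstance(part, Term):
        elem = ET.Element("term")
        elem.text = part.text
        return elem
    elem = ET.Element("or")
    for child in part.parts:
        elem.append(part_to_xml(child))
    return elem


def query_to_xml(query):
    """Serialize the whole query tree to an XML string."""
    root = ET.Element("query")
    for part in query.parts:
        root.append(part_to_xml(part))
    return ET.tostring(root, encoding="unicode")


# "budget" AND ("2005" OR "2006")
q = Query([Term("budget"), Or([Term("2005"), Term("2006")])])
xml_text = query_to_xml(q)
```

A client would build the tree directly (or get it from a query-string
parser) and only serialize it at the point where it crosses the
process boundary.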

We do have a query language that is accepted when text is typed in, but
in the end typed queries are converted to the query data structure
before they are processed by the daemon.

> Query language again:
> - Phrases: I see no reason to make phrases unoptionally
>   case-sensitive. Case sensitivity should be an option for any query
>   part. Case-sensitivity is a very expensive proposition for an indexer,
>   and I don't think that Recoll is the only one not supporting it at all
>   (same for diacritic marks by the way).

Case sensitivity is almost never useful and almost never what the user
wants.  Look at any usability study on this.  Beagle is always case
insensitive (as it's handled by the analysis code).
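
To show what "handled by the analysis code" can mean in practice, here
is a minimal sketch, assuming a toy in-memory inverted index.  The
function names are invented for the example; a real analyzer does much
more than split on whitespace.

```python
def analyze(text):
    """Split text into case-folded tokens (a stand-in for a real analyzer)."""
    return [token.casefold() for token in text.split()]


# Toy inverted index: token -> set of document ids.
index = {}


def add_document(doc_id, text):
    """Index a document using the same analyzer that queries will use."""
    for token in analyze(text):
        index.setdefault(token, set()).add(doc_id)


def search(query_text):
    """Return the ids of documents that contain every query token."""
    token_sets = [index.get(token, set()) for token in analyze(query_text)]
    return set.intersection(*token_sets) if token_sets else set()


add_document("doc1", "Quarterly Budget Report")
add_document("doc2", "budget meeting notes")
results = search("BUDGET")
```

Because both indexing and querying go through the same analyzer, the
case of the typed query never matters.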

> API:
> - Documents and files are not the same thing (think email message inside an
>   Inbox, Knotes). Both have their uses on the client side though (document
>   identifier to request a snippet, or a text preview, file to, well, do
>   something with the file). I don't know of a standard way to designate a
>   message inside an mbox file, this is a tricky issue. We can probably see
>   the document identifier as opaque, and interpreted only in the
>   backend. 

As you mention, there is no standard for this or almost anything else
that isn't a file or web resource.  What we've done is use URIs as our
identifier at index time.  On the client side, we pass them by default
to standard handlers (xdg-open, desktop-launch, gnome-open) or pass
specialized URIs to individual programs that understand them.  (For
example, Evolution mails are indexed with the email:/// URI scheme that
only Evo understands.  Ditto Evolution's contact and calendar items.)
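
A minimal sketch of that client-side dispatch, assuming a simple
scheme-to-program table.  The table contents, the function name, the
placeholder URI and the idea that the program takes the URI as its only
argument are all assumptions for illustration, not Beagle's actual
code.

```python
from urllib.parse import urlsplit

# Fallback opener for anything without a specialized handler.
DEFAULT_OPENER = "xdg-open"

# Hypothetical table of URI schemes that only one program understands.
SPECIAL_HANDLERS = {
    "email": "evolution",
}


def command_for(uri):
    """Return the argv list a client could run to open this URI."""
    scheme = urlsplit(uri).scheme
    program = SPECIAL_HANDLERS.get(scheme, DEFAULT_OPENER)
    return [program, uri]


file_cmd = command_for("file:///home/joe/notes.txt")
mail_cmd = command_for("email:///placeholder-message")
```

The resulting list could be handed to something like subprocess.run();
the sketch stops short of launching anything.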

>   The file identifier needs to be visible. Or is there a standard
>   way to separate the File and Subdoc parts in what the draft calls uris ?

Here we've used the URI fragment to indicate these, but they are very
Beagle-specific.  The first part of a multipart email might be
email://joe@bleh/Inbox?uri=1234#0.  Evolution doesn't support opening
attachments directly, so the client UI knows to interpret these and just
open the mail directly.

This is a bit of an issue for archives, because a tarball might contain
yet another tarball, resulting in a URI like
"file:///home/joe/tarball1.tar.gz#junk/tarball2.tar.gz#foo/bar".  I'm
not sure that two fragment parts are valid.
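
For what it's worth, the generic URI grammar in RFC 3986 does not allow
a raw "#" inside a fragment, so a strict parser may reject or mangle
such a URI.  Python's urlsplit happens to treat everything after the
first "#" as the fragment, which makes a nested form easy to pick apart.
The sketch below assumes that convention; the helper name is made up.

```python
from urllib.parse import urlsplit


def split_nested(uri):
    """Split a URI into the container path and a list of nested members.

    Assumes each '#' after the first introduces one more nesting level.
    """
    parts = urlsplit(uri)
    members = parts.fragment.split("#") if parts.fragment else []
    return parts.path, members


container, members = split_nested(
    "file:///home/joe/tarball1.tar.gz#junk/tarball2.tar.gz#foo/bar"
)
```

Here container is the outer tarball's path and members lists the inner
tarball followed by the file inside it.  Percent-encoding the inner "#"
as "%23" would be one way to keep the URI within the grammar.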

> - Using the query string as a query identifier is certainly feasible (ie
>   for repeated calls to Query() with successive offsets), but it somehow
>   doesn't feel right. Shouldn't there be some kind of specific query
>   identifier ? Query strings can be quite big (ie, after expansion by some
>   preprocessor).

Agreed.  Unique D-Bus object paths per-query seem to make more sense to
me.  (And this is what Beagle used back when it used D-Bus.)
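
A sketch of the idea without any real D-Bus plumbing: the daemon hands
out a short, unique path for each query and the client uses that path,
not the query string, for follow-up calls.  The class, the method names
and the base path are invented for the example, and the hit list is
passed in directly in place of a real search.

```python
import itertools


class QueryRegistry:
    """Hands out one unique object path per live query (hypothetical)."""

    def __init__(self, base_path="/org/example/Search/Query"):
        self._base_path = base_path
        self._counter = itertools.count(1)
        self._hits = {}

    def new_query(self, hits):
        """Register a query's hits and return its unique path."""
        path = f"{self._base_path}/{next(self._counter)}"
        self._hits[path] = list(hits)
        return path

    def get_hits(self, path, offset, count):
        """Return one page of hits for a previously registered query."""
        return self._hits[path][offset:offset + count]

    def close(self, path):
        """Forget a query once the client is done with it."""
        del self._hits[path]


registry = QueryRegistry()
first = registry.new_query(["a", "b", "c", "d", "e"])
second = registry.new_query(["x", "y"])
page = registry.get_hits(first, offset=2, count=2)
```

The path stays small no matter how large the expanded query string is,
and closing it gives the daemon a clear point at which to free the
result set.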

Joe