simple search api (was Re: mimetype standardisation by testsets)

Fri Nov 24 15:39:37 EET 2006

On Fri, 24 Nov 2006 12:25:41 +0100
Jean-Francois Dockes <jean-francois.dockes at wanadoo.fr> wrote:

> Here follow my impressions after reading the Wasabi Draft document.
> 
> Query language:
> --------------
> 
> I think that the choice of using a google-like language is not good.
> This kind of query language was primarily designed to be
> human-typable and has not much else in its favour. If the Wasabi
> initiative is going to result in a general-purpose tool, enabling
> search user interfaces of various levels of sophistication to connect
> to the indexing/search engine of their choice, we need more
> flexibility.
> 
> I see no "a priori" necessity why the Wasabi query language should be
> the same or even close to what a user types. Some front-ends may not
> have a query language at all (have a gui query builder instead).
> 
> As I see it, the main problem that the language presents is the lack
> of separation between data and operators/qualifiers. This results in
> several nasty consequences, some of which have been noted previously
> in the discussion: 
>  - Lack of national language neutrality
>  - Necessity to use a parser different from the indexing text
> splitter.
>  - Lack of extensibility.
> 
> The language that is presently described on the draft page also lacks
> the capability to express simple boolean queries like :
> 
>    (beatles OR lennon) AND (unplugged OR acoustic)
> 
>   [I am *not* suggesting the above syntax, this is just for the sake
> of illustration]
> 
> Even if you believe that no user will ever want to search for this (I
> have proof to the contrary), this more or less constrains query
> expansion (i.e. by thesaurus or other) to happen on the backend side.
> 
> Also we had "c++" "c#" and ".net", I can't see what would protect us
> from -acme or +Whoa. To be future-proof, we need a format where
> search data strings are clearly separated from operators and keywords.
> 
> There has been a lot of work performed on text search query languages
> for a long time. I believe that it would be interesting to study the
> possibility of reusing one of the languages which resulted from this
> long evolution, such as CQL
> (http://www.loc.gov/standards/sru/cql/index.html), or one of the
> languages related to Z39.50 (for a description, see
> http://www.indexdata.dk/yaz/doc/tools.tkl), *or any other choice
> which has seen some usage beyond user-typed queries in web search
> engines*.
> 
> If we don't find an appropriate established language, I see at least
> two options for a more structured approach:
>  - No query language: use a data structure representing the parsed
> query tree.
>  - Use an xml-based approach for more structure and extensibility.
> 
> I'll give an example for the second idea, but *I don't speak xml at
> all*, so please, be indulgent:
> 
> <query type="and">
>   <query type="phrase" distance="3">let it be</query>
>   <query type="near" distance="10">blue paper</query>
>   <query type="or">someword someotherword</query>
>   <query type="andnot">wall</query>
> </query>
> 
> which would result into the following in a boolean language (xapian
> query language in this case):
> 
> ((((let PHRASE 3 it PHRASE 3 be) AND (blue NEAR 10 paper) AND
> (someword OR someotherword)) AND_NOT wall)) 
> 
> For a front-end using a google-like syntax, it should be easy enough
> to transform "banana moon -recipe" into:
> 
> <query type="and">
>   <query type="and">banana moon</query>
>   <query type="andnot">recipe</query>
> </query>
>
> And the reverse operation should be reasonably trivial too, with help
> from the omnipresent xml parser library.
> 
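A minimal sketch of that round trip in Python, using the standard xml.etree.ElementTree module. The function names are made up for illustration, and only bare words and a leading '-' are handled; the element and attribute names follow the example above, not any agreed Wasabi format.

```python
import xml.etree.ElementTree as ET


def simple_query_to_xml(text: str) -> ET.Element:
    """Turn 'banana moon -recipe' into a nested <query> tree.

    Only bare words and a leading '-' (exclusion) are handled; quoting,
    '+' and field prefixes are deliberately left out of this sketch.
    """
    wanted, excluded = [], []
    for token in text.split():
        if token.startswith("-") and len(token) > 1:
            excluded.append(token[1:])
        else:
            wanted.append(token)

    root = ET.Element("query", {"type": "and"})
    if wanted:
        ET.SubElement(root, "query", {"type": "and"}).text = " ".join(wanted)
    for word in excluded:
        ET.SubElement(root, "query", {"type": "andnot"}).text = word
    return root


def xml_to_simple_query(root: ET.Element) -> str:
    """Reverse operation for trees produced by simple_query_to_xml()."""
    parts = []
    for clause in root:
        words = (clause.text or "").split()
        prefix = "-" if clause.get("type") == "andnot" else ""
        parts.extend(prefix + word for word in words)
    return " ".join(parts)


root = simple_query_to_xml("banana moon -recipe")
print(ET.tostring(root, encoding="unicode"))
print(xml_to_simple_query(root))
```

Because the search words end up as element text, a term such as "-acme" can be carried as plain data by a front-end that builds the tree directly instead of going through the google-like syntax.
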
> This is just an example structure; maybe there would be an advantage
> or necessity to separate a top-level <query> and, e.g., <clause>
> elements, etc.
> 
> I see several advantages to this approach:
> - It can quite probably be extended and versioned while retaining some
>   level of compatibility (unstructured query parsers are brittle: add
>   something, break everything).
> - The search data is clearly separated so that you can use the
> indexing text processor to extract the search terms (this is
> important).
> - There is no parser to write as you can use your preferred xml
> parser.
> 
> Things like restriction to some field/switch (<query index="title">),
> or case sensitivity (case="ignore"), etc. can be easily expressed at
> any level as attributes.
> 
> Ok, enough for now, my only hope here is to restart thinking about the
> query language. 

I agree with everything above. By the way, it might also be useful
to be able to add weight to the sub-queries.
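
To make the idea concrete, here is a rough Python sketch of a parsed query tree in which every node carries a weight. The QueryNode class, its field names and the "^weight" suffix in the rendered string are all assumptions made for illustration; none of them come from the draft or from any existing engine.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QueryNode:
    """One node of a parsed query tree (hypothetical structure)."""
    op: str                                          # "and", "or", "phrase", "near"
    terms: List[str] = field(default_factory=list)   # search data only
    children: List["QueryNode"] = field(default_factory=list)
    weight: float = 1.0                              # relative importance
    distance: int = 0                                # used by "phrase" / "near"


def render(node: QueryNode) -> str:
    """Render the tree as a boolean-style string for inspection."""
    if node.op in ("phrase", "near"):
        sep = f" {node.op.upper()} {node.distance} "
    else:
        sep = f" {node.op.upper()} "
    parts = list(node.terms) + [render(child) for child in node.children]
    text = "(" + sep.join(parts) + ")"
    if node.weight != 1.0:
        # Made-up notation: mark non-default weights with a "^" suffix.
        text += f"^{node.weight:g}"
    return text


query = QueryNode("and", children=[
    QueryNode("phrase", ["let", "it", "be"], distance=3, weight=2.0),
    QueryNode("or", ["someword", "someotherword"]),
])
print(render(query))
```

In the xml form the same thing would presumably be just another attribute on the sub-query element.
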

> More specific remarks about the current documents:
> 
> Query language again:
> - Phrases: I see no reason to make phrases unconditionally
>   case-sensitive. Case sensitivity should be an option for any query
>   part. Case-sensitivity is a very expensive proposition for an
> indexer, and I don't think that Recoll is the only one not supporting
> it at all (same for diacritic marks by the way).
> - wildcards/masking: maybe there should be some kind of option to
> turn this on/off, but the current language does not make it easy. Or
> at the very minimum specify \-escaping or such.
> - There is some discussion on the page of choosing attribute names and
>   aliases to suit the habits of such and such tool. I don't think
> that this is the right approach: better choose a well defined set of
> attributes, and let the front-ends do the translation (and define a
> mechanism for extensibility too).
> - There must be some provision to control stemming. Again, something
> that would be easy to do in a structured language, or already
> provided for in one of the existing ones.
> 
> API:
> - Documents and files are not the same thing (think email message
> inside an Inbox, Knotes). Both have their uses on the client side
> though (document identifier to request a snippet, or a text preview,
> file to, well, do something with the file). I don't know of a
> standard way to designate a message inside an mbox file, this is a
> tricky issue. We can probably see the document identifier as opaque,
> and interpreted only in the backend. The file identifier needs to be
> visible. Or is there a standard way to separate the File and Subdoc
> parts in what the draft calls uris ?

Are you also thinking about the problem of presenting the right
(virtual) document to the user? Having an opaque identifier for the
document is a good idea. But this requires that the backend also knows
how to create something the user can view out of this identifier,
otherwise it's not of much use. The indexer always has some kind of
filter which at least turns the document into a stream of words. Are
these thoughts perhaps beyond the scope of the problem being discussed?
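
As a rough illustration of what is meant, here is a toy Python sketch. The Hit and Backend names, the preview() method and the in-memory storage are assumptions for the sake of the example, not part of the draft API.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Hit:
    file_uri: str   # visible to the client: the container file
    doc_id: str     # opaque token: only the backend interprets it


class Backend:
    """Toy backend holding already-filtered document text in memory."""

    def __init__(self) -> None:
        self._texts: Dict[str, str] = {}

    def add(self, file_uri: str, doc_id: str, text: str) -> Hit:
        """Index one (sub)document and return the hit describing it."""
        self._texts[doc_id] = text
        return Hit(file_uri, doc_id)

    def preview(self, hit: Hit, max_chars: int = 40) -> str:
        """Return viewable text for a hit, using the opaque identifier."""
        return self._texts[hit.doc_id][:max_chars]


backend = Backend()
hit = backend.add("file:///home/me/Mail/Inbox", "msg-0017",
                  "Hello, here is the draft you asked for.")
print(hit.file_uri)
print(backend.preview(hit, 5))
```

The client can open the file itself through file_uri, but has to go back to the backend with doc_id to get at the message inside it.
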

> - Using the query string as a query identifier is certainly feasible
> (ie for repeated calls to Query() with successive offsets), but it
> somehow doesn't feel right. Shouldn't there be some kind of specific
> query identifier ? Query strings can be quite big (ie, after
> expansion by some preprocessor).

As I wrote before, I think it's a good idea to have a search object. The
search represents a running/finished search and is created when the
search is started (by submitting the query). This is as opposed to a
query object, which usually refers to a compiled query that might not
have been submitted yet.
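
A small Python sketch of the distinction, under the assumption of a toy in-memory engine. The Query, Search and submit names, the integer id and the substring matching are invented for illustration only.

```python
import itertools
from dataclasses import dataclass
from typing import List

_search_ids = itertools.count(1)


@dataclass(frozen=True)
class Query:
    """A compiled query; it may never be submitted."""
    text: str


class Search:
    """A running or finished search, created when a query is submitted."""

    def __init__(self, query: Query, results: List[str]) -> None:
        self.id = next(_search_ids)   # short handle, independent of query size
        self.query = query
        self._results = results

    def page(self, offset: int, count: int) -> List[str]:
        """Return one window of results for successive calls."""
        return self._results[offset:offset + count]


def submit(query: Query, corpus: List[str]) -> Search:
    """Toy engine: a document matches if it contains every query word."""
    words = query.text.lower().split()
    hits = [doc for doc in corpus if all(w in doc.lower() for w in words)]
    return Search(query, hits)


corpus = ["Banana moon", "banana bread recipe", "Blue moon banana"]
search = submit(Query("banana moon"), corpus)
print(search.id, search.page(0, 1), search.page(1, 1))
```

Successive result pages are then requested through the search object (or its id) rather than by resending a possibly very large query string.
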