2006/11/24, Jean-Francois Dockes <<a href="mailto:jean-francois.dockes@wanadoo.fr">jean-francois.dockes@wanadoo.fr</a>>:<div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"> Here follow my impressions after reading the Wasabi Draft document. Query language: -------------- I think that the choice of using a google-like language is not good. This kind of query language was primarily designed to be human-typable and has not much else in its favour. If the Wasabi initiative is going to result in a general-purpose tool, enabling search user interfaces of various levels of sophistication to connect to the indexing/search engine of their choice, we need more flexibility. I see no "a priori" necessity why the Wasabi query language should be the same or even close to what a user types. Some front-ends may not have a query language at all (have a gui query builder instead). </blockquote><div> The idea is that the interface be simple. That also means the query language. I don't think every developer in the world should have to speak fluent <insert favorite query language>. Instead use a simpler language, close to what we use every day, that essentially doesn't need any parsing of what the user enters (more on this further down). </div> I think we should standardize on a more advanced language as well in the long term. For this I would like to suggest RDF query, but I really think we should keep that out of a _simple_ interface. <blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"> As I see it, the main problem that the language presents is the lack of separation between data and operators/qualifiers. This results in several nasty consequences, some of which have been noted previously in the discussion: - Lack of national language neutrality - Necessity to use a parser different from the indexing text splitter. 
- Lack of extensibility.</blockquote><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"> The language that is presently described on the draft page also lacks the capability to express simple boolean queries like: (beatles OR lennon) AND (unplugged OR acoustic)</blockquote><div> Allowing brackets like this is a trivial extension, and is also discussed on the wiki (briefly though). </div> <blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"> [I am *not* suggesting the above syntax, this is just for the sake of illustration] Even if you believe that no user will ever want to search for this (I have proof to the contrary), this more or less constrains query expansion (e.g. by thesaurus or other) to happen on the backend side. Also we had "c++" "c#" and ".net", I can't see what would protect us from -acme or +Whoa. To be future-proof, we need a format where search data strings are clearly separated from operators and keywords. </blockquote><div> Searching for "c++" (with the quotation marks) should give the desired result. I see the drawback of having phrase searches being case sensitive here though. Users probably want to see posts matching "c++" OR "C++" (likewise for your two other examples). </div> <blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">There has been a lot of work performed on text search query languages for a long time. 
I believe that it would be interesting to study the possibility of reusing one of the languages which resulted from this long evolution such as CQL (<a href="http://www.loc.gov/standards/sru/cql/index.html">http://www.loc.gov/standards/sru/cql/index.html</a>), or one of the languages related to Z39.50 (e.g., for a description: <a href="http://www.indexdata.dk/yaz/doc/tools.tkl">http://www.indexdata.dk/yaz/doc/tools.tkl</a> ), *or any other choice which has seen some usage beyond user-typed queries in web search engines*.</blockquote><div> Well, the current language proposal is actually a subset of the Lucene query language, and I think it is beyond discussion that the Lucene language is a success. </div> <blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">If we don't find an appropriate established language, I see at least two options for a more structured approach: - No query language: use a data structure representing the parsed query tree. - Use an xml-based approach for more structure and extensibility. I'll give an example for the second idea, but *I don't speak xml at all*, so please, be indulgent: <query type="and">   <query type="phrase" distance="3">let it be</query>   <query type="near" distance="10">blue paper</query>   <query type="or">someword someotherword</query>   <query type="andnot">wall</query> </query> which would result in the following in a boolean language (xapian query language in this case): ((((let PHRASE 3 it PHRASE 3 be) AND (blue NEAR 12 paper) AND (someword OR someotherword)) AND_NOT wall)) For a front-end using a google-like syntax, it should be easy enough to transform "banana moon -recipe" into: <query type="and">   <query type="and">banana moon</query>   <query type="andnot">recipe</query> </query> And the reverse operation should be reasonably trivial too, with help from the omnipresent xml parser library. 
This is just an example structure, maybe there would be an advantage or necessity to separate a top-level <query> and, e.g., <clause> elements, etc. I see several advantages to this approach: - It can quite probably be extended and versioned while retaining some level of compatibility (unstructured query parsers are brittle: add something, break everything). - The search data is clearly separated so that you can use the indexing text processor to extract the search terms (this is important). - There is no parser to write as you can use your preferred xml parser. Things like restriction to some field/switch (<query index="title">), or case sensitivity (case="ignore"), etc. can be easily expressed at any level as attributes. Ok, enough for now, my only hope here is to restart thinking about the query language.</blockquote><div> It seems to me that what you are suggesting is actually the RDF query language (as I mentioned earlier). I think it would be a good idea to have a method in an advanced interface to query using RDF (or anything else we decide upon) - I just think we should leave it out of org.freedesktop.search.simple. </div> <blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">More specific remarks about the current documents: Query language again: - Phrases: I see no reason to make phrases unconditionally case-sensitive. Case sensitivity should be an option for any query part. Case-sensitivity is a very expensive proposition for an indexer, and I don't think that Recoll is the only one not supporting it at all (same for diacritic marks by the way).</blockquote><div> The Spotlight way of doing this is to add a c to the end of the phrase, e.g. "Hello World"c. I dislike using letters like this; I could accept using a symbol to mark a phrase as case sensitive. 
</div> <blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">- wildcards/masking: maybe there should be some kind of option to turn this on/off, but the current language does not make it easy. Or at the very minimum specify \-escaping or such.</blockquote><div> Do you want to turn this off in the search engine? I don't think I understand what you mean... We could have implicit escaping by quoting the string, e.g. "c* algebra". I'm not against escaping of special chars as such, except I don't think it is standard in Lucene... </div> <blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"> - There is some discussion on the page of choosing attribute names and aliases to suit the habits of such and such tool. I don't think that this is the right approach: better choose a well defined set of attributes, and let the front-ends do the translation (and define a mechanism for extensibility too).</blockquote><div> There will be a specific set of mandatory attributes/fields/switches, and backends are free to implement any number of custom switches - how is that not extensible? When we get to define an "advanced" interface these switches could be introspectable. It will be the same case for custom groups (arguments to the group switch - such as "contacts", "email", "files" etc.). A predefined set of mandatory names, with options for introspecting any additional fields via an advanced search interface. </div> I really *really* think we should focus on the simple dbus interface for now. <blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"> - There must be some provision to control stemming. 
Again, something that would be easy to do in a structured language, or already provided for in one of the existing ones.</blockquote><div> Assuming we are talking about the search language: is a "language" switch not enough? "øl the language:danish" would match posts with "øl" and "the" while "øl the language:english" would match e.g. "ol" and discard "the" as a stop word. </div> <blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">API: - Documents and files are not the same thing (think email message inside an Inbox, Knotes). Both have their uses on the client side though (document identifier to request a snippet, or a text preview, file to, well, do something with the file). I don't know of a standard way to designate a message inside an mbox file, this is a tricky issue. We can probably see the document identifier as opaque, and interpreted only in the backend. The file identifier needs to be visible. Or is there a standard way to separate the File and Subdoc parts in what the draft calls uris?</blockquote><div> All indexable objects have a unique URI, that is kind of an unspoken premise of the current draft. I don't think we should have a standard way to point at an attachment inside an email. Evolution (the mail client) uses one kind of uris and I expect KMail to use another - I don't find it realistic that we expect them to change that. The only assumption I think we should make is that indexed objects, be they conversations, emails, notes, etc., should be uniquely determined by their uri. </div> <blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">- Using the query string as a query identifier is certainly feasible (e.g. for repeated calls to Query() with successive offsets), but it somehow doesn't feel right. Shouldn't there be some kind of specific query identifier? 
Query strings can be quite big (e.g., after expansion by some preprocessor).</blockquote><div> That is a good point; this more or less implies the need for a server side Query object... This requires a fair amount of logic to be imposed on the backends though (session awareness), and I'm not keen on requiring any more than an interface. </div> <blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">This is a very long message. If you're still with me, thank you.</blockquote><div> No, thank you :-) Cheers, Mikkel </div> </div>
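For illustration only: the front-end translation described in the quoted message (turning "banana moon -recipe" into the nested query structure) could be sketched in Python roughly as follows. The function name simple_to_xml and the rule that a leading "-" marks an excluded term are assumptions made for this sketch and are not part of the Wasabi draft; the element name "query" and the "type" attribute values follow the example given in the quoted message.

```python
import xml.etree.ElementTree as ET


def simple_to_xml(user_query):
    """Translate a google-like query string into the nested query sketch."""
    wanted, excluded = [], []
    for token in user_query.split():
        # Assumption: a leading '-' marks an excluded term;
        # a bare '-' is kept as ordinary search data.
        if token.startswith("-") and token != "-":
            excluded.append(token[1:])
        else:
            wanted.append(token)

    # Top-level AND node, as in the quoted example.
    root = ET.Element("query", {"type": "and"})
    if wanted:
        ET.SubElement(root, "query", {"type": "and"}).text = " ".join(wanted)
    for term in excluded:
        ET.SubElement(root, "query", {"type": "andnot"}).text = term

    # encoding="unicode" returns a str rather than bytes.
    return ET.tostring(root, encoding="unicode")


print(simple_to_xml("banana moon -recipe"))
```

Because the search terms end up as element text, the XML library escapes them on output, so user data cannot be mistaken for an operator or a keyword.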