2006/11/24, Jean-Francois Dockes <<a href="mailto:jean-francois.dockes@wanadoo.fr">jean-francois.dockes@wanadoo.fr</a>>:<div><span class="gmail_quote"></span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Here follow my impressions after reading the Wasabi Draft document.<br><br>Query language:<br>--------------<br><br>I think that the choice of using a google-like language is not good. This<br>kind of query language was primarily designed to be human-typable and has
<br>not much else in its favour. If the Wasabi initiative is going to result<br>in a general-purpose tool, enabling search user interfaces of various levels<br>of sophistication to connect to the indexing/search engine of their choice,
<br>we need more flexibility.<br><br>I see no "a priori" necessity why the Wasabi query language should be the<br>same or even close to what a user types. Some front-ends may not have a<br>query language at all (have a gui query builder instead).
</blockquote><div><br>The idea is that the interface be simple. That also means the query language. I don't think every developer in the world should have to speak fluent <insert favorite query language>. Instead, use a simpler language, close to what we use every day, that essentially doesn't need any parsing of what the user enters (more on this further down).
<br></div><br>In the long term, I think we should standardize on a more advanced language as well. For this I would like to suggest RDF query, but I really think we should keep that out of a _simple_ interface.<br><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
As I see it, the main problem that the language presents is the lack of<br>separation between data and operators/qualifiers. This results in several<br>nasty consequences, some of which have been noted previously in the<br>
discussion:<br> - Lack of national language neutrality<br> - Necessity to use a parser different from the indexing text splitter.<br> - Lack of extensibility.</blockquote><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
The language that is presently described on the draft page also lacks the<br>capability to express simple boolean queries like :<br><br> (beatles OR lennon) AND (unplugged OR acoustic)</blockquote><div><br>Allowing brackets like this is a trivial extension - and is also discussed on the wiki (briefly, though).
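<br><br>To give a feeling for how small that extension is, here is a rough sketch in Python of a parser for bracketed boolean queries. It is only an illustration, not part of the draft: the function name parse_query, the tuple output format, and the rule that AND binds tighter than OR are all my assumptions, and it only handles explicit AND/OR operators (no phrases, switches or implicit AND).<br>
<pre>
```python
import re


def parse_query(text):
    """Parse e.g. 'a OR (b AND c)' into nested tuples: ("OR", "a", ("AND", "b", "c")).

    Hypothetical sketch: AND is assumed to bind tighter than OR.
    """
    # Tokens are single parentheses or runs of other non-space characters.
    tokens = re.findall(r"[()]|[^\s()]+", text)
    tokens.append(None)  # end-of-input sentinel
    pos = 0

    def parse_or():
        nonlocal pos
        node = parse_and()
        while tokens[pos] == "OR":
            pos += 1
            node = ("OR", node, parse_and())
        return node

    def parse_and():
        nonlocal pos
        node = parse_atom()
        while tokens[pos] == "AND":
            pos += 1
            node = ("AND", node, parse_atom())
        return node

    def parse_atom():
        nonlocal pos
        token = tokens[pos]
        if token is None or token in (")", "AND", "OR"):
            raise ValueError("expected a term or '(' at token %d" % pos)
        pos += 1
        if token == "(":
            node = parse_or()
            if tokens[pos] != ")":
                raise ValueError("missing ')'")
            pos += 1
            return node
        return token  # a plain search term

    tree = parse_or()
    if tokens[pos] is not None:
        raise ValueError("unexpected token %r" % tokens[pos])
    return tree


print(parse_query("(beatles OR lennon) AND (unplugged OR acoustic)"))
```
</pre>
The brackets only add the one branch in parse_atom; everything else would be needed for plain AND/OR anyway.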
<br> </div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"> [I am *not* suggesting the above syntax, this is just for the sake of<br> illustration]
<br><br>Even if you believe that no user will ever want to search for this (I have<br>proof to the contrary), this more or less constrains query expansion (ie: by<br>thesaurus or other) to happen on the backend side.<br><br>
Also we had "c++" "c#" and ".net", I can't see what would protect us from<br>-acme or +Whoa. To be future-proof, we need a format where search data<br>strings are clearly separated from operators and keywords.
</blockquote><div><br>Searching for "c++" (with the quotation marks) should give the desired result. I do see the drawback of making phrase searches case sensitive here, though: users probably want to see posts matching "c++" OR "C++" (likewise for your two other examples).
<br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">There has been a lot of work performed on text search query languages for a<br>
long time. I believe that it would be interesting to study the possibility<br>of reusing one of the languages which resulted from this long evolution<br>such as CQL (<a href="http://www.loc.gov/standards/sru/cql/index.html">
http://www.loc.gov/standards/sru/cql/index.html</a>), or one of<br>the languages related to z39-50 (ie, for a description:<br><a href="http://www.indexdata.dk/yaz/doc/tools.tkl">http://www.indexdata.dk/yaz/doc/tools.tkl</a>
), *or any other choice which has<br>seen some usage beyond user-typed queries in web search engines*.</blockquote><div><br>Well, the current language proposal is actually a subset of the Lucene query language, and I think it is beyond discussion that the Lucene language is a success.
<br><br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">If we don't find an appropriate established language, I see at least two<br>
options for a more structured approach:<br> - No query language: use a data structure representing the parsed query tree.<br> - Use an xml-based approach for more structure and extensibility.<br><br>I'll give an example for the second idea, but *I don't speak xml at all*,
<br>so please, be indulgent:<br><br><query type="and"><br> <query type="phrase" distance="3">let it be</query><br> <query type="near" distance="10">blue paper</query>
<br> <query type="or">someword someotherword</query><br> <query type="andnot">wall</query><br></query><br><br>which would result in the following in a boolean language (xapian query
<br>language in this case):<br><br>((((let PHRASE 3 it PHRASE 3 be) AND (blue NEAR 10 paper) AND (someword OR<br> someotherword)) AND_NOT wall))<br><br>For a front-end using a google-like syntax, it should be easy enough to
<br>transform "banana moon -recipe" into:<br><br><query type="and"><br> <query type="and">banana moon</query><br> <query type="andnot">recipe</query>
<br></query><br><br>And the reverse operation should be reasonably trivial too, with help from<br>the omnipresent xml parser library.<br><br>This is just an example structure, maybe there would be an advantage or<br>
necessity to separate a top-level <query> and ie, <clause> elements,<br>etc.<br><br>I see several advantages to this approach:<br>- It can quite probably be extended and versioned while retaining some<br> level of compatibility (unstructured query parsers are brittle: add
<br> something, break everything).<br>- The search data is clearly separated so that you can use the indexing<br> text processor to extract the search terms (this is important).<br>- There is no parser to write as you can use your preferred xml parser.
<br><br>Things like restriction to some field/switch (<query index="title">), or<br>case sensitivity (case="ignore"), etc. can be easily expressed at any level<br>as attributes.<br><br>Ok, enough for now, my only hope here is to restart thinking about the
<br>query language.</blockquote><div><br><br>It seems to me that what you are suggesting is actually the RDF query language (as I mentioned earlier). <br>I think it would be a good idea to have a method in an advanced interface to query using RDF (or anything else we decide upon) - I just think we should leave it out of
org.freedesktop.search.simple.<br> </div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">More specific remarks about the current documents:
<br><br>Query language again:<br>- Phrases: I see no reason to make phrases unconditionally<br> case-sensitive. Case sensitivity should be an option for any query<br> part. Case-sensitivity is a very expensive proposition for an indexer,
<br> and I don't think that Recoll is the only one not supporting it at all<br> (same for diacritic marks by the way).</blockquote><div><br><br>The Spotlight way of doing this is to append a c to the phrase, e.g. "Hello World"c. I dislike using letters like this, but I could accept using a symbol to mark a phrase as case sensitive.
<br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">- wildcards/masking: maybe there should be some kind of option to turn this<br>
on/off, but the current language does not make it easy. Or at the very<br> minimum specify \-escaping or such.</blockquote><div><br>Do you want to turn this off in the search engine? I don't think I understand what you mean...
<br>We could have implicit escaping by quoting the string, for example "c* algebra". I'm not against escaping of special chars as such, except that I don't think it is standard in Lucene...<br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
- There is some discussion on the page of choosing attribute names and<br> aliases to suit the habits of such and such tool. I don't think that this<br> is the right approach: better choose a well defined set of attributes,
<br> and let the front-ends do the translation (and define a mechanism for<br> extensibility too).</blockquote><div><br>There will be a specific set of mandatory attribute/fields/switches, and backends are free to implement any number of custom switches - how is that not extensible? When we get to define an "advanced" interface these switches could be introspectable.
<br><br>The same goes for custom groups (arguments to the group switch, such as "contacts", "email", "files", etc.): a predefined set of mandatory names, with options for introspecting any additional fields via an advanced search interface.
<br></div><br>I really *really* think we should focus on the simple dbus interface for now.<br><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
- There must be some provision to control stemming. Again, something that<br> would be easy to do in a structured language, or already provided for in<br> one of the existing ones.</blockquote><div><br>Assuming we are talking about the search language, is a "language" switch not enough? "øl the language:danish" would match posts with "øl" and "the", while "øl the language:english" would match e.g. "ol" and discard "the" as a stop word.
<br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">API:<br>- Documents and files are not the same thing (think email message inside an
<br> Inbox, Knotes). Both have their uses on the client side though (document<br> identifier to request a snippet, or a text preview, file to, well, do<br> something with the file). I don't know of a standard way to designate a
<br> message inside an mbox file, this is a tricky issue. We can probably see<br> the document identifier as opaque, and interpreted only in the<br> backend. The file identifier needs to be visible. Or is there a standard
<br> way to separate the File and Subdoc parts in what the draft calls uris ?</blockquote><div><br>All indexable objects have a unique URI; that is an unspoken premise of the current draft. I don't think we should have a standard way to point at an attachment inside an email. Evolution (the mail client) uses one kind of URI and I expect KMail to use another - I don't find it realistic to expect them to change that. The only assumption I think we should make is that indexed objects, be they conversations, emails, notes, etc., are uniquely identified by their URI.
<br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">- Using the query string as a query identifier is certainly feasible (ie<br> for repeated calls to Query() with successive offsets), but it somehow
<br> doesn't feel right. Shouldn't there be some kind of specific query<br> identifier ? Query strings can be quite big (ie, after expansion by some<br> preprocessor).</blockquote><div><br>That is a good point; it more or less implies the need for a server-side Query object... This would impose a fair amount of logic on the backends though (session awareness), and I'm not keen on requiring anything more than an interface.
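<br><br>To make that cost concrete, here is a rough sketch in Python of the kind of state a backend would have to keep if queries were addressed by a short identifier instead of by the query string. Everything in it is hypothetical (the QueryRegistry class, the new_query, get_hits and close_query methods, and the search callable are names I made up for illustration, not anything from the draft).<br>
<pre>
```python
import itertools


class QueryRegistry:
    """Hypothetical backend-side table mapping short query ids to results."""

    def __init__(self, search):
        # search: any callable taking a query string and returning a list of hit URIs.
        self._search = search
        self._ids = itertools.count(1)
        self._results = {}

    def new_query(self, query_string):
        # The (possibly very long) query string is sent once; the client
        # keeps only the small integer id for later calls.
        query_id = next(self._ids)
        self._results[query_id] = self._search(query_string)
        return query_id

    def get_hits(self, query_id, offset, limit):
        # Paging by offset, addressed by id rather than by query string.
        return self._results[query_id][offset:offset + limit]

    def close_query(self, query_id):
        # The backend must be told (or must guess) when state can be dropped.
        self._results.pop(query_id, None)


# Toy backend that ignores the query and always returns the same five hits.
registry = QueryRegistry(lambda query: ["uri:%d" % n for n in range(5)])
qid = registry.new_query("banana moon -recipe")
print(registry.get_hits(qid, 0, 2))
print(registry.get_hits(qid, 2, 2))
registry.close_query(qid)
```
</pre>
The _results table and close_query are exactly the session awareness I would rather not require from every backend.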
<br></div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">This is a very long message. If you're still with me, thank you.</blockquote><div>
<br>No, thank you :-)<br><br>Cheers,<br>Mikkel <br></div><br></div><br>