[Xesam] Xesam pending changes for RC2

Mikkel Kamstrup Erlandsen mikkel.kamstrup at gmail.com
Thu Dec 13 13:14:20 PST 2007


On 13/12/2007, Jaap Karssenberg <j.g.karssenberg at alumnus.utwente.nl> wrote:
> Hi,
>
> Below a proposal for a grammar for the Xesam User Search Language and
> some details Mikkel and I discussed in private mails. Please comment on
> the proposed implementation details and the discussion items.

It should be noted that neither of us is adept at *BNF grammars, so
this should go through careful review before we release it.

> EBNF style grammar:
>
> (* Xesam User Search Language EBNF *)
> (* http://en.wikipedia.org/wiki/Extended_Backus-Naur_form *)
>
> space = SP | HTAB | CR LF | LF ;
> word = ALNUM , { ALNUM } ;
> modifier = ALPHA ;
> phrase = DQUOTE , { VCHAR } , DQUOTE , { modifier } ;
>
> collect = "AND" | "and" | "&&" | "OR" | "or" | "||" ;
> include = "+" | "-" ;
> keyword = word ;
> relation = ":" | "=" | "<=" | ">=" | "<" | ">" ;
> select = keyword , relation;
> term = phrase | word ;
>
> part = [ include ] , [ select ] , term ;
> query = part , { [ space , collect ] , space , part } ;
>
> (* End Xesam USL EBNF *)
>
> Quick explanation of the grammar syntax:
>
> SP, HTAB, CR LF and LF are space, tab and two different line endings
> ALNUM is all letters and numbers (A-Z a-z 0-9) (see note below)
> ALPHA is all letters (A-Z a-z)
> VCHAR is all visible characters (excludes control chars) (see note below)
> DQUOTE is "
> | means OR
> , means concatenation
> [ ] is optional part
> { } repeats 0 or more times (thus implies optional)
>
> Implementation details;
>
> • When no operator ("collect") is given, default to "AND" (see
> discussion below)
>
> Proposed implementation details:
>
> • For "phrase" allow escaping any DQUOTE with a backslash "\", this
> implies that literal backslashes also need to be escaped.
> • Extend ALNUM and ALPHA to include all UTF-8 letters
> • Extend VCHAR to include all printable UTF-8 characters

I see that this is not explicitly noted in the search language spec,
but I think it is beyond discussion, especially because we have to
pass the query over DBUS.

I added this as item 14 to http://xesam.org/main/XesamUpdates

> • For "word" extend to VCHAR but exclude all whitespace and ":", "=",
> ">", "<", "|" and "&"
> • This will not reserve any chars for future extensions

This is implicit in the current definition of the USL, but it might
benefit from being spelled out. It is a very good point that we do not
currently reserve any special chars for future extensions.

I am not sure that we want to do that either. The USL is not meant to
be any more extensible than it already is (via modifiers on phrases).
It should be as simple as possible while still allowing search engines
to show their features.
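
To make the proposed "word" rule concrete, here is a rough Python
sketch of one possible reading of it. The names WORD_RE and is_word are
made up for illustration, and excluding the double quote as well is my
own assumption (it starts a phrase); none of this is in the spec.

```python
import re

# Characters the proposal excludes from "word": whitespace plus : = > < | &
# Assumption (not in the proposal): the double quote is excluded too,
# because it opens a phrase.
WORD_RE = re.compile(r'[^\s:=<>|&"]+')

def is_word(text):
    """Return True if the whole string is one USL word under this reading."""
    return WORD_RE.fullmatch(text) is not None
```

Under this reading "blåbær" and "c++" are words, while "foo:bar" and
"foo bar" are not.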

> • Leave it up to the server to decide what to do when an unknown keyword
> is encountered
>
> Items under discussion:
>
> • Do we want to allow other escapes like "\n", "\t" in "phrase" ?

I've been thinking about this. Since this is a query language designed
for end users, we should strive hard to remove any need for escaping
altogether.

Also, considering that the 'r' modifier implies that a phrase should be
matched as a regex, we really want to avoid needing to escape \ in
phrases.

Maybe a doubled single quote ('') in a phrase could escape a ". For
example, "foo ''bar baz''" would match the phrase in which bar baz is
enclosed in double quotes.
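
A minimal Python sketch of how a server might decode such a phrase
body, assuming the doubled-single-quote idea were adopted. The function
name unescape_phrase is hypothetical, and the sketch leaves open how a
literal pair of single quotes would be written.

```python
def unescape_phrase(body):
    """Turn each doubled single quote ('') in a phrase body into one ".

    Backslashes are left untouched, so a regex phrase needs no extra
    escaping under this scheme.
    """
    return body.replace("''", '"')
```

With this, the body foo ''bar baz'' decodes to foo "bar baz", and a
regex such as \d+ passes through unchanged.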

>
> Maybe consider this an optional extension.
>
> • Allow spaces between keyword, relation and term
>
> Will the user understand the difference between:
>
> creator = "Jimi Hendrix"
> creator="Jimi Hendrix"
>
> I think not. This also implies that this is allowed:
>
> type : audio
>
> which may not seem like clean syntax, but has very little chance of
> being mis-interpreted by the user.

For the record, I +1 this.
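
As a rough illustration of what allowing the spaces could look like on
the parsing side, here is a small Python sketch. SELECT_RE and
parse_select are made-up names, and using \w+ for the keyword and \S+
for an unquoted term are simplifying assumptions, not the grammar above.

```python
import re

# keyword, optional spaces, relation, optional spaces, term.
# "<=" and ">=" are listed before the single-character relations so
# that they are matched as a unit.
SELECT_RE = re.compile(
    r'(?P<keyword>\w+)\s*(?P<relation><=|>=|[:=<>])\s*(?P<term>"[^"]*"|\S+)'
)

def parse_select(text):
    """Split 'keyword relation term' into a tuple, or return None."""
    m = SELECT_RE.fullmatch(text.strip())
    if m is None:
        return None
    return m.group("keyword"), m.group("relation"), m.group("term")
```

Both spellings of the creator example then give the same result, and
type : audio parses as keyword "type", relation ":", term "audio".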

> Secondary to this, we might also want to allow space between "include"
> and "term".

Here I am not sure it is a good idea to allow spaces. It seems weird,
and Google does not allow it either. Google compatibility is a high
priority.

> • What precedence do the AND and OR operators have ?
>
> Most programming languages give OR precedence over AND

Name one. I don't know any that doesn't give AND precedence :-)

Try e.g. http://www.google.dk/search?q=boolean+operator+precedence and
check Java and C++

> 1 2 3 AND 4 5 6 OR 7 8 9 = (1 and 2 .. and 6) or (7 and 8 and 9)
>
> Giving AND precedence over OR seems less intuitive to me since AND is
> the default operator.

Yes. This was discussed a while back on XDG. The conclusion was that
boolean ops were mostly for advanced users, who were likely to
expect the standard precedence.
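
To spell out what the standard precedence means together with the
default AND, here is a small Python sketch. The function name
group_query is hypothetical, and it only splits on whitespace, so
quoted phrases containing spaces are not handled.

```python
def group_query(query):
    """Group terms so AND (explicit or implied) binds tighter than OR.

    Returns a list of groups: terms inside a group are ANDed together,
    and the groups themselves are ORed.
    """
    groups = [[]]
    for token in query.split():
        if token in ("OR", "or", "||"):
            groups.append([])       # OR starts a new top-level alternative
        elif token in ("AND", "and", "&&"):
            continue                # explicit AND equals the default operator
        else:
            groups[-1].append(token)
    return groups
```

For the query 1 2 3 AND 4 5 6 OR 7 8 9 this gives two groups, 1 to 6
and 7 to 9, which matches the grouping in Jaap's example.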


Thanks for the round up. Cheers,
Mikkel

