[Xesam] Xesam pending changes for RC2

Jaap Karssenberg j.g.karssenberg at alumnus.utwente.nl
Thu Dec 13 12:22:51 PST 2007


Hi,

Below is a proposal for a grammar for the Xesam User Search Language, 
along with some details Mikkel and I discussed in private mail. Please 
comment on the proposed implementation details and the discussion items.

EBNF style grammar:

(* Xesam User Search Language EBNF *)
(* http://en.wikipedia.org/wiki/Extended_Backus-Naur_form *)

space = SP | HTAB | CR LF | LF ;
word = ALNUM , { ALNUM } ;
modifier = ALPHA ;
phrase = DQUOTE , { VCHAR } , DQUOTE , { modifier } ;

collect = "AND" | "and" | "&&" | "OR" | "or" | "||" ;
include = "+" | "-" ;
keyword = word ;
relation = ":" | "=" | "<=" | ">=" | "<" | ">" ;
select = keyword , relation;
term = phrase | word ;

part = [ include ] , [ select ] , term ;
query = part , { [ space , collect ] , space , part } ;

(* End Xesam USL EBNF *)

Quick explanation of the grammar syntax:

SP, HTAB, CR LF and LF are space, tab and two different line endings
ALNUM is all letters and numbers (A-Z a-z 0-9) (see note below)
ALPHA is all letters (A-Z a-z)
VCHAR is all visible characters (excludes control chars) (see note below)
DQUOTE is "
| means OR
, means concatenation
[ ] is optional part
{ } repeats 0 or more times (thus implies optional)
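
As a rough illustration of how the grammar above could be read by a 
client, here is a minimal Python sketch of a scanner for the plain ASCII 
version. The names PART_RE, COLLECT_RE and parse_query are made up for 
this example and are not part of any Xesam API. It assumes that a phrase 
may contain spaces, and that a bare "and"/"or" between two parts is an 
operator rather than a word (the grammar itself leaves this ambiguous).

```python
import re

# part = [ include ] , [ select ] , term   (ASCII-only reading of the grammar)
PART_RE = re.compile(
    r'(?P<include>[+-])?'
    r'(?:(?P<keyword>[A-Za-z0-9]+)(?P<relation><=|>=|[:=<>]))?'
    r'(?:"(?P<phrase>[^"]*)"(?P<modifiers>[A-Za-z]*)|(?P<word>[A-Za-z0-9]+))'
)
# collect operators count only when followed by whitespace (assumption)
COLLECT_RE = re.compile(r'(AND|and|&&|OR|or|\|\|)(?=\s)')


def parse_query(text):
    """Return a list of ("part", dict) and ("collect", operator) items."""
    items = []
    pos = 0
    n = len(text)
    while pos < n:
        if text[pos].isspace():
            pos += 1
            continue
        # An operator is only accepted directly after a part.
        m = COLLECT_RE.match(text, pos)
        if m and items and items[-1][0] == "part":
            items.append(("collect", m.group(1)))
            pos = m.end()
            continue
        m = PART_RE.match(text, pos)
        # A part must be followed by whitespace or the end of the query.
        if m is None or (m.end() < n and not text[m.end()].isspace()):
            raise ValueError(f"cannot parse part at offset {pos}")
        fields = {k: v for k, v in m.groupdict().items() if v is not None}
        items.append(("part", fields))
        pos = m.end()
    if items and items[-1][0] == "collect":
        raise ValueError("query ends with an operator")
    return items


items = parse_query('+type:audio creator="Jimi Hendrix" OR hendrix')
```

Each part comes back as a dict holding only the fields that were 
present, so the caller can see whether a term was a word or a phrase.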

Implementation details:

• When no operator ("collect") is given, default to "AND" (see 
discussion below)

Proposed implementation details:

• For "phrase" allow escaping any DQUOTE with a backslash "\", this 
implies that literal backslashes also need to be escaped.
• Extend ALNUM and ALPHA to include all UTF-8 letters
• Extend VCHAR to include all printable UTF-8 characters
• For "word" extend to VCHAR but exclude all whitespace and ":", "=", 
">", "<", "|" and "&"
• This will not reserve any chars for future extensions
• Leave it up to the server to decide what to do when an unknown keyword 
is encountered
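
To make the escaping and the extended "word" rule concrete, here is a 
small Python sketch. PHRASE_RE, WORD_RE and unescape_phrase are 
hypothetical names chosen for this example. It assumes that only \" and 
\\ are treated as escapes and that any other backslash sequence is left 
untouched, since the other escapes are still under discussion.

```python
import re

# A phrase body is any run of non-quote, non-backslash characters
# or backslash-escaped characters.
PHRASE_RE = re.compile(r'"((?:[^"\\]|\\.)*)"')

# Extended "word": anything except whitespace and the excluded characters
# ":", "=", ">", "<", "|" and "&" (printability is not checked here).
WORD_RE = re.compile(r'[^\s:=<>|&]+')


def unescape_phrase(body):
    """Turn \\" into " and \\\\ into \\ ; leave other sequences alone."""
    return re.sub(r'\\(["\\])', r'\1', body)


raw = r'"say \"hi\" \\ bye"'
match = PHRASE_RE.match(raw)
text = unescape_phrase(match.group(1))
```

Because Python strings are Unicode, the word pattern accepts non-ASCII 
letters without further changes.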

Items under discussion:

• Do we want to allow other escapes like "\n", "\t" in "phrase" ?

Maybe consider this an optional extension.

• Allow spaces between keyword, relation and term

Will the user understand the difference between:

creator = "Jimi Hendrix"
creator="Jimi Hendrix"

I think not. This also implies that this is allowed:

type : audio

which may not seem like clean syntax, but has very little chance of 
being mis-interpreted by the user.

Secondary to this, we might also want to allow space between "include" 
and "term".
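
A sketch of what allowing this space could look like in a parser, in 
Python. SELECT_RE and parse_select are names invented for this example, 
and it uses the plain ASCII character classes from the grammar. The only 
change compared to a strict reading is the optional whitespace around the 
relation.

```python
import re

# keyword , [ spaces ] , relation , [ spaces ] , term
SELECT_RE = re.compile(
    r'(?P<keyword>[A-Za-z0-9]+)\s*'
    r'(?P<relation><=|>=|[:=<>])\s*'
    r'(?:"(?P<phrase>[^"]*)"|(?P<word>[A-Za-z0-9]+))'
)


def parse_select(text):
    """Return (keyword, relation, term), or None if text does not match."""
    m = SELECT_RE.fullmatch(text.strip())
    if m is None:
        return None
    term = m.group("phrase") if m.group("phrase") is not None else m.group("word")
    return (m.group("keyword"), m.group("relation"), term)


spaced = parse_select('creator = "Jimi Hendrix"')
compact = parse_select('creator="Jimi Hendrix"')
loose = parse_select('type : audio')
```

With this pattern both spellings of the creator query give the same 
result, which is the behaviour argued for above.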

• What precedence do the AND and OR operators have?

Most programming languages bind AND more tightly than OR, so OR ends up 
as the outermost operator (this is what "OR precedence over AND" means 
here):

1 2 3 AND 4 5 6 OR 7 8 9 = (1 and 2 .. and 6) or (7 and 8 and 9)

Giving AND precedence over OR seems less intuitive to me since AND is 
the default operator.

1 2 3 OR 4 5 6 = 1 and 2 and (3 or 4) and 5 and 6

However, if we define the "+" include as the default and give "AND" 
precedence over "OR", and "OR" precedence over includes, it gets better.

1 2 3 OR 4 5 6 = (1 and 2 and 3) OR (4 and 5 and 6)
1 2 3 AND 4 5 6 OR 7 8 9 = (1 and 2 and 3) AND ( (4 and 5 and 6) OR (7 
and 8 and 9) )

To me it seems best to go for the first option and give "OR" precedence 
over "AND".
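
To compare the options, here is a small Python sketch of the two 
groupings that came out best above. The function names are made up for 
this example, and the input is assumed to be already split into simple 
tokens. group_first_option treats every AND (explicit or implied) as 
binding tighter than OR. group_third_option lets implied ANDs bind 
tightest, then OR, then explicit AND.

```python
AND_OPS = {"AND", "and", "&&"}
OR_OPS = {"OR", "or", "||"}


def group_first_option(tokens):
    """OR of AND-groups: returns a list of lists of terms."""
    groups = [[]]
    for tok in tokens:
        if tok in OR_OPS:
            groups.append([])        # start a new OR alternative
        elif tok in AND_OPS:
            continue                 # explicit AND equals the implied one
        else:
            groups[-1].append(tok)
    return groups


def group_third_option(tokens):
    """AND of (OR of implied-AND groups): three nesting levels."""
    result = [[[]]]
    for tok in tokens:
        if tok in AND_OPS:
            result.append([[]])      # explicit AND splits at the top level
        elif tok in OR_OPS:
            result[-1].append([])    # OR splits inside the current AND branch
        else:
            result[-1][-1].append(tok)
    return result


query = "1 2 3 AND 4 5 6 OR 7 8 9".split()
first = group_first_option(query)
third = group_third_option(query)
```

For the example query, the first option yields two alternatives 
(1 to 6, or 7 to 9), while the third yields (1 2 3) AND ((4 5 6) OR 
(7 8 9)), matching the expansions written out above.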


Regards,

Jaap <pardus at cpan.org>


