[Xesam] Xesam pending changes for RC2
Jaap Karssenberg
j.g.karssenberg at alumnus.utwente.nl
Thu Dec 13 12:22:51 PST 2007
Hi,
Below is a proposal for a grammar for the Xesam User Search Language and
some details Mikkel and I discussed in private mails. Please comment on
the proposed implementation details and the discussion items.
EBNF style grammar:
(* Xesam User Search Language EBNF *)
(* http://en.wikipedia.org/wiki/Extended_Backus-Naur_form *)
space = SP | HTAB | CR LF | LF ;
word = ALNUM , { ALNUM } ;
modifier = ALPHA ;
phrase = DQUOTE , { VCHAR } , DQUOTE , { modifier } ;
collect = "AND" | "and" | "&&" | "OR" | "or" | "||" ;
include = "+" | "-" ;
keyword = word ;
relation = ":" | "=" | "<=" | ">=" | "<" | ">" ;
select = keyword , relation ;
term = phrase | word ;
part = [ include ] , [ select ] , term ;
query = part , { [ space , collect ] , space , part } ;
(* End Xesam USL EBNF *)
Quick explanation of the grammar syntax:
SP, HTAB, CR LF and LF are space, tab and two different line endings
ALNUM is all letters and numbers (A-Z a-z 0-9) (see note below)
ALPHA is all letters (A-Z a-z)
VCHAR is all visible characters (excludes control chars) (see note below)
DQUOTE is "
| means OR
, means concatenation
[ ] is optional part
{ } repeats 0 or more times (thus implies optional)
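To make the grammar concrete, here is a minimal Python sketch of a parser for it, offered as an illustration rather than a reference implementation. The names (parse_query, PART, COLLECT, SPACE) are made up for this example. It assumes the plain ASCII definitions given above, lets a phrase contain spaces and any visible character except the double quote, accepts runs of whitespace where the grammar has a single "space", and reads "AND", "or", etc. as a collect operator whenever they appear between two parts.

```python
import re

SPACE = re.compile(r'[ \t\r\n]+')
# collect, which must be followed by whitespace and another part
COLLECT = re.compile(r'(AND|and|&&|OR|or|\|\|)[ \t\r\n]+')
# part = [ include ] , [ select ] , term
PART = re.compile(r'''
    (?P<include>[+-])?
    (?:(?P<keyword>[A-Za-z0-9]+)(?P<relation><=|>=|[:=<>]))?
    (?:"(?P<phrase>[\x20\x21\x23-\x7E]*)"(?P<modifiers>[A-Za-z]*)
      |(?P<word>[A-Za-z0-9]+))
''', re.VERBOSE)


def parse_query(text):
    """Return a list of (collect, part) pairs.

    collect is None for the first part and wherever no operator was given;
    part is a dict holding only the grammar elements that were present.
    """
    pos, result, collect = 0, [], None
    while True:
        m = PART.match(text, pos)
        if not m:
            raise ValueError(f"expected a part at offset {pos}")
        part = {k: v for k, v in m.groupdict().items() if v is not None}
        result.append((collect, part))
        pos = m.end()
        if pos == len(text):
            return result
        sp = SPACE.match(text, pos)
        if not sp:
            raise ValueError(f"expected whitespace at offset {pos}")
        pos = sp.end()
        op = COLLECT.match(text, pos)
        collect = op.group(1) if op else None
        if op:
            pos = op.end()
```

For example, parse_query('type:audio +creator="Jimi Hendrix" OR hendrix') yields three parts, the last one carrying the operator "OR". Input that does not fit the grammar, such as a keyword and relation without a term, raises ValueError in this sketch.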
Implementation details:
• When no operator ("collect") is given, default to "AND" (see
discussion below)
Proposed implementation details:
• For "phrase" allow escaping any DQUOTE with a backslash "\", this
implies that literal backslashes also need to be escaped.
• Extend ALNUM and ALPHA to include all UTF-8 letters
• Extend VCHAR to include all printable UTF-8 characters
• For "word" extend to VCHAR but exclude all whitespace and ":", "=",
">", "<", "|" and "&"
• This will not reserve any chars for future extensions
• Leave it up to the server to decide what to do when an unknown keyword
is encountered
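A minimal Python sketch of two of the proposals above, the backslash escaping in "phrase" and the extended "word", may help when commenting on them. The names PHRASE, WORD, unescape_phrase and is_word are hypothetical. The sketch treats "printable UTF-8 characters" as whatever Python's str.isprintable() accepts, which is an assumption, and it leaves escapes other than \" and \\ untouched since those are still under discussion.

```python
import re

# phrase body: escaped pairs (backslash + any character) or ordinary
# characters that are neither a double quote nor a backslash
PHRASE = re.compile(r'"((?:\\.|[^"\\])*)"')

# extended word: anything except whitespace and : = > < | &
WORD = re.compile(r'[^\s:=<>|&]+')


def unescape_phrase(body):
    # Replace backslash-quote with a quote and a doubled backslash with one
    # backslash; any other backslash sequence is returned unchanged.
    return re.sub(r'\\(["\\])', r'\1', body)


def is_word(s):
    # Assumption: "printable" is approximated by str.isprintable().
    return bool(s) and s.isprintable() and WORD.fullmatch(s) is not None
```

With these definitions the input "say \"hi\"" (including the outer quotes) matches PHRASE and unescapes to: say "hi". Note that under the proposed exclusion list a word may still contain characters such as '"', '+' and '-', which is what not reserving any characters for future extensions amounts to.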
Items under discussion:
• Do we want to allow other escapes like "\n", "\t" in "phrase" ?
Maybe consider this an optional extension.
• Allow spaces between keyword, relation and term
Will the user understand the difference between:
creator = "Jimi Hendrix"
creator="Jimi Hendrix"
I think not. This also implies that this is allowed:
type : audio
which may not seem like clean syntax, but has very little chance of
being misinterpreted by the user.
Secondary to this, we might also want to allow space between "include"
and "term".
• What precedence do the AND and OR operators have ?
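Most programming languages let AND bind more tightly than OR, so that OR
splits the query at the top level. This is what giving OR precedence over
AND means here: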
1 2 3 AND 4 5 6 OR 7 8 9 = (1 and 2 .. and 6) or (7 and 8 and 9)
Giving AND precedence over OR seems less intuitive to me since AND is
the default operator.
1 2 3 OR 4 5 6 = 1 and 2 and (3 or 4) and 5 and 6
However, if we define the "+" include as the default and give "AND"
precedence over "OR" and "OR" precedence over includes, it gets better.
1 2 3 OR 4 5 6 = (1 and 2 and 3) OR (4 and 5 and 6)
1 2 3 AND 4 5 6 OR 7 8 9 = (1 and 2 and 3) AND ( (4 and 5 and 6) OR
(7 and 8 and 9) )
To me it seems best to go for the first option and give "OR" precedence
over "AND".
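The three groupings discussed in the last item can be compared with a small Python sketch. It is only an illustration: the function names are made up, the input is a list of already separated tokens, and includes, keywords and phrases are ignored. A missing operator between two terms is read as the default "AND".

```python
AND_OPS = {"AND", "and", "&&"}
OR_OPS = {"OR", "or", "||"}


def split_on(tokens, ops):
    """Split a token list into groups at every token found in ops."""
    groups, current = [], []
    for tok in tokens:
        if tok in ops:
            groups.append(current)
            current = []
        else:
            current.append(tok)
    groups.append(current)
    return groups


def or_outermost(tokens):
    # First option: OR splits the whole query; explicit and implicit AND
    # are the same. Result: a list of AND-groups that are ORed together.
    return [[t for t in g if t not in AND_OPS] for g in split_on(tokens, OR_OPS)]


def or_innermost(tokens):
    # Second option: OR only joins its two neighbouring terms.
    # Result: a list of OR-groups that are ANDed together.
    groups, join = [], False
    for tok in tokens:
        if tok in OR_OPS:
            join = True
        elif tok in AND_OPS:
            join = False
        elif join and groups:
            groups[-1].append(tok)
            join = False
        else:
            groups.append([tok])
    return groups


def explicit_and_outermost(tokens):
    # Third option: explicit AND splits first, then OR, and adjacent terms
    # are joined last. Result: AND of (OR of (implicit-AND groups)).
    return [split_on(chunk, OR_OPS) for chunk in split_on(tokens, AND_OPS)]
```

For the query "1 2 3 AND 4 5 6 OR 7 8 9", or_outermost gives the two groups (1..6) and (7 8 9), while explicit_and_outermost gives (1 2 3) AND ((4 5 6) OR (7 8 9)). For "1 2 3 OR 4 5 6", or_innermost gives 1, 2, (3 or 4), 5, 6. These match the examples written out above.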
Regards,
Jaap <pardus at cpan.org>