[XESAM] Questions regarding Xesam Query Language

Mon Feb 18 02:23:39 PST 2008

On 17/09/2007, Anders Rune Jensen <anders at iola.dk> wrote:
> Hi
>
> I've been reading the Xesam Query Language specification and I have a few questions:
>
> 1) How do I get all tags for all files? Preferably as list<name,
> count>. The same is true for other attributes such as mime-types.

This borders on metadata management, and the search API is not
optimized for this usage. This will be more natural when we have the
metadata storage API as well as an Index API (which can provide stats
about the index).

I see two ways to do this purely within the search API.

1.1) Get all tags, by getting all non-empty titles of tag objects:

session.set_property("hit.fields", ["title"]);

and then query:

<query content="Tag">
  <equals negate="true">
    <field name="title"/>
    <string></string>
  </equals>
</query>

This will give you a list of tag titles. Then, with a second "counting
session", count the documents for each tag:

counting_ses.set_property ("hit.fields", [])

for tag in $tags:
  search = new Search(counting_ses,
    "<query>
      <equals>
        <field name="userKeyword"/>
        <string>$tag</string>
      </equals>
    </query>");
  search.start();
  wait_until_SearchDone_emitted(search);
  count = search.get_hit_count();
  search.close();

Since the counting session requests no fields and you only retrieve the
hit count, this can be made highly efficient if the search engine is
smart enough.
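
One detail the pseudocode above glosses over is that $tag is pasted
straight into the XML, so a tag containing "&" or "<" would break the
query. A minimal sketch of building the per-tag query safely, assuming
the same <query>/<equals> layout as above (build_tag_count_query is a
made-up helper name, not part of any Xesam binding):

```python
from xml.sax.saxutils import escape


def build_tag_count_query(tag: str) -> str:
    """Build the per-tag counting query used in approach 1.1.

    Hypothetical helper: the element layout simply mirrors the
    pseudocode above and is not taken from a real Xesam client library.
    """
    # escape() replaces &, < and > so the tag cannot break the XML.
    return (
        "<query>\n"
        "  <equals>\n"
        '    <field name="userKeyword"/>\n'
        f"    <string>{escape(tag)}</string>\n"
        "  </equals>\n"
        "</query>"
    )
```

The returned string is what you would hand to the search constructor
in the loop above, one search per tag.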

1.2) The other way to do this: simply query all objects with
userKeyword != "" and then count how many occurrences of each
userKeyword there are. This is much simpler than 1.1, but also a lot
less efficient, I think.
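
The counting step in 1.2 happens entirely on the client side. A rough
sketch, assuming each hit has already been reduced to the list of its
userKeyword values (how you pull those out of the hit data depends on
your binding; count_keywords is a made-up name):

```python
from collections import Counter
from typing import Iterable, List, Tuple


def count_keywords(hits: Iterable[Iterable[str]]) -> List[Tuple[str, int]]:
    """Return (keyword, object count) pairs, most frequent first.

    `hits` is assumed to hold one iterable of userKeyword values per
    matched object; that shape is an assumption of this sketch.
    """
    counts: Counter = Counter()
    for keywords in hits:
        # Use a set so a keyword repeated on one object counts once,
        # and drop empty strings.
        counts.update({k for k in keywords if k})
    return counts.most_common()
```

This gives the list<name, count> asked for, at the cost of pulling
every tagged object through the search API.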

> 2) How can I sort the output? Sorting is very important when you
> specify a maximum number of results one wants to have returned (btw.
> is this defined?).

The session properties sort.order, sort.primary and sort.secondary are
your friends. I just noticed that the documentation on the sort.*
properties is a bit outdated, but the idea should be clear enough.
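
To illustrate the intended semantics only: the sketch below sorts
hits on the client by a primary and then a secondary field and cuts
the list off afterwards. It assumes sort.primary and sort.secondary
name fields and sort.order picks the direction; the function and the
dict-shaped hits are made up for the example, and a real engine would
do this server-side.

```python
from typing import Any, Dict, List, Optional


def sort_hits(
    hits: List[Dict[str, Any]],
    primary: str,
    secondary: str,
    descending: bool = False,
    limit: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Order hits by `primary`, break ties with `secondary`, then truncate.

    Illustration only: this mimics what the sort.* session properties
    are meant to achieve, it is not how a Xesam engine implements them.
    """
    ordered = sorted(
        hits,
        key=lambda hit: (hit[primary], hit[secondary]),
        reverse=descending,
    )
    # Truncating after sorting is what makes a maximum result count
    # meaningful: you get the first N hits in the requested order.
    return ordered if limit is None else ordered[:limit]
```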

> 3) I find the names fullText and contains and what they do a little
> strange. I would have thought that contains did what fullText does, so
> maybe renaming it to containsWord would help?

The <string> element defaults to being interpreted as a phrase if
multiple words occur, so containsWord would be even more misleading.
The rule for "contains" is that the provided value element *must* be
present in the field(s) queried. The fullText selector is very loosely
defined and is allowed to be more sloppy.

I can't see that we could find a better name than "contains", but
"fullText" might be unclear.

The problem is that the fullText selector is unclear in essence. It
basically just means "query everything and match in the way you find
best". In my daily work this functionality is often referred to as
full text (or free text) search. Hence the name fullText. I am not
sure this is standard though, and I am open to ideas.

REFS:
fullText selector: http://xesam.org/main/XesamQueryLanguage90#fullText
string element: http://xesam.org/main/XesamQueryLanguage90#string
session props: http://xesam.org/main/XesamSearch90#properties

NOTE TO ALL:
I added the idea about the Index API to
http://xesam.org/main/XesamIteration2. It was discussed on IRC a while
ago, but I must have forgotten to add it to the page.

If you have any use cases, ideas, or comments relevant for the second
iteration of Xesam, please add them there.

Cheers,
Mikkel