[Xesam] suggestions from fosscamp

Mikkel Kamstrup Erlandsen mikkel.kamstrup at gmail.com
Sat May 17 12:19:23 PDT 2008


2008/5/17 Jos van den Oever <jvdoever at gmail.com>:

> Hello all,
>
> Yesterday I blogged [1] about attending FOSSCamp. For the list I'd
> like to go into a bit more detail about two points I mentioned there.
> I'd like to do a small initial discussion on these topics and if we
> would like to start working on any of these, go to a separate thread
> on them.
>

Thanks for the write up!

In retrospect my comments below probably sound more negative than I intend.
So let me first make it clear that I think all the points you raised can and
should be addressed in some way or other :-)


>
> common index file format
>
> Distributions want to install preindexed files such as documentation.
> All programs that implement Xesam would ideally implement support for
> one common format. Such a format may be heavily optimized for reading
> only and there should be a mechanism to tell the indexer about the
> presence of the index.
> For this to become a reality we need to look at what we need in such a
> file format. I want to avoid mentioning any technicalities but focus
> on requirements. Here's what I think such a format would need, please
> add more requirements.
>  - cross platform
>  - optimized for reading (we're talking about read-only files)
>  - small indexes
>  - good support for the full xesam search language features
>  - good performance for many small indexes so distros can do
> fine-grained distribution
>

I have several concerns about this. I know that you request feedback on the
requirements, but I can't help but mention a few technical problems I
see.

 * If we require servers to support external indexes we open up the whole
federated-search-can-o-worms. For example: how do we normalize hit ranking,
and how do we share term frequencies (which are needed for "correct"
ranking)?

 * It might make it trickier to do some hardcore optimizations in your
native index format. Consider for example the case that the whole Lucene API
relies pretty heavily on the fact that the docs are ordered sequentially in
the index. Allowing the search engines to make private assumptions about the
index structure can allow for some nice optimizations.
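
To illustrate the term-frequency point, here is a minimal sketch. It assumes
a toy TF-IDF scorer (not what any particular Xesam backend actually uses),
and the index sizes and the term are invented for illustration. The same
document scores very differently depending on whose statistics are used, so
hits from two independently built indexes cannot simply be merged by score:

```python
import math

def idf(num_docs, doc_freq):
    """Smoothed inverse document frequency for one term in one index."""
    return math.log((1 + num_docs) / (1 + doc_freq)) + 1.0

def score(term_count, num_docs, doc_freq):
    """Toy TF-IDF score of a single-term query against one document."""
    return term_count * idf(num_docs, doc_freq)

# Hypothetical statistics for one query term in two separate indexes.
native = {"num_docs": 100_000, "doc_freq": 50}   # term is rare in the user's index
shipped = {"num_docs": 200, "doc_freq": 150}     # term is common in a shipped doc index

# A document containing the term 3 times, scored with each index's own stats.
local_native = score(3, **native)
local_shipped = score(3, **shipped)

# The same document scored with shared (summed) statistics.
shared = score(
    3,
    native["num_docs"] + shipped["num_docs"],
    native["doc_freq"] + shipped["doc_freq"],
)

print(local_native, local_shipped, shared)
```

With purely local statistics the two scores are not comparable; only the
shared statistics give one consistent scale, and obtaining those is exactly
what a read-only external index makes awkward.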

My proposal to solve the "3rd party index problem" would simply be to have
apps ship a huge XML file with the metadata (in the format we need to
specify for shared harvesting anyway). This solves the two problems I
mention above too. Since the 3rd party data is stored in one big XML (or
other more optimized) file, indexing it should be pretty quick.

It does cause some problems with regard to keeping the index data on a
per-user basis, but nothing that can't be handled.
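
As a rough sketch of what consuming such a file could look like: the XML
layout below (the `metadata`, `item` and `field` elements and the `uri` and
`name` attributes) is entirely hypothetical, since the shared harvesting
format has not been specified, and the field names are only illustrative.
The file is streamed so that a very large one need not be held in memory:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical metadata file as an application might ship it.
SAMPLE = """\
<metadata>
  <item uri="file:///usr/share/doc/foo/index.html">
    <field name="xesam:title">Foo Manual</field>
    <field name="xesam:mimeType">text/html</field>
  </item>
  <item uri="file:///usr/share/doc/foo/faq.html">
    <field name="xesam:title">Foo FAQ</field>
    <field name="xesam:mimeType">text/html</field>
  </item>
</metadata>
"""

def iter_items(stream):
    """Yield (uri, fields) pairs from the stream, one <item> at a time."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "item":
            fields = {f.get("name"): (f.text or "") for f in elem.findall("field")}
            yield elem.get("uri"), fields
            elem.clear()  # drop the parsed children to keep memory use flat

# Stand-in for the indexer's native store: here just a dict keyed by URI.
index = dict(iter_items(io.BytesIO(SAMPLE.encode("utf-8"))))
print(len(index), "items harvested")
```

Each search engine would feed the yielded pairs into its own native index,
which keeps ranking and index layout entirely under the engine's control.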



> management interface
>
> To make Xesam attractive for program authors, we need an API that
> allows applications to ensure that particular files are indexed. This
> means that e.g. Amarok could tell the indexer 'index the title and
> artist for all mp3 files on the system' or 'use this index that my
> application provides'.
> Do you think we could agree on designing an API for doing these things?


We have discussed in the past that we want a common way to add data for
indexation, and I think it is safe to assume that everybody wants that in
some form or other. Comments:

 * Do we not assume that the indexer will always index anything it can find
and comprehend? In that case it should not be necessary to tell the indexer
to index anything you need. "If it is not indexed I can't index it".

 * Index pollution - if apps are allowed to add data to the index as they see
fit they can also pollute the index by ranking their data unduly high (if
they are allowed to tweak rankings). Assuming that app authors are likely to
think that *their* apps are the most important on the planet (except for
mine which are more important), they will likely overrate their data.
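
A tiny sketch of the pollution concern, assuming a hypothetical scheme in
which the final rank is the engine's relevance multiplied by an app-supplied
boost (no such API exists in the spec; the URIs and numbers are made up):

```python
def rank(hits):
    """Sort hits by engine relevance multiplied by an app-supplied boost."""
    return sorted(hits, key=lambda h: h["relevance"] * h["boost"], reverse=True)

hits = [
    # A highly relevant user document with a neutral boost.
    {"uri": "file:///home/user/thesis.odt", "relevance": 0.9, "boost": 1.0},
    # A barely relevant item whose owning app rated its own data very high.
    {"uri": "app://selfish/changelog", "relevance": 0.2, "boost": 10.0},
]

ranked = rank(hits)
print([h["uri"] for h in ranked])
```

The barely relevant item ends up first purely because of the boost its own
app assigned, which is the situation described above.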

> To close, some nice news from the people I talk to. The people seem
> keen on using Xesam, not Tracker, Strigi or Beagle. They are not
> bothered by being limited to the features in the spec but would like
> the standardization process to go forward to add more. It is clear
> that people like the choice which they get by being able to choose one
> of a few standard compliant programs.


Nice to hear!

I should probably start a thread about how we should move on with Xesam...

Cheers,
Mikkel