[Xesam] suggestions from fosscamp

Sat May 17 13:20:48 PDT 2008

Hi all,

On Sat, May 17, 2008 at 10:19 PM, Mikkel Kamstrup Erlandsen
<mikkel.kamstrup at gmail.com> wrote:
> 2008/5/17 Jos van den Oever <jvdoever at gmail.com>:
>>
>> Hello all,
>>
>> Yesterday I blogged [1] about attending FOSSCamp. For the list I'd
>> like to go into a bit more detail about two points I mentioned there.
>> I'd like to have a small initial discussion on these topics and, if we
>> would like to start working on any of them, move to a separate thread
>> for each.
>
> Thanks for the write up!
>
> In retrospect my below comments probably sound more negative than I am. So
> let me first make it clear that I think all the points you raised can, and
> should be, addressed in some way or other :-)
>

 Sorry, I am not very positive either, but let's talk about it!

>>
>> common index file format
>>
>> Distributions want to install preindexed files such as documentation.
>> All programs that implement Xesam would ideally implement support for
>> one common format. Such a format may be heavily optimized for reading
>> only and there should be a mechanism to tell the indexer about the
>> presence of the index.

 The index file format is a very low-level implementation detail. To
me, it looks like trying to standardize all web applications on
MySQL, instead of PostgreSQL or any other database system. The
standardization shouldn't go to such a low level.

 At the same time, some kind of low-level compatibility could be
interesting for memory cards, in order to avoid reindexing the card
every time it is plugged into a device/computer. If another indexer has
already built the index of the contents, it would be nice to reuse that
work. Instead of a common index file format, I would suggest studying an
"import/export format", a kind of generic representation of the index
data (word, document, position,...), that the applications could
import/export to their native formats (Lucene, Hyper Estraier, QDBM...).

 But I am not very confident in this idea either: for instance,
Tracker doesn't store "position" information for the words, so it cannot
export that information, and the exported data would then be totally
useless for other indexers.
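
 To make the idea a bit more concrete, here is a rough sketch of what
such an exchange record could look like. Everything in it is an
assumption for illustration only: the names (Posting, export_postings,
import_postings) and the one-JSON-object-per-line layout are not part of
Xesam or of any existing indexer. Positions are optional, so an indexer
that doesn't store them can still export, and the importer can detect
that phrase search is not possible with that data.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass
class Posting:
    """One (word, document) entry in the hypothetical exchange format."""
    word: str
    document: str
    # None means the exporting indexer does not store word positions.
    positions: Optional[List[int]] = None


def export_postings(postings):
    """Serialize postings as one JSON object per line."""
    return "\n".join(json.dumps(asdict(p), sort_keys=True) for p in postings)


def import_postings(text):
    """Parse the line-based format back into Posting objects."""
    return [Posting(**json.loads(line))
            for line in text.splitlines() if line.strip()]


def supports_phrase_search(postings):
    """Phrase queries need positions on every imported posting."""
    return all(p.positions is not None for p in postings)
```

 An importer could call supports_phrase_search() on the imported data
and fall back to reindexing the card when it returns False.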

>> For this to become a reality we need to look at what we need in such a
>> file format. I want to avoid mentioning any technicalities but focus
>> on requirements. Here's what I think such a format would need, please
>> add more requirements.
>>  - cross platform
>>  - optimized for reading (we're talking about read-only files)
>>  - small indexes
>>  - good support for the full xesam search language features
>>  - good performance for many small indexes so distros can do
>> fine-grained distribution
>
> I have several concerns about this. I know that you request feedback on the
> requirements, but I can't help but mention a few technical problems I
> see.
>
>  * If we require servers to support external indexes we open up the whole
> federated-search-can-o-worms, e.g. how to normalize hit ranking and how to
> share term frequencies (which are needed for "correct" ranking).
>
>  * It might make it trickier to do some hardcore optimizations in your
> native index format. Consider for example the case that the whole Lucene API
> relies pretty heavily on the fact that the docs are ordered sequentially in
> the index. Allowing the search engines to make private assumptions about the
> index structure can allow for some nice optimizations.
>
> My proposal to solve the "3rd party index problem" would simply be to have
> apps ship a huge XML file with the metadata (in the format we need to
> specify for shared harvesting anyway). This solves the two problems I
> mention above too. Having the 3rd party data stored in one big XML
> (or other more optimized) file should make indexing it pretty
> quick.
>
> It does create some problems with regard to having the index data on a
> per-user basis. Nothing that can't be handled though.

 I think this matches my idea of the import/export format...

>
>
>>
>> management interface
>>
>> To make Xesam attractive for program authors, we need an API that
>> allows applications to ensure that particular files are indexed. This
>> means that e.g. Amarok could tell the indexer 'index the title and
>> artist for all mp3 files on the system' or 'use this index that my
>> application provides'.
>> Do you think we could agree on designing an API for doing these things?
>
>
> We have discussed in the past that we want a common way to add data for
> indexation, and I think it is safe to assume that everybody wants that in
> some form or other. Comments:
>
>  * Do we not assume that the indexer will always index anything it can find
> and comprehend? In that case it should not be needed to tell the indexer to
> index anything you need. "If it is not indexed I can't index it".
>

 Totally agree with this answer.

>  * Index pollution - if apps are allowed to add data to the index as they see
> fit they can also pollute the index by ranking their data unduly high (if
> they are allowed to tweak rankings). Assuming that app
> authors are likely to think that *their* apps are the most important on the
> planet (except for mine which are more important), they will likely overrate
> their data.

 If the applications want to put information into the index, they must
do it through the proper API, not by writing their own index and merging
it with the "main" index. Mikkel pointed out one problem (adulterating
the score), but there is also another big one: the applications would
have to know the index format, and that sounds like a very bad idea. If
you want to change the index file format... must all the applications
update their code?
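
 As a sketch of the separation I have in mind (all names here are
hypothetical, this is not the Xesam API nor any indexer's real
interface): the application only hands over a URI and some metadata
fields, the indexer decides how to store them, and there is simply no
parameter through which an application could set its own score.

```python
from abc import ABC, abstractmethod


class IndexBackend(ABC):
    """Storage side; its on-disk format is private to the indexer."""

    @abstractmethod
    def store(self, uri, fields):
        ...


class InMemoryBackend(IndexBackend):
    """Stand-in backend; a real one could be Lucene, QDBM, etc."""

    def __init__(self):
        self.docs = {}

    def store(self, uri, fields):
        self.docs[uri] = dict(fields)


class Indexer:
    """Hypothetical submission API seen by applications."""

    # Assumed field whitelist, for illustration only.
    ALLOWED_FIELDS = {"title", "artist"}

    def __init__(self, backend):
        self._backend = backend

    def submit(self, uri, fields):
        # Reject anything outside the agreed fields, e.g. a "score".
        unknown = set(fields) - self.ALLOWED_FIELDS
        if unknown:
            raise ValueError("unsupported fields: %s" % sorted(unknown))
        self._backend.store(uri, fields)
```

 Swapping InMemoryBackend for another backend would not require any
change in the applications calling submit().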

>> To close, some nice news from the people I talk to. The people seem
>> keen on using Xesam, not Tracker, Strigi or Beagle. They are not
>> bothered by being limited to the features in the spec but would like
>> the standardization process to go forward to add more. It is clear
>> that people like the choice which they get by being able to choose one
>> of a few standard compliant programs.

 In Tracker we are also making some noise around Xesam. I think it is
a standard in the proper place :)

 Cheers,

Ivan