2008/5/17 Jos van den Oever <<a href="mailto:jvdoever@gmail.com">jvdoever@gmail.com</a>>:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Hello all,<br>
<br>
Yesterday I blogged [1] about attending FOSSCamp. For the list I'd<br>
like to go into a bit more detail about two points I mentioned there.<br>
I'd like to do a small initial discussion on these topics and if we<br>
would like to start working on any of these, go to a separate on them.<br>
</blockquote><div><br>Thanks for the write-up!<br><br>In retrospect, my comments below probably sound more negative than I am. So let me first make it clear that I think all the points you raised can, and should, be addressed in some way or other :-)<br>
</div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><br>
common index file format<br>
<br>
Distributions want to install preindexed files such as documentation.<br>
All programs that implement Xesam would ideally implement support for<br>
one common format. Such a format may be heavily optimized for reading<br>
only and there should be a mechanism to tell the indexer about the<br>
presence of the index.<br>
For this to become a reality we need to look at what we need in such a<br>
file format. I want to avoid mentioning any technicalities but focus<br>
on requirements. Here's what I think such a format would need, please<br>
add more requirements.<br>
- cross platform<br>
- optimized for reading (we're talking about read-only files)<br>
- small indexes<br>
- good support for the full xesam search language features<br>
- good performance for many small indexes so distros can do<br>
fine-grained distribution<br>
</blockquote><div><br>I have several concerns about this. I know that you asked for feedback on the requirements, but I can't help mentioning a few technical problems I see.<br><br> * If we require servers to support external indexes we open up the whole federated-search-can-o-worms. For example, how do we normalize hit ranking and share term frequencies (which are needed for "correct" ranking)?<br>
<br> * It might make it trickier to do some hardcore optimizations in your native index format. Consider, for example, that the whole Lucene API relies pretty heavily on the docs being ordered sequentially in the index. Letting search engines make private assumptions about the index structure allows for some nice optimizations.<br>
<br>My proposal to solve the "3rd party index problem" would simply be to have apps ship a huge XML file with the metadata (in the format we need to specify for shared harvesting anyway). This also avoids the two problems I mention above. Since the 3rd party data is stored in one big XML file (or some other, more optimized format), indexing it should be pretty quick.<br>
<br>It does cause some problems with regard to keeping the index data on a per-user basis, but nothing that can't be handled.<br><br></div><div> </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
management interface<br>
<br>
To make Xesam attractive for program authors, we need an API that<br>
allows applications to ensure that particular files are indexed. This<br>
means that e.g. Amarok could tell the indexer 'index the title and<br>
artist for all mp3 files on the system' or 'use this index that my<br>
application provides'.<br>
Do you think we could agree on designing an API for doing these things.</blockquote><div> <br>We have discussed in the past that we want a common way to add data for indexation, and I think it is safe to assume that everybody wants that in some form or other. Comments:<br>
<br> * Do we not assume that the indexer will always index anything it can find and comprehend? In that case it should not be necessary to tell the indexer to index anything you need: "If it is not indexed, I can't index it".<br>
<br> * Index pollution - if apps are allowed to add data to the index as they see fit, they can also pollute the index by ranking their data unduly high (if they are allowed to tweak rankings). Assuming that app authors are likely to think that *their* apps are the most important on the planet (except for mine, which are more important), they will likely overrate their data. <br>
<br></div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
To close, some nice news from the people I talk to. The people seem<br>
keen on using Xesam, not Tracker, Strigi or Beagle. They are not<br>
bothered by being limited to the features in the spec but would like<br>
the standardization process to go forward to add more. It is clear<br>
that people like the choice which they get by being able to choose one<br>
of a few standard compliant programs.</blockquote><div><br>Nice to hear!<br> <br></div></div>I should probably start a thread about how we should move on with Xesam...<br><br>Cheers,<br>Mikkel<br>