[Xesam] Why is vendor.maxhits read-only?

Mikkel Kamstrup Erlandsen mikkel.kamstrup at gmail.com
Tue Dec 18 13:43:12 PST 2007


On 18/12/2007, Joe Shaw <joe at joeshaw.org> wrote:
> Hi,
>
> On 12/18/07, Mikkel Kamstrup Erlandsen <mikkel.kamstrup at gmail.com> wrote:
> > Yeah, I understand that. I was wondering why the default is not just
> > MAXINT? Why is it that it requires "quite a bit more memory and CPU" as
> > you say below? This is not my usual experience using Lucene on several
> > million docs.
>
> Well, as you also know with Lucene the doc IDs are pretty useless by
> themselves.  The real cost is in turning those doc IDs into Document
> instances.  We need to do that for a few reasons, and creating them is
> costly (in disk seeks) and keeping them around is costly (in memory
> usage).  The processing we do on them (for instance, to sort them by
> date) is costly in CPU usage.
>
> > Here's another shot then :-) This gets semi-Lucene-technical, so hang on...
> >
> > By "sequential" I mean that GetHits(count) is basically a read()
> > method that reads sequentially through the result set.
> >
> > When you receive your results from Lucene you basically get an
> > iterator over the doc ids. Then I assume from your description that
> > you fetch the relevant (stored?) fields for the first 100 hits in that
> > iterator.
>
> If only it were that simple. :)  We're actually searching across two
> indexes and merging their results into a bitarray, based on whether a
> Uri (a field that has to be extracted from a Document) matches.  Then from
> that bitarray we sort the results based on date.  Lucene relevancy (at
> least, the default tf-idf algorithm) is pretty useless when you're
> dealing with very heterogeneous documents, as we are.  And given that
> each backend has its own pair of Lucene indexes, normalizing and
> sorting those relevancies would also be necessary and a pain.
>
> > When beagled/xesam-adaptor gets a GetHits(150) it can say "uh, I don't
> > have that many hits cached, I better build the next 50 too". It can
> > then hold on to the iterator in case any more hits are requested (or
> > maybe even pre-fetch the next 100). When the search is closed it can
> > discard the iterator and cached hits.
>
> Well, the main problem here is that Beagle doesn't have a "paging"
> interface.  There's no way to say, "give me the next 50".  That might
> be possible with a requery, but it doesn't exist today.  Beagle stops
> building its Hits (which are the results and their properties) once it
> hits its max count.

Ok, I think I get the picture now. Thanks for the detailed explanation.
I'm still not convinced that a paging interface has to be quite this
costly, but let's let that rest for now.
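
For reference, the adaptor-side cache I had in mind is roughly the
following (pure illustration in Python; HitCache and fetch_hit are
made-up names and have nothing to do with Beagle's actual code):

    class HitCache:
        """Lazily materializes hits from an engine-side iterator so that
        successive GetHits(count) calls read sequentially."""

        def __init__(self, doc_id_iter, fetch_hit):
            self._ids = doc_id_iter    # cheap iterator over doc ids
            self._fetch = fetch_hit    # costly: doc id -> hit with its fields
            self._hits = []            # hits materialized so far
            self._pos = 0              # the client's read position

        def get_hits(self, count):
            # Only pay for the documents this read actually needs, and
            # keep them around in case the client reads further.
            while len(self._hits) < self._pos + count:
                try:
                    doc_id = next(self._ids)
                except StopIteration:
                    break
                self._hits.append(self._fetch(doc_id))
            batch = self._hits[self._pos:self._pos + count]
            self._pos += len(batch)
            return batch

When the search is closed, the whole cache (and the iterator behind it)
can simply be dropped.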

Back to the original problem: "should there be a settable max hit
count in the session properties of Xesam?"
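
To make "settable" concrete, what I would like a client to be able to
do is something like this (Python/dbus-python sketch; the bus name,
object path, and interface are the ones I read in the draft spec, and
the value type for vendor.maxhits is a guess):

    import dbus

    bus = dbus.SessionBus()
    searcher = dbus.Interface(
        bus.get_object('org.freedesktop.xesam.searcher',
                       '/org/freedesktop/xesam/searcher/main'),
        'org.freedesktop.xesam.Search')

    session = searcher.NewSession()
    # The call below is exactly what vendor.maxhits being read-only rules out:
    searcher.SetProperty(session, 'vendor.maxhits', dbus.UInt32(1000000))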

I think we need to pinpoint what exactly it is we need, so we don't go
and introduce session props all over the place. Here's a dump of my
thoughts...


What this really boils down to is that it would be handy for the
search engine to be able to anticipate the client's behavior and
optimize accordingly.

In Anders' case with tag harvesting we really need to be able to
retrieve each and every document, even if that means 1M docs in total.
One option is of course to leave things as they are and just say that
this is simply not possible in Xesam 1.0. I can accept that (maybe
others can't).
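
For Anders' harvesting case the client side would have to look roughly
like the sketch below (reusing the searcher/session objects from the
snippet above; the batch size is arbitrary, and I'm assuming GetHits
blocks until hits are available, so a short read means the result set
is exhausted):

    def harvest_all(searcher, session, query_xml, batch_size=100):
        # query_xml: a Xesam query in XML form.
        search = searcher.NewSearch(session, query_xml)
        searcher.StartSearch(search)
        hits = []
        while True:
            batch = searcher.GetHits(search, batch_size)
            hits.extend(batch)
            if len(batch) < batch_size:   # assumption: short read == done
                break
        searcher.CloseSearch(search)
        return hits

With vendor.maxhits stuck at its default, a loop like this simply
stops at the engine's limit instead of reaching all 1M docs.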

Other intentions/requirements the client might have include (a sketch
of how these could map to session properties follows further below):

 1) Whether or not it is interested in updates to the result set. This
allows the search engine to discard results immediately after they
have been read.

 2) If it *must* get the full result set - i.e. Anders' case again. If
the result set is incomplete, the app will fail or produce buggy data.

 3) That it is not an end-user app, but more of a "harvester" doing
additional analysis on the search results. These clients will tend to
act like Anders' client and consume the search results greedily.

Just allowing a settable hit-batch-size does not fix Anders' problem,
because 2) is still not known to the search engine.
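
To put the three intentions above in session-property terms, I imagine
something along these lines (search.live is already in the draft spec;
the other two property names are purely made up for the sake of
argument):

    # 1) not interested in live updates -> the engine may discard
    #    results as soon as they have been read
    searcher.SetProperty(session, 'search.live', False)

    # 2) the full result set is required; fail rather than truncate
    #    (made-up property name)
    searcher.SetProperty(session, 'search.complete', True)

    # 3) declare the client a harvester rather than an end-user app
    #    (made-up property name)
    searcher.SetProperty(session, 'client.profile', 'harvester')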

Cheers,
Mikkel

