[Xesam] Why is vendor.maxhits read-only?

Joe Shaw joe at joeshaw.org
Tue Dec 18 12:57:53 PST 2007


Hi,

On 12/18/07, Mikkel Kamstrup Erlandsen <mikkel.kamstrup at gmail.com> wrote:
> Yeah, I understand that. I was wondering why the default is not just
> MAXINT? Why is it that it requires "quite a bit more memory and CPU" as
> you say below? This is not my usual experience using Lucene on several
> million docs.

Well, as you also know, with Lucene the doc IDs are pretty useless by
themselves.  The real cost is in turning those doc IDs into Document
instances.  We need to do that for a few reasons, and creating them is
costly (in disk seeks) and keeping them around is costly (in memory
usage).  The processing we do on them (for instance, to sort them by
date) is costly in CPU usage.
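
To make that concrete, here's a rough Lucene sketch (Java; the
"Timestamp" field, the class and method names, and the surrounding
setup are all just illustrative, not Beagle's actual code).  Getting
the doc IDs is cheap; each doc() call after that is a stored-field
read that can mean a disk seek, and every materialized Document then
has to stay in memory until the sort is done:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;

    class HitMaterializer {
        static List<Document> materialize(IndexSearcher searcher,
                                          Query query,
                                          int maxHits) throws Exception {
            // Doc IDs: cheap to get.
            ScoreDoc[] hits = searcher.search(query, maxHits).scoreDocs;
            List<Document> docs = new ArrayList<Document>();
            for (ScoreDoc hit : hits)
                docs.add(searcher.doc(hit.doc));  // stored-field read per hit
            // Sorting by a (hypothetical) stored date field keeps every
            // Document live in memory and costs CPU to pull the field
            // out of each one.
            Collections.sort(docs, new Comparator<Document>() {
                public int compare(Document a, Document b) {
                    return a.get("Timestamp").compareTo(b.get("Timestamp"));
                }
            });
            return docs;
        }
    }

So the cost scales with how many hits you materialize, which is why
the default isn't just MAXINT.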

> Here's another shot then :-) This gets semi-Lucene-technical, so hang on...
>
> By "sequential" I mean that GetHits(count) is basically a read()
> method that reads sequentially through the result set.
>
> When you receive your results from Lucene you basically get an
> iterator over the doc ids. Then I assume from your description that
> you fetch the relevant (stored?) fields for the first 100 hits in that
> iterator.

If only it were that simple. :)  We're actually searching across two
indexes and merging their results into a bitarray based on whether a
Uri (a field that has to be extracted from each Document) matches.
Then from that bitarray we sort the results by date.  Lucene
relevancy (at least, the default tf-idf algorithm) is pretty useless
when you're dealing with very heterogeneous documents, as we are.
Also, given that each backend has its own pair of Lucene indexes,
normalizing and sorting those relevancies across backends would be
necessary, and a pain.
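
Very roughly, and with everything here hypothetical (the real code is
C# and more involved than a simple intersection), the merge looks
something like this: pull the Uri out of each match in one index,
then flip a bit for every match in the other index whose Uri also
showed up.  Note that extracting the Uri is itself the expensive
Document-materialization step from above:

    import java.util.BitSet;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;

    class UriMerge {
        static BitSet merge(IndexSearcher primary, ScoreDoc[] primaryHits,
                            IndexSearcher secondary, ScoreDoc[] secondaryHits,
                            int maxDoc) throws Exception {
            // Collect the Uris matched in the second index.  Each
            // doc() call materializes a Document just to read one field.
            Set<String> secondaryUris = new HashSet<String>();
            for (ScoreDoc hit : secondaryHits)
                secondaryUris.add(secondary.doc(hit.doc).get("Uri"));

            // Flip a bit for each primary doc whose Uri also matched.
            BitSet matches = new BitSet(maxDoc);
            for (ScoreDoc hit : primaryHits) {
                String uri = primary.doc(hit.doc).get("Uri");
                if (secondaryUris.contains(uri))
                    matches.set(hit.doc);
            }
            return matches;  // the docs behind these bits get sorted by date
        }
    }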

> When beagled/xesam-adaptor gets a GetHits(150) it can say "uh, I don't
> have that many hits cached, I better build the next 50 too". It can
> then hold on to the iterator in case any more hits are requested (or
> maybe even pre-fetch the next 100). When the search is closed it can
> discard the iterator and cached hits.

Well, the main problem here is that Beagle doesn't have a "paging"
interface.  There's no way to say, "give me the next 50".  That might
be possible with a requery, but it doesn't exist today.  Beagle stops
building its Hits (which are the results and their properties) once it
hits its max count.
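
For what it's worth, the kind of thing you're describing would look
something like this hypothetical wrapper (nothing like it exists in
Beagle today):

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    // Hypothetical paging cache: hold on to the lazy hit iterator,
    // build hits only as far as the largest GetHits(count) seen so
    // far, and throw it all away when the search is closed.
    class HitCache<T> {
        private final Iterator<T> source;
        private final List<T> built = new ArrayList<T>();

        HitCache(Iterator<T> source) { this.source = source; }

        synchronized List<T> getHits(int count) {
            while (built.size() < count && source.hasNext())
                built.add(source.next());  // the costly doc() step
            return built.subList(0, Math.min(count, built.size()));
        }
    }

But since Beagle stops building Hits at its max count, there's no
iterator left over to hand to something like this without a requery.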

Joe

