[Xesam] Why is vendor.maxhits read-only?

Wed Dec 19 01:05:53 PST 2007

On 19/12/2007, Joe Shaw <joe at joeshaw.org> wrote:
> Hi,
>
> On 12/18/07, Mikkel Kamstrup Erlandsen <mikkel.kamstrup at gmail.com> wrote:
> > Ok, I think I get the picture now. Thanks for the detailed explanation.
> > I'm still not sure that I can accept that it should be so costly to
> > have a paging interface, but let us let that rest for now.
>
> The paging interface itself isn't necessarily the expensive part; it's
> creating a Lucene Document instance for every hit and then sorting
> them somehow by date.

I think the point I don't understand is why you need to reconstruct
the entire Document. I don't think Lucene is supposed to be used like
that. Lucene's sorting by a non-tokenized field should be über fast.
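
A minimal sketch of the idea, in plain Python rather than actual Lucene
code: if the sort key (here a date) is available per document id without
touching the stored document, the hits can be ranked on ids alone and
only the requested page needs to be loaded. The names date_by_doc,
load_document and page_by_date are hypothetical, and the sample dates
are made up for illustration.

```python
import heapq

# Hypothetical stand-in for a per-document column of sort keys
# (one date value per doc id), held in memory apart from the stored documents.
date_by_doc = [20071201, 20071219, 20071105, 20071218, 20071210, 20071130]

loads = []  # records which full documents were actually fetched


def load_document(doc_id):
    """Hypothetical expensive fetch of a full stored document."""
    loads.append(doc_id)
    return {"id": doc_id, "date": date_by_doc[doc_id]}


def page_by_date(hit_ids, offset, count):
    """Return one page of hits, newest first, loading only that page."""
    # Rank doc ids using the in-memory sort keys only; no document is loaded here.
    top = heapq.nlargest(offset + count, hit_ids, key=lambda d: date_by_doc[d])
    # Only the documents on the requested page are fetched.
    return [load_document(d) for d in top[offset:offset + count]]


hits = [0, 1, 2, 3, 4, 5]
page = page_by_date(hits, offset=2, count=2)
```

In this sketch the cost of fetching documents grows with the page size
rather than with the total number of hits; the ranking step still visits
every hit id, but only compares small in-memory values.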

The Xesam search API actually maps quite well to Lucene (by intent)
and I think it should be quite easy to create something very efficient
on top of Lucene (i.e. mapping the Xesam API to Lucene is the easy
part; creating the rest of the indexer is not).

> Like I mentioned in my reply to Jamie, doing the paging if we have all
> this data is pretty trivial.  But getting that data is what's a pain.
> Without even considering CPU or memory usage, just think of the amount
> of time it would take to pull several hundred thousand hits off the
> disk.

Yes, pulling entire Lucene Documents off the disk is a pain. But you
need a really, really good reason to use Lucene like that.

> It's not an unreasonable use case.  When I was working at Novell my
> desktop had about 200k files and another 700k emails in addition to
> various other sources of data.  Searching for "gnome" understandably
> returned somewhere between 350-500k results.

Valid use case indeed.

I can match you on file count, but I think I need a lot more friends,
or to subscribe to more mailing lists :-)

Cheers,
Mikkel