[Xesam] Why is vendor.maxhits read-only?
Mikkel Kamstrup Erlandsen
mikkel.kamstrup at gmail.com
Wed Dec 19 01:05:53 PST 2007
On 19/12/2007, Joe Shaw <joe at joeshaw.org> wrote:
> Hi,
>
> On 12/18/07, Mikkel Kamstrup Erlandsen <mikkel.kamstrup at gmail.com> wrote:
> > Ok, I think I get the picture now. Thanks for the detailed explanation.
> > I'm still not sure that I can accept that it should be so costly to
> > have a paging interface, but let us let that rest for now.
>
> The paging interface itself isn't necessarily the expensive part; it's
> creating a Lucene Document instance for every hit and then sorting
> them somehow by date.
I think the point I don't understand is why you need to
reconstruct the entire Document. I don't think Lucene is supposed to
be used like that. Lucene's sorting by a non-tokenized field should be
über fast.
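To make that concrete, here is a toy sketch in Python. It is not Lucene's
API: ToyIndex, load_document and page_by_date are names made up for
illustration, and the "index" is just an in-memory list of sort keys (one
date per document id) next to a dict standing in for the stored documents
on disk. The assumption is that the sort field is cheap to read for every
hit while a full document is expensive to fetch, so the hits are ranked on
the keys alone and only the requested page is loaded.

```python
import heapq
from dataclasses import dataclass


@dataclass
class ToyIndex:
    """Hypothetical stand-in for an index (not a Lucene class)."""
    sort_keys: list   # sort_keys[doc_id] = date as an int, assumed to be in memory
    stored: dict      # doc_id -> full stored fields, assumed expensive to read
    loads: int = 0    # counts how many full documents were fetched

    def load_document(self, doc_id):
        # The costly step: this is the only place a full document is touched.
        self.loads += 1
        return self.stored[doc_id]


def page_by_date(index, hit_ids, offset, count):
    """Return one page of hits, newest first, loading only that page."""
    # Rank on the in-memory sort keys only; no stored document is read here.
    top = heapq.nlargest(offset + count, hit_ids,
                         key=lambda doc_id: index.sort_keys[doc_id])
    # Fetch full documents just for the slice the client asked for.
    return [index.load_document(doc_id) for doc_id in top[offset:offset + count]]
```

With a thousand hits, asking for five results at offset ten ranks all the
hits by their keys but calls load_document only five times. How closely a
real backend can match this depends on how it stores its sort fields, so
take it as an illustration of the idea rather than a measurement.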
The Xesam search API actually maps quite well to Lucene (by intent)
and I think it should be quite easy to create something very efficient
on top of Lucene (i.e. mapping the Xesam API to Lucene is the easy
part; creating the rest of the indexer is not).
> Like I mentioned in my reply to Jamie, doing the paging if we have all
> this data is pretty trivial. But getting that data is what's a pain.
> Without even considering CPU or memory usage, just think of the amount
> of time it would take to pull several hundred thousand hits off the
> disk.
Yes, pulling entire Lucene Documents off the disk is a pain. But you
need a really, really good reason to use Lucene like that.
> It's not an unreasonable use case. When I was working at Novell my
> desktop had about 200k files and another 700k emails in addition to
> various other sources of data. Searching for "gnome" understandably
> returned somewhere between 350-500k results.
Valid use case indeed.
I can match you on file count, but I think I need a lot more friends
or to subscribe to more mailing lists :-)
Cheers,
Mikkel