[Xesam] Why is vendor.maxhits read-only?

Joe Shaw joe at joeshaw.org
Wed Dec 19 07:03:09 PST 2007


Hi,

On 12/19/07, Mikkel Kamstrup Erlandsen <mikkel.kamstrup at gmail.com> wrote:
> I think the point that I don't understand is why you need to
> reconstruct the entire Document. I don't think Lucene is supposed to
> be used like that. Lucene's sorting by a non-tokenized field should be
> über fast.

Are you talking about Lucene's FieldSelector?  That allows you to
create a Document instance with only some of the fields loaded off
disk.  If so, that's a pretty new feature of Lucene (within the last
year) and one which hasn't made it into the .Net version yet.  We did
forward port it, though, and we use it in a few places.  We don't use
it in the timestamp case, actually, because filters can do additional
checks on other properties and reject documents as results (imagine a
file which doesn't exist on disk but is still in the index for
whatever reason).  But that's mostly an implementation detail.

If you're talking about iterating across terms in the Lucene index
using TermEnum, that is something we do.  The terms within a field are
walked in sorted order, and walking a term is about 2.5x faster than
building a document, from my profiling.  But if you have a million
documents in your index and only 5000 matches for a given query, it's
faster to build the documents for all 5000 matches and keep the top
100 than it is to walk across all 1 million terms (although you would
short circuit if you hit the 5000th match before the 1 millionth
document).
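
To put rough numbers on that trade-off, here is a minimal sketch in
Python.  It is only a cost model, not Beagle or Lucene code.  The
function names and the unit costs are hypothetical; the one figure
taken from above is the "about 2.5x" ratio.  It also assumes one term
per document and the worst case for the term walk (no short circuit).

```python
# Hypothetical relative costs: walking one term costs 1 unit, building
# one document costs 2.5 units (the "about 2.5x" figure from profiling).
TERM_WALK_COST = 1.0
DOC_BUILD_COST = 2.5


def cost_build_documents(num_matches):
    """Cost of building a Document for every match, then keeping the top N."""
    return num_matches * DOC_BUILD_COST


def cost_walk_terms(total_docs):
    """Worst-case cost of walking every term in the field (one per document)."""
    return total_docs * TERM_WALK_COST


def break_even_matches(total_docs):
    """Number of matches at which both strategies cost the same."""
    return total_docs * TERM_WALK_COST / DOC_BUILD_COST


def cheaper_strategy(total_docs, num_matches):
    """Name the cheaper strategy under this simplified model."""
    build = cost_build_documents(num_matches)
    walk = cost_walk_terms(total_docs)
    return "build_documents" if build < walk else "walk_terms"


if __name__ == "__main__":
    # The example from the text: 1 million documents, 5000 matches.
    print(cheaper_strategy(1_000_000, 5_000))
    print(break_even_matches(1_000_000))
```

Under these assumptions, building documents wins until the match count
gets to roughly 40% of the index; a short circuit in the term walk
would move that point lower.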

> Yes, pulling entire Lucene Documents off the disk is a pain. But you
> need a really, really good reason to use Lucene like that.

Well, you have to pull Documents sooner or later.  They hold all of
your fields, and that's where we store our metadata.

> > It's not an unreasonable use case.  When I was working at Novell my
> > desktop had about 200k files and another 700k emails in addition to
> > various other sources of data.  Searching for "gnome" understandably
> > returned somewhere between 350k and 500k results.
>
> Valid use case indeed.
>
> I can match you on file count, but I think I need a lot more friends
> or to subscribe to more mailing lists :-)

Heh, yeah.  Mailing lists (including GNOME CVS/SVN commits) for 11 years. :)

Joe

