[Xesam] Why is vendor.maxhits read-only?

Joe Shaw joe at joeshaw.org
Wed Dec 26 14:56:57 PST 2007


Hi,

Sorry about the delay, I was away.

On 12/19/07, Mikkel Kamstrup Erlandsen <mikkel.kamstrup at gmail.com> wrote:
> From what I understood you needed to build Documents because you
> wanted to do post sorting on some of the fields. Lucene's
> Searcher.search(Query q, Sort s) is really fast and the sorter has
> access to non-tokenized fields without creating the Documents (even
> non-stored ones). This requires storing the mtime as an integer or
> long though.

Searcher.search(Query q, Sort s) only sorts a subset of the documents
returned from a search (100 by default).  The sort doesn't apply over
the entire result space.

Things tend to "just work" however because the field sorters assert
that the documents are indexed in order.  That is, if you're sorting
by a timestamp field, you have to index the oldest document first and
the newest last.  Beagle doesn't work that way -- it indexes files as
it comes across them -- and it would be prohibitive to try to do
otherwise.

Because we have to search two Lucene indexes for one set of results
and because we have to potentially walk the entire result space, we
use a much lower level API than the one which returns Hits
collections.  It seems from the Lucene mailing lists that use of the
Hits API is largely discouraged in most non-trivial search
applications.
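
To illustrate what walking the entire result space of two indexes
involves, here is a rough sketch in Python.  It does not use Lucene or
Beagle code at all: the hit tuples, the index names, and the
newest_hits() helper are all made up for the example.  It only shows
the general idea of visiting every hit from both sources while keeping
the N newest in a bounded heap.

```python
import heapq
from typing import Iterable, List, Tuple

# Hypothetical hit record: (timestamp, doc_id, index_name).
# Tuples compare by timestamp first, so the heap orders hits by age.
Hit = Tuple[int, int, str]


def newest_hits(primary: Iterable[Hit],
                secondary: Iterable[Hit],
                limit: int) -> List[Hit]:
    """Visit every hit from both indexes and keep the `limit` newest."""
    if limit <= 0:
        return []
    heap: List[Hit] = []  # min-heap; heap[0] is the oldest hit kept so far
    for source in (primary, secondary):
        for hit in source:
            if len(heap) < limit:
                heapq.heappush(heap, hit)
            elif hit > heap[0]:
                # Newer than the oldest kept hit, so swap it in.
                heapq.heapreplace(heap, hit)
    # Newest first.
    return sorted(heap, reverse=True)


if __name__ == "__main__":
    primary = [(30, 1, "primary"), (10, 2, "primary"), (50, 3, "primary")]
    secondary = [(40, 1, "secondary"), (20, 2, "secondary")]
    for hit in newest_hits(primary, secondary, 3):
        print(hit)
```

Because every hit is visited, the result does not depend on the order
in which documents were indexed, and memory stays bounded by the limit
rather than by the size of the result space.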

> Also assuming that you don't have more than a few stored fields it
> should still be fairly fast to create the Documents via Hits.doc(int
> i) since it only adds the stored fields to the doc.

All of our metadata lives in stored fields: timestamp, MIME type, file
name, email subject line, etc.  So there are a fair number of stored
fields for each document.  Remember, the penalty here is disk seek
time, not the amount of data pulled off disk.

> One hack we use at work is to encode the needed field data in one
> stored field and then parse that blob for each hit and using the data
> for display.

Yeah, this may be something we could do for Beagle, but newer Lucenes
allow us to pull stuff on demand.  That's probably the biggest gain we
can get.
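
For reference, the single-blob hack from the quote could look roughly
like the sketch below.  The JSON encoding, the field names, and the
pack_fields()/unpack_fields() helpers are assumptions made for the
example; neither Beagle nor the quoted setup is known to work this way.

```python
import json
from typing import Dict


def pack_fields(fields: Dict[str, str]) -> str:
    """Encode all display metadata into one string for a single stored field."""
    # sort_keys keeps the blob stable for identical metadata.
    return json.dumps(fields, sort_keys=True, separators=(",", ":"))


def unpack_fields(blob: str) -> Dict[str, str]:
    """Parse the blob read back for one hit into its individual fields."""
    return json.loads(blob)


if __name__ == "__main__":
    # Hypothetical metadata for one document.
    blob = pack_fields({"mime": "text/plain", "subject": "Re: maxhits"})
    print(unpack_fields(blob)["subject"])
```

The point of the hack is that each hit needs only one stored value to
be read and parsed for display, instead of one read per field.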

> What does "pull" mean exactly in the case in point? Just calling
> Hits.doc(i) or is it a full rebuilding of the Document as it was added
> to the index? I guess I've read it as more than doing a Hits.doc() at
> least...

Calling IndexSearcher.Doc(doc_id), which results in a Document object.

Joe

