[Xesam] Why is vendor.maxhits read-only?

Mon Dec 17 13:47:37 PST 2007

On 17/12/2007, Anders Rune Jensen <anders at iola.dk> wrote:
> On Dec 17, 2007 6:56 PM, Mikkel Kamstrup Erlandsen
> <mikkel.kamstrup at gmail.com> wrote:
> > On 17/12/2007, Anders Rune Jensen <anders at iola.dk> wrote:
> > > Hi
> > >
> > > I was wondering why vendor.maxhits is read-only? Beagle can natively
> > > set this, so it would be really nice to be able to set this using
> > > Xesam as well.
> > >
> > > Thanks
> >
> > No, it does not make sense for this to be writable. Perhaps the
> > confusion arises because the explanation is bad. Here's another try:
> >
> > vendor.maxhits is a hard implementation-level limit on the maximum
> > number of hits returnable. If you write a Lucene-based indexer this
> > will be your JVM's Integer.MAX_VALUE; other indexing frameworks might
> > set other limits (or none).
> >
> > An example at hand is a Google query. You can retrieve at most
> > 10,000 docs from a Google query - try it yourself (this has to do with
> > the distributed nature of the Google search engine - it is hard to get
> > rankings correct if you allow arbitrarily many docs to be fetched).
> >
> > The functionality you describe is also easily implemented on the
> > client side. I did this in xesam-tools' xesam.ui.HitPagerModel, for
> > example.
>
> Ok, maybe I misunderstand this completely, but it isn't always easy to
> get all the details from source in a jiffy (even if it's Python :-)).

No, not necessarily. I think you are thinking about databases; in
(most) database systems it is almost free to look up the values of the
fields. This is not necessarily so in a Lucene index, for instance.

> So what you suggest is that the cap is set very high on the number of
> results returned from the xesam backend and then you just disregard
> what you don't need on the client side? Is this what you meant or am I
> misunderstanding something?

Hmmm, sounds like it :-) Here's a sketch of a session as it could very
well transpire:

1) A client starts a search 'sh' in a session 'ses'.

2) The server performs a query over a few indexes, finds the first
batch of N hits, and emits HitsAdded(sh, N). In a Lucene world
the server would now hold doc-ids, which are just integer handles for
each hit.

3) Client detects the HitsAdded(sh, N) signal and requests data for
M < N hits by calling GetHits(sh, M)

4) The server collects the hit data for hits 0..M and returns it

5) Client receives hit data for hits 0..M and displays it to the user

6) Server finds Q more hits and emits HitsAdded(sh, Q)

7) Client does not need more hits and does nothing

8) The server waits for more requests for hits, or until the search or
session is closed

So the client does not "disregard what it doesn't need", but instead
only requests data for what it needs.
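
The steps above can be sketched as a small in-process simulation. This
is only an illustration under stated assumptions: ToySearchServer,
find_hits and run_session are hypothetical names invented here, there
is no real D-Bus or Xesam binding involved, and get_hits/on_hits_added
merely play the roles of GetHits and the HitsAdded signal.

```python
class ToySearchServer:
    """Hypothetical stand-in for a Xesam-like server (no D-Bus involved)."""

    def __init__(self, documents):
        self._documents = documents  # doc-id -> hit data (stand-in for an index)
        self._pending = {}           # search handle -> doc-ids found, not yet fetched
        self.data_lookups = 0        # counts the (potentially expensive) data lookups
        self.on_hits_added = None    # callback playing the role of HitsAdded

    def new_search(self, handle):
        self._pending[handle] = []

    def find_hits(self, handle, doc_ids):
        # The server only stores cheap integer doc-ids and announces how many it found.
        self._pending[handle].extend(doc_ids)
        if self.on_hits_added is not None:
            self.on_hits_added(handle, len(doc_ids))

    def get_hits(self, handle, count):
        # Hit data is collected only for the hits the client actually requests.
        pending = self._pending[handle]
        batch, self._pending[handle] = pending[:count], pending[count:]
        self.data_lookups += len(batch)
        return [self._documents[doc_id] for doc_id in batch]


def run_session(wanted=3):
    documents = {i: f"document-{i}" for i in range(10)}
    server = ToySearchServer(documents)
    shown = []

    def hits_added(handle, count):
        # Steps 3-5: request data for M < N hits and display them.
        need = wanted - len(shown)
        if need > 0:
            shown.extend(server.get_hits(handle, min(need, count)))
        # Step 7: otherwise the client needs nothing more and does nothing.

    server.on_hits_added = hits_added
    server.new_search("sh")                     # step 1
    server.find_hits("sh", [0, 1, 2, 3, 4])     # step 2: N = 5
    server.find_hits("sh", [5, 6, 7])           # step 6: Q = 3
    return shown, server.data_lookups
```

In this sketch the server finds eight hits in total but performs data
lookups only for the three the client asked for, which is the point of
announcing counts first and fetching data on request.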

> I could really see the usefulness of a flag to tell the backend how
> many results I'm interested in.

Yes, I can certainly see the use. We already discussed this a good
while back on xdg, and it was turned down[1].

Anyway, it is a more complicated matter than it might appear. As
pointed out in [1], what if I request a batch size of 100 and the
server finds 99 hits in the first go? It might very well be impossible
for the server to tell whether more hits are inbound or whether it
should just fire HitsAdded(99)...

If there really is a need, this could be added in a later revision
without breaking anything.

Cheers,
Mikkel

[1]: http://lists.freedesktop.org/archives/xdg/2007-July/008655.html