[XESAM] API simplification?

Jos van den Oever jvdoever at gmail.com
Fri Jul 20 11:45:52 PDT 2007


2007/7/20, Mikkel Kamstrup Erlandsen <mikkel.kamstrup at gmail.com>:
> > I completely agree with all suggestions.
> > One more suggestion: the minimum interval between result signals
> > should be sane or settable.
>
> Valid point. To avoid signal spamming, I take it. How about a session
> property hit.batch.size that is an integer determining how many hits the
> server should collect before emitting HitsAdded? In case the entire index
> has been searched but fewer than hit.batch.size hits have been found,
> HitsAdded(num_hits) should be emitted right before SearchDone.

I would prefer setting this in terms of milliseconds, not a number of
hits. Imagine the batch size is set to 100: hits 1-99 arrive within
1 ms, but hit #100 takes 20 seconds, so the client sees nothing at all
for 20 seconds. If you instead require that the time between signals be
at least 100 ms, you solve the problem more elegantly.
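
For illustration, a minimal Python sketch of the time-based rule (the
class and method names are made up for the example, not spec; 'emit'
stands in for whatever D-Bus signal emission the server actually uses):

import time

class HitSignaller:
    """Rate-limit HitsAdded: at most one emission per min_interval_ms."""

    def __init__(self, emit, min_interval_ms=100):
        self.emit = emit
        self.min_interval = min_interval_ms / 1000.0
        self.pending = 0                     # hits collected, not yet signalled
        self.last_emit = time.monotonic()

    def add_hit(self):
        self.pending += 1
        if time.monotonic() - self.last_emit >= self.min_interval:
            self.flush()

    def flush(self):
        # Call this once more right before SearchDone, so the trailing
        # partial batch is never lost.
        if self.pending:
            self.emit(self.pending)          # HitsAdded(num_hits)
            self.pending = 0
        self.last_emit = time.monotonic()

Calling flush() once more right before SearchDone also covers the
end-of-index case from the quote above.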

> > On the topic of remembering the hits:
> > in an ideal world, the server could be clever and derive the right
> > file from the hit number. In reality, this is quite hard. At the
> > moment the server has to keep a vector of URIs internally. I think
> > we should allow the server to have a sane maximum of retrievable
> > hits. E.g. CountHits might return 1 million, but you would only be
> > able to retrieve the first 100k.
>
> This makes sense, given that the scoring algorithms on the servers are
> good enough. But judging by the extraordinary amount of talent we have
> in the server-side dev camp, this is no problem of course :-)
The problem is not in the scoring algorithms, but in the data changing
on disk. If you do not get the list of URIs all at once, it may change
as files on disk change. I say we should ignore this problem as long as
a URI has not yet been requested, and specify that the result list is
not fixed until it is actually requested.
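
In code, that contract could look like this (a sketch only; HitList,
update_results and get_hits are invented for the example). The
max_pinned cap corresponds to the maximum of retrievable hits quoted
above, and anticipates the "maximum history size" idea further down:

class HitList:
    """Results stay fluid until retrieved: the engine may rewrite the
    ordering at will, but a hit whose URI was already handed to the
    client is pinned and stays reproducible at its index."""

    def __init__(self, max_pinned=100000):
        self.uris = []         # current engine ordering, may still change
        self.pinned = {}       # index -> URI, frozen on first retrieval
        self.max_pinned = max_pinned

    def update_results(self, new_uris):
        # Data on disk changed: unpinned entries are simply replaced.
        self.uris = list(new_uris)

    def get_hits(self, offset, count):
        out = []
        for i in range(offset, offset + count):
            if i not in self.pinned:
                if i >= len(self.uris) or len(self.pinned) >= self.max_pinned:
                    break      # past the end, or past the retrievable maximum
                self.pinned[i] = self.uris[i]  # the result becomes fixed here
            out.append(self.pinned[i])
        return out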

> How about a read-only session property search.maxhits? We could specify
> that in order to be XESAM compliant this value must be > 1000 or
> something, just so that apps won't need sanity checks galore.
Sounds good if used in addition to my suggestion above.
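
To show why the guarantee helps clients, here is a hypothetical
retrieval loop (again only a sketch; session.count_hits,
session.get_property and session.get_hits stand in for the
corresponding D-Bus calls, the names are made up):

def fetch_uris(session, search_id, page_size=100):
    """Page through results without ever asking past search.maxhits."""
    # CountHits may report more hits than are actually retrievable,
    # so clamp to the read-only search.maxhits property.
    limit = min(session.count_hits(search_id),
                session.get_property("search.maxhits"))
    uris = []
    offset = 0
    while offset < limit:
        batch = session.get_hits(search_id, offset,
                                 min(page_size, limit - offset))
        if not batch:
            break              # defensive: fewer hits than promised
        uris.extend(batch)
        offset += len(batch)
    return uris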

> > This is actually a scalability issue. We should allow the search to
> > modify the vector as long as a hit has not yet been retrieved, and
> > only guarantee reproducibility for hits that have already been
> > retrieved. In combination with a maximum history size, this would
> > handle most performance problems.
>
> Yeah, we are handling the exact same problems at work :-) I think we
> have solved it here (at least up to 100M or so), but it is not exactly
> client-side software...

