[Xesam] Need paged search mode for xesam

Tue May 6 08:07:16 PDT 2008

On Tue, 2008-05-06 at 16:57 +0200, Mikkel Kamstrup Erlandsen wrote:
> 2008/5/2 Mikkel Kamstrup Erlandsen <mikkel.kamstrup at gmail.com>:
>         I have a handful of comments about this (Jos also asked about
>         the same on IRC recently).
>         It was in fact a design decision, but I am writing this from
>         my mobile since I'm on holiday, so I'll elaborate when I get
>         home Tuesday.
>         
>         Cheers,
>         Mikkel
> 
> As promised...
> 
> Let's first establish some terminology. A Paged Model is one where you
> can request hits with an offset and a count. A Streaming Model is one
> like we have now, where you specify how many hits to read on each
> request and then read hits sequentially (like file reading without
> seeking).
> 
> It should be noted that the Xesam Search spec is designed for desktop
> search (and not generic search on a database or Google-style web
> search with millions of hits). Furthermore it should be feasible to
> implement in a host of different backends, not just full-fledged
> search engines.
> 
> There are basically three backends where a paged model can be
> problematic: web services, aggregated searches, and grep/find-like
> implementations.
> 
>  * Web services. While Google's GData Query API does allow paging, not
> all web services do this. For example the OAI-PMH[1] standard does
> not do paging, merely sequential reading. Of course OAI-PMH is a
> standard for harvesting metadata, but I could imagine a "search
> engine" extracting metadata from the OAI-PMH result on the fly.
> 
>  * Aggregated search. Consider a setup where the Xesam search engine
> is proxying a collection of other search engines. It is a classical
> problem to look up hits 1000-1010 in this setup. The search engine
> will have to retrieve the first 1010 hits from all sub-search engines
> to get it right. Maybe there is an algorithm that does this more
> cleverly, but I have not heard of it. This is of course also a problem
> in a streaming model, but it will not trick developers into believing
> that GetHits(s, 1000, 1010) is a cheap call.
>  
>  * Grep-like backends or more generally backends where the search
> results will roll in sequentially.
> 
> I think it is a bad time to break the API like this. It is in fact a
> quite big break if you ask me, since our current approach has been
> stream-based and what you propose is changing the paradigm to a
> page-based model. It is also bad because making such an important
> change at the last minute sends the wrong signal.
> 
> I see a few API-stable alternatives though.
> 
> 1) Add a SeekHit(in s search, in i hit_id, out i new_pos). This
> basically adds a cursoring mechanism to the API.
> 2) In the style of 1) but lighter: add SkipHits(in s search, in i count,
> out i new_pos).
> 
> These options also stay within the standard streaming terminology. We
> could make them optional by making them throw exceptions if the (new)
> session property vendor.paging is True.
> 
> As Jos also points out later in the thread, GetHitData is actually
> paging, and the workaround he describes can be made very efficient
> since we already have the hit.fields.extended session prop to hint
> what properties we will fetch.
> 
> Let me make it clear that I am not refusing the change to a paging
> model if that is what the majority rules. We should just make an
> informed decision that we are sure we agree on.
> 

I'm proposing adding new API, not breaking the existing ones. The existing
stuff can easily emulate paging if it lacks native support.

I would prefer a new API that takes a start point param and a count/length
param so we have full random access.
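
To illustrate the emulation idea, here is a minimal sketch in Python of a
paged call layered on top of a sequential-only reader. None of these names
come from the Xesam spec: StreamingSearch, PagedAdapter, get_hits and
get_page are all hypothetical, and the in-memory cache is just one possible
way to do it.

```python
from itertools import islice
from typing import Iterable, Iterator, List


class StreamingSearch:
    """Hypothetical stand-in for a streaming search: hits are read in order only."""

    def __init__(self, hits: Iterable[str]) -> None:
        self._hits: Iterator[str] = iter(hits)

    def get_hits(self, count: int) -> List[str]:
        """Return up to `count` further hits; an empty list means no more hits."""
        return list(islice(self._hits, count))


class PagedAdapter:
    """Hypothetical wrapper that emulates (start, count) access by caching hits."""

    def __init__(self, search: StreamingSearch) -> None:
        self._search = search
        self._cache: List[str] = []   # every hit read from the stream so far
        self._exhausted = False

    def get_page(self, start: int, count: int) -> List[str]:
        if start < 0 or count < 0:
            raise ValueError("start and count must be non-negative")
        end = start + count
        # Read forward sequentially until the cache covers the requested range.
        while len(self._cache) < end and not self._exhausted:
            batch = self._search.get_hits(end - len(self._cache))
            if not batch:
                self._exhausted = True
            self._cache.extend(batch)
        return self._cache[start:end]


if __name__ == "__main__":
    paged = PagedAdapter(StreamingSearch(f"hit-{i}" for i in range(25)))
    print(paged.get_page(10, 5))   # forces hits 0..14 to be read first
    print(paged.get_page(0, 3))    # served from the cache
```

Note that in this sketch asking for a page starting at offset 1000 still
reads the first 1000 hits from the backend, so the emulation gives random
access in the API but not a cheap call.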

jamie