[Xesam] Need paged search mode for xesam

Jamie McCracken jamie.mccrack at googlemail.com
Tue May 6 08:20:15 PDT 2008


You mean pull results over D-Bus and then page at the client?

That's inefficient - pulling 10,000 hits over D-Bus is insanely slow (even
just the URIs).

Paging is a must-have in my book; otherwise the Tracker API will have to
be used instead of Xesam whenever paged results are desired (more likely
we will add paged search to Xesam on top of the standard).
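As a rough sketch of the difference being argued over, compare how much data crosses the bus under each access pattern. Nothing below is real Xesam API; the function names and in-memory hit list are purely illustrative:

```python
# Hypothetical sketch contrasting paged and streaming access to a hit
# list. It only illustrates how much data is transferred in each model.

def get_hits_paged(hits, offset, count):
    """Paged model: transfer only the requested window."""
    return hits[offset:offset + count]

def read_all_streaming(hits, batch=100):
    """Streaming model: the client reads every batch sequentially."""
    out, pos = [], 0
    while pos < len(hits):
        out.extend(hits[pos:pos + batch])
        pos += batch
    return out

hits = ["file:///doc%d" % i for i in range(10000)]
page = get_hits_paged(hits, 1000, 10)   # 10 URIs transferred
everything = read_all_streaming(hits)   # all 10,000 URIs transferred
```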

jamie


On Tue, 2008-05-06 at 17:12 +0200, Mikkel Kamstrup Erlandsen wrote:
> 2008/5/6 Jamie McCracken <jamie.mccrack at googlemail.com>:
>         
>         
>         On Tue, 2008-05-06 at 16:57 +0200, Mikkel Kamstrup Erlandsen wrote:
>         > 2008/5/2 Mikkel Kamstrup Erlandsen <mikkel.kamstrup at gmail.com>:
>         >         I have a handful of comments about this (Jos also
>         >         asked about the same on IRC recently).
>         >         It was in fact a design decision, but I am writing
>         >         this from my mobile since I'm on holiday, so I'll
>         >         elaborate when I get home Tuesday.
>         >
>         >         Cheers,
>         >         Mikkel
>         >
>         > As promised...
>         >
>         > Let's first establish some terminology. A Paged Model is one
>         > where you can request hits with an offset and a count. A
>         > Streaming Model is one like we have now, where you specify
>         > how many hits to read on each request and then read hits
>         > sequentially (like file reading without seeking).
>         >
>         > It should be noted that the Xesam Search spec is designed
>         > for desktop search (not generic database search or
>         > Google-style web search with millions of hits). Furthermore,
>         > it should be feasible to implement on a host of different
>         > backends, not just full-fledged search engines.
>         >
>         > There are basically three kinds of backends where a paged
>         > model can be problematic: web services, aggregated searches,
>         > and grep/find-like implementations.
>         >
>         >  * Web services. While Google's GData Query API does allow
>         > paging, not all web services do. For example, the OAI-PMH[1]
>         > standard does not do paging, merely sequential reading. Of
>         > course OAI-PMH is a standard for harvesting metadata, but I
>         > could imagine a "search engine" extracting metadata from the
>         > OAI-PMH result on the fly.
>         >
>         >  * Aggregated search. Consider a setup where the Xesam
>         > search engine is proxying a collection of other search
>         > engines. It is a classical problem to look up hits 1000-1010
>         > in this setup: the search engine has to retrieve the first
>         > 1010 hits from every sub-search engine to get it right.
>         > Maybe there is an algorithm to do this more cleverly, but I
>         > have not heard of it. This is of course also a problem in a
>         > streaming model, but a streaming model will not trick
>         > developers into believing that GetHits(s, 1000, 1010) is a
>         > cheap call.
>         >
>         >  * Grep-like backends, or more generally backends where the
>         > search results roll in sequentially.
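The aggregation cost described above can be sketched concretely: to serve hits [offset, offset + count) of a merged result, the proxy must pull the first offset + count hits from every sub-engine. The merge function and score tuples below are purely illustrative:

```python
# Illustrative sketch of the aggregated-search problem: random access
# to a deep page forces deep reads from every sub-engine.
import heapq

def merged_page(sub_results, offset, count):
    # sub_results: per-engine hit lists, each already sorted by
    # descending score, as (score, uri) tuples.
    needed = offset + count
    top = [engine[:needed] for engine in sub_results]  # 1010 hits each!
    merged = heapq.merge(*top, key=lambda hit: -hit[0])
    return list(merged)[offset:offset + count]

engine_a = [(10000 - i, "a%d" % i) for i in range(5000)]
engine_b = [(9999.5 - i, "b%d" % i) for i in range(5000)]
page = merged_page([engine_a, engine_b], 1000, 10)
```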
>         >
>         > I think it is a bad time to break the API like this. It is
>         > in fact quite a big break if you ask me, since our current
>         > approach has been stream-based and what you propose changes
>         > the paradigm to a page-based model. It is also bad because
>         > it sends the wrong signal to make such an important change
>         > at the last minute.
>         >
>         > I see a few API-stable alternatives though:
>         >
>         > 1) Add SeekHit(in s search, in i hit_id, out i new_pos). This
>         > basically adds a cursoring mechanism to the API.
>         > 2) In the style of 1) but lighter - add SkipHits(in s search,
>         > in i count, out i new_pos).
>         >
>         > These options also stay within the standard streaming
>         > terminology. We could make them optional by making them throw
>         > exceptions if the (new) session property vendor.paging is
>         > True.
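With a SkipHits-style call, a client could emulate paged access on top of the streaming calls. The class below is a toy stand-in for a search session, and the method signatures are assumptions based on the proposal above, not spec text:

```python
# Toy stand-in for a streaming search session, extended with the
# proposed SkipHits call. Signatures are assumptions, not Xesam spec.

class StreamingSearch:
    def __init__(self, hits):
        self.hits = hits
        self.pos = 0  # read cursor, as in the current streaming model

    def SkipHits(self, count):
        """Advance the cursor without transferring any hits."""
        self.pos = min(self.pos + count, len(self.hits))
        return self.pos

    def GetHits(self, count):
        """Read the next `count` hits sequentially from the cursor."""
        batch = self.hits[self.pos:self.pos + count]
        self.pos += len(batch)
        return batch

def get_page(search, offset, count):
    """Paged access emulated on top of the streaming calls."""
    search.SkipHits(offset)       # cheap: no hit data crosses the bus
    return search.GetHits(count)  # only the requested page is read

search = StreamingSearch(["hit-%d" % i for i in range(10000)])
page = get_page(search, 1000, 10)
```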
>         >
>         > As Jos also points out later in the thread, GetHitData is
>         > actually paging, and the workaround he describes can be made
>         > very efficient since we already have the hit.fields.extended
>         > session property to hint which properties we will fetch.
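The GetHitData workaround can be sketched as follows: because hits are addressed by explicit ids, property lookup is already random-access. The dict-backed store and field names here are toy illustrations, not spec details:

```python
# Sketch of the GetHitData workaround: explicit hit ids give random
# access without streaming through the preceding hits. The store and
# fields are illustrative only.

def GetHitData(store, hit_ids, fields):
    """Return the requested fields, in order, for each hit id."""
    return [[store[h][f] for f in fields] for h in hit_ids]

store = {i: {"uri": "file:///doc%d" % i, "title": "Doc %d" % i}
         for i in range(100)}

# Hits 42-44 fetched directly, without reading hits 0-41 first:
rows = GetHitData(store, [42, 43, 44], ["uri", "title"])
```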
>         >
>         > Let me make it clear that I am not refusing the change to a
>         > paging model if that is what the majority rules. We should
>         > just make an informed decision that we are sure we agree on.
>         >
>         
>         
>         
>         I'm proposing adding new API, not breaking existing calls. The
>         existing stuff can easily emulate paging if it lacks native
>         support.
>         
>         I would prefer a new API that takes a start-point param and a
>         count/length param so we have full random access.
> 
> And how is GetHitData not good enough for that?
> 
> Cheers,
> Mikkel 
> 


