[Xesam] Need paged search mode for xesam

Jamie McCracken jamie.mccrack at googlemail.com
Tue May 6 08:28:43 PDT 2008


On Tue, 2008-05-06 at 17:26 +0200, Mikkel Kamstrup Erlandsen wrote:
> 2008/5/6 Jamie McCracken <jamie.mccrack at googlemail.com>:
>         you mean pull results over dbus and then page at client?
> 
> No. The signature of GetHitData is (in s search_handle, in au hit_ids,
> in as fields, out aav hits)
> 
> I.e. you request which hit ids to fetch. To fetch a page pass
> [n, n+1, ..., n+page_size] as hit_ids.


hit_ids are not sequential!

we will use service_id for these, which will be effectively random within
a search
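To be concrete, the pattern Mikkel suggests would look roughly like this
in python-dbus (bus name and object path as in the xesam spec draft; the
field name is only illustrative) - and it only works if hit ids are
sequential, which is exactly the point in dispute:

    import dbus

    bus = dbus.SessionBus()
    proxy = bus.get_object('org.freedesktop.xesam.searcher',
                           '/org/freedesktop/xesam/searcher/main')
    search = dbus.Interface(proxy, 'org.freedesktop.xesam.Search')

    def get_page(search_handle, page, page_size=10):
        # only valid if hit ids are sequential integers starting at 0 -
        # the assumption disputed above
        first = page * page_size
        hit_ids = list(range(first, first + page_size))
        # GetHitData (in s search_handle, in au hit_ids, in as fields,
        #             out aav hits)
        return search.GetHitData(search_handle, hit_ids, ['xesam:url'])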


> 
>         that's inefficient - pulling 10,000 hits over dbus is insanely
>         slow (even just the URI)
> 
> Hmmm, how slow is "insanely slow"? I doubt that this is true (by my
> standards of insanely slow).
> 
>         Paging is a must-have in my book, otherwise the tracker api will
>         have to be used a lot instead of xesam whenever paged results
>         are desired (more likely we will add paged search to xesam on
>         top of the standard)
> 
> With a seekable API paging is easy to implement on the client.


we don't want to implement this on the client!

this is server-based paging
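Roughly, what is wanted is one extra server-side call along these lines
(GetHitsPaged is a made-up name, not in the spec), so that only one page
of hits ever crosses the bus:

    # hypothetical addition to org.freedesktop.xesam.Search - NOT in the
    # spec:
    #   GetHitsPaged (in s search_handle, in u offset, in u count,
    #                 out aav hits)
    # the server resolves the offset itself, so only `count` hits cross
    # dbus

    def show_page(search, search_handle, page, page_size=10):
        hits = search.GetHitsPaged(search_handle, page * page_size,
                                   page_size)
        for hit in hits:
            print(hit[0])  # first requested field of each hit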


> 
> Cheers,
> Mikkel
> 
>         On Tue, 2008-05-06 at 17:12 +0200, Mikkel Kamstrup Erlandsen wrote:
>         > 2008/5/6 Jamie McCracken <jamie.mccrack at googlemail.com>:
>         >
>         >         On Tue, 2008-05-06 at 16:57 +0200, Mikkel Kamstrup Erlandsen wrote:
>         >         > 2008/5/2 Mikkel Kamstrup Erlandsen <mikkel.kamstrup at gmail.com>:
>         >         >         I have a handful of comments about this (Jos
>         >         >         also asked about the same on IRC recently).
>         >         >         It was in fact a design decision, but I am
>         >         >         writing this from my mobile since I'm on
>         >         >         holiday, so I'll elaborate when I get home
>         >         >         Tuesday.
>         >         >
>         >         >         Cheers,
>         >         >         Mikkel
>         >         >
>         >         > As promised...
>         >         >
>         >         > Let's first establish some terminology. A Paged
>         >         > Model is one where you can request hits with an
>         >         > offset and a count. A Streaming Model is one like we
>         >         > have now, where you specify how many hits to read on
>         >         > each request and then read hits sequentially (like
>         >         > file reading without seeking).
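(To make the two models concrete: GetHits below is the existing streaming
call; GetHitsAt is a hypothetical paged call, shown only for contrast.
`search` and `handle` are a proxy interface and search handle as in the
earlier sketch.)

    # streaming model (current spec): the server keeps a cursor and each
    # call returns the next batch, like sequential file reads
    batch1 = search.GetHits(handle, 10)        # hits 0-9
    batch2 = search.GetHits(handle, 10)        # hits 10-19

    # paged model (the proposal): the client names an offset explicitly;
    # GetHitsAt is hypothetical, used here only for contrast
    page = search.GetHitsAt(handle, 1000, 10)  # hits 1000-1009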
>         >         >
>         >         > It should be noted that the Xesam Search spec is
>         >         > designed for desktop search (and not generic search
>         >         > on a database or Google-style web search with
>         >         > millions of hits). Furthermore it should be feasible
>         >         > to implement in a host of different backends, not
>         >         > just full-fledged search engines.
>         >         >
>         >         > There are basically three backends where a paged
>         >         > model can be problematic: web services, aggregated
>         >         > searches, and grep/find-like implementations.
>         >         >
>         >         >  * Web services. While Google's GData Query API does
>         >         > allow paging, not all web services do. For example
>         >         > the OAI-PMH[1] standard does not do paging, merely
>         >         > sequential reading. Of course OAI-PMH is a standard
>         >         > for harvesting metadata, but I could imagine a
>         >         > "search engine" extracting metadata from the OAI-PMH
>         >         > result on the fly.
>         >         >
>         >         >  * Aggregated search. Consider a setup where the
>         >         > Xesam search engine is proxying a collection of
>         >         > other search engines. It is a classical problem to
>         >         > look up hits 1000-1010 in this setup. The search
>         >         > engine will have to retrieve the first 1010 hits
>         >         > from all sub-search engines to get it right. Maybe
>         >         > there is a cleverer algorithm, but I have not heard
>         >         > of one. This is of course also a problem in a
>         >         > streaming model, but it will not trick developers
>         >         > into believing that GetHits(s, 1000, 1010) is a
>         >         > cheap call.
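(A toy merge shows the cost described above: to serve hits 1000-1009, an
aggregating proxy must consume the first 1010 merged hits, so each
sub-engine may have to supply up to 1010 of them. The sketch assumes each
engine yields (rank, hit) tuples, best rank number first.)

    import heapq

    def aggregated_page(engines, offset, count):
        # engines: iterables of (rank, hit) tuples, lowest rank first,
        # one per sub-search engine
        merged = heapq.merge(*engines)  # lazy k-way merge of the streams
        page = []
        for i, hit in enumerate(merged):
            if i >= offset + count:
                break          # can only stop AFTER offset+count hits
            if i >= offset:
                page.append(hit)
        return page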
>         >         >
>         >         >  * Grep-like backends, or more generally backends
>         >         > where the search results will roll in sequentially.
>         >         >
>         >         > I think it is a bad time to break the API like this.
>         >         > It is in fact a quite big break if you ask me, since
>         >         > our current approach has been stream-based and what
>         >         > you propose is changing the paradigm to a page-based
>         >         > model. It is also bad because it is the wrong signal
>         >         > to send with such an important change at the last
>         >         > minute.
>         >         >
>         >         > I see a few API-stable alternatives though.
>         >         >
>         >         > 1) Add a SeekHit(in s search, in i hit_id, out i
>         >         > new_pos). This basically adds a cursoring mechanism
>         >         > to the API
>         >         > 2) In the style of 1) but lighter - add SkipHits(in
>         >         > s search, in i count, out i new_pos)
>         >         >
>         >         > These options also stay within the standard
>         >         > streaming terminology. We could make them optional
>         >         > by making them throw exceptions unless the (new)
>         >         > session property vendor.paging is True.
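(For illustration, option 2 would let a client emulate a page fetch in
two round trips. SkipHits here follows the proposed signature above and
is not in the current spec; the sketch assumes a fresh cursor at
position 0.)

    def get_page_streaming(search, handle, offset, page_size):
        # skip the cursor forward, then read one batch;
        # SkipHits(in s search, in i count, out i new_pos) is the
        # proposed call above, not part of the current spec
        new_pos = search.SkipHits(handle, offset)
        if new_pos < offset:
            return []  # fewer than `offset` hits exist
        return search.GetHits(handle, page_size)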
>         >         >
>         >         > As Jos also points out later in the thread,
>         >         > GetHitData is actually paging, and the workaround he
>         >         > describes can actually be made very efficient since
>         >         > we already have the hit.fields.extended session prop
>         >         > to hint what properties we will fetch.
>         >         >
>         >         > Let me make it clear that I am not refusing the
>         >         > change to a paging model if that is what the
>         >         > majority decides. We should just make an informed
>         >         > decision that we are sure we agree on.
>         >         >
>         >
>         >         I'm proposing adding new api, not breaking existing
>         >         ones. The existing stuff can easily emulate paging if
>         >         it lacks native support.
>         >
>         >         I would prefer a new api that takes a start-point
>         >         param and a count/length param so we have full random
>         >         access.
>         >
>         > And how is GetHitData not good enough for that?
>         >
>         > Cheers,
>         > Mikkel
>         >
> 


