[Xesam] Need paged search mode for xesam

Tue May 6 08:26:17 PDT 2008

2008/5/6 Jamie McCracken <jamie.mccrack at googlemail.com>:

> you mean pull results over dbus and then page at client?
>

No. The signature of GetHitData is (in s search_handle, in au hit_ids, in as
fields, out aav hits)

I.e. you request which hit ids to fetch. To fetch a page of page_size hits
starting at hit n, pass [n, n+1, ..., n+page_size-1] as hit_ids.
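
A rough sketch of how a client could work out the ids for one page before
handing them to GetHitData. The helper name page_hit_ids is made up for
illustration, and I am assuming hit ids are consecutive integers starting at 0
and that the client already knows the current hit count:

```python
def page_hit_ids(page, page_size, hit_count):
    """Return the hit ids for a zero-based page, clamped to hit_count.

    Assumption (not from the spec text above): hit ids are the consecutive
    integers 0 .. hit_count - 1.
    """
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size must be > 0")
    start = page * page_size
    end = min(start + page_size, hit_count)
    # range() is empty when start >= end, i.e. for a page past the last hit.
    return list(range(start, end))
```

The returned list is what would go into the hit_ids argument, so only the hits
on the requested page travel over the bus.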

>
> that's inefficient - pulling 10,000 hits over dbus is insanely slow (even
> just the URI)
>

Hmmm, how slow is "insanely slow"? I doubt that this is true (by my
standards of insanely slow).

>
> Paging is a must have in my book otherwise tracker api will have to be
> used a lot instead of xesam whenever paged results are desired (more
> likely we will add Paged search to xesam on top of the standard)
>

With a seekable API paging is easy to implement on the client.
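
Something along these lines, as a sketch only. Here get_hit_data is a
hypothetical callable standing in for the D-Bus GetHitData call with the search
handle already bound, and iter_pages is an illustrative name, not part of the
spec. Hit ids are assumed to run from 0 to hit_count - 1:

```python
def iter_pages(get_hit_data, fields, page_size, hit_count):
    """Yield one page of hit data at a time, fetching each page on demand.

    get_hit_data(hit_ids, fields) is a stand-in for GetHitData; nothing is
    fetched until the consumer asks for the next page.
    """
    if page_size <= 0:
        raise ValueError("page_size must be > 0")
    for start in range(0, hit_count, page_size):
        end = min(start + page_size, hit_count)
        yield get_hit_data(list(range(start, end)), fields)
```

A UI would pull the next page from the generator when the user scrolls or
clicks "next", so hits that are never displayed are never requested.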

Cheers,
Mikkel

>
> On Tue, 2008-05-06 at 17:12 +0200, Mikkel Kamstrup Erlandsen wrote:
> > 2008/5/6 Jamie McCracken <jamie.mccrack at googlemail.com>:
> >
> >
> >         On Tue, 2008-05-06 at 16:57 +0200, Mikkel Kamstrup Erlandsen
> >         wrote:
> >         > 2008/5/2 Mikkel Kamstrup Erlandsen
> >         <mikkel.kamstrup at gmail.com>:
> >         >         I have a handful of comments about this (Jos also asked
> >         about the
> >         >         same on
> >         >         IRC recently).
> >         >         It was in fact a design decision, but i am writing
> >         this from
> >         >         my mobile
> >         >         since I'm on holiday, so I'll elaborate when I get
> >         home
> >         >         tuesday.
> >         >
> >         >         Cheers,
> >         >         Mikkel
> >         >
> >         > As promised...
> >         >
> >         > Let's first establish some terminology. A Paged Model is one
> >         where you
> >         > can request hits with an offset and a count. A Streaming
> >         Model is one
> >         > like we have now, where you specify how many hits to read on
> >         each
> >         > request and then read hits sequentially (like file reading
> >         without
> >         > seeking).
> >         >
> >         > It should be noted that the Xesam Search spec is designed
> >         for desktop
> >         > search (and not generic search on a database or Google-style
> >         web
> >         > search with millions of hits). Furthermore it should be
> >         feasible to
> >         > implement in a host of different backends, not just full
> >         fledged
> >         > search engines.
> >         >
> >         > There are basically three backends where a paged model can
> >         be
> >         > problematic. Web services, Aggregated searches, and
> >         Grep/Find-like
> >         > implementations.
> >         >
> >         >  * Web services. While Google's GData Query API does allow
> >         paging, not
> >         > all web services do this. For example the OAI-PMH[1]
> >         standard does
> >         > not do paging, merely sequential reading. Of course OAI-PMH
> >         is a
> >         > standard for harvesting metadata, but I could imagine a
> >         "search
> >         > engine" extracting metadata from the OAI-PMH result on the
> >         fly.
> >         >
> >         >  * Aggregated search. Consider a setup where the Xesam
> >         search engine
> >         > is proxying a collection of other search engines. It is a
> >         classical
> >         > problem to look up hits 1000-1010 in this setup. The search
> >         engine
> >         > will have to retrieve the first 1010 hits from all
> >         sub-search engines
> >         > to get it right. Maybe there is a clever algorithm to do
> >         this more
> >         > efficiently, but I have not heard of it. This is of course also
> >         a problem
> >         > in a streaming model, but it will not trick developers into
> >         believing
> >         > that GetHits(s, 1000, 1010) is a cheap call.
> >         >
> >         >  * Grep-like backends or more generally backends where the
> >         search
> >         > results will roll in sequentially.
> >         >
> >         > I think it is a bad time to break the API like this. It is
> >         in fact a
> >         > quite big break if you ask me, since our current approach
> >         has been
> >         > stream-based and what you propose is changing the paradigm
> >         to a page
> >         > based model. Also bad because it is the wrong signal to send
> >         with such
> >         > an important change at the last minute.
> >         >
> >         > I see a few API-stable alternatives though.
> >         >
> >         > 1) Add a SeekHit(in s search, in i hit_id, out i new_pos).
> >         This
> >         > basically adds a cursoring mechanism to the API
> >         > 2) In style of 1) but lighter - add SkipHits(in s search, in
> >         i count,
> >         > out i new_pos)
> >         >
> >         > These options also stay within the standard streaming
> >         terminology. We
> >         > could make them optional by making them throw exceptions if
> >         the (new)
> >         > session property vendor.paging is True.
> >         >
> >         > As Jos also points out later in the thread GetHitData is
> >         actually
> >         > paging and the workaround he describes can actually be made
> >         very
> >         > efficient since we already have the hit.fields.extended
> >         session prop
> >         > to hint what properties we will fetch.
> >         >
> >         > Let me make it clear that I am not refusing the change to a
> >         paging
> >         > model if that is what the majority rules. We should just
> >         make an
> >         > informed decision that we are sure we agree on.
> >         >
> >
> >
> >
> >         im proposing adding new api not breaking existing ones. The
> >         existing
> >         stuff can easily emulate paging if it lacks native support
> >
> >         I would prefer new api that takes a start point param and a
> >         count/length
> >         param so we have full random access
> >
> > And how is GetHitData not good enough for that?
> >
> > Cheers,
> > Mikkel
> >
>
>