[Xesam] Need paged search mode for xesam
Jamie McCracken
jamie.mccrack at googlemail.com
Tue May 6 08:20:15 PDT 2008
You mean pull the results over D-Bus and then page at the client?
That's inefficient: pulling 10,000 hits over D-Bus is insanely slow (even
just the URIs).
Paging is a must-have in my book; otherwise the Tracker API will have to be
used a lot instead of Xesam whenever paged results are desired (more
likely we will add paged search to Xesam on top of the standard).
jamie
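
A rough sketch of the cost difference being argued about here. StreamingSearch is a hypothetical in-memory stand-in for a search session, not the Xesam D-Bus interface; get_hits reads sequentially, and skip_hits models the SkipHits call suggested in the quoted message below. The transferred counter is an assumption that stands in for hits crossing the bus.

```python
class StreamingSearch:
    """Hypothetical in-memory stand-in for a streaming search session."""

    def __init__(self, hits):
        self._hits = list(hits)
        self._pos = 0          # cursor into the result stream
        self.transferred = 0   # hits that crossed the (simulated) bus

    def get_hits(self, count):
        """Read the next `count` hits sequentially, like reading a file."""
        batch = self._hits[self._pos:self._pos + count]
        self._pos += len(batch)
        self.transferred += len(batch)
        return batch

    def skip_hits(self, count):
        """Advance the cursor without transferring hits; return the new position."""
        self._pos = min(self._pos + count, len(self._hits))
        return self._pos


def page_by_reading(search, offset, count):
    """Page on a pure streaming API: read and discard the first `offset` hits."""
    search.get_hits(offset)
    return search.get_hits(count)


def page_by_skipping(search, offset, count):
    """Page with a skip call: only the requested hits are transferred."""
    search.skip_hits(offset)
    return search.get_hits(count)


if __name__ == "__main__":
    uris = ["file:///doc%d" % i for i in range(10000)]
    a, b = StreamingSearch(uris), StreamingSearch(uris)
    page_by_reading(a, 1000, 10)
    page_by_skipping(b, 1000, 10)
    print(a.transferred, b.transferred)
```

Both helpers assume a fresh session with the cursor at the start. They return the same page; the difference is only in how many hits had to be moved to get there.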
On Tue, 2008-05-06 at 17:12 +0200, Mikkel Kamstrup Erlandsen wrote:
> 2008/5/6 Jamie McCracken <jamie.mccrack at googlemail.com>:
>
>
> On Tue, 2008-05-06 at 16:57 +0200, Mikkel Kamstrup Erlandsen wrote:
> > 2008/5/2 Mikkel Kamstrup Erlandsen <mikkel.kamstrup at gmail.com>:
> > I have a handful of comments about this (Jos also asked about the
> > same on IRC recently).
> > It was in fact a design decision, but I am writing this from my
> > mobile since I'm on holiday, so I'll elaborate when I get home
> > Tuesday.
> >
> > Cheers,
> > Mikkel
> >
> > As promised...
> >
> > Let's first establish some terminology. A Paged Model is one where you
> > can request hits with an offset and a count. A Streaming Model is one
> > like we have now, where you specify how many hits to read on each
> > request and then read hits sequentially (like file reading without
> > seeking).
> >
> > It should be noted that the Xesam Search spec is designed for desktop
> > search (and not generic search on a database or Google-style web
> > search with millions of hits). Furthermore it should be feasible to
> > implement in a host of different backends, not just full fledged
> > search engines.
> >
> > There are basically three backends where a paged model can be
> > problematic: web services, aggregated searches, and Grep/Find-like
> > implementations.
> >
> > * Web services. While Google's GData Query API does allow paging, not
> > all web services do this. For example the OAI-PMH[1] standard does
> > not do paging, merely sequential reading. Of course OAI-PMH is a
> > standard for harvesting metadata, but I could imagine a "search
> > engine" extracting metadata from the OAI-PMH result on the fly.
> >
> > * Aggregated search. Consider a setup where the Xesam search engine
> > is proxying a collection of other search engines. It is a classical
> > problem to look up hits 1000-1010 in this setup. The search engine
> > will have to retrieve the first 1010 hits from all sub-search engines
> > to get it right. Maybe there is a cleverer algorithm for this, but I
> > have not heard of it. This is of course also a problem in a streaming
> > model, but it will not trick developers into believing that
> > GetHits(s, 1000, 1010) is a cheap call.
> >
> > * Grep-like backends or more generally backends where the search
> > results will roll in sequentially.
> >
> > I think it is a bad time to break the API like this. It is in fact a
> > quite big break if you ask me, since our current approach has been
> > stream-based and what you propose is changing the paradigm to a
> > page-based model. It is also bad because it sends the wrong signal to
> > make such an important change at the last minute.
> >
> > I see a few API-stable alternatives though.
> >
> > 1) Add a SeekHit(in s search, in i hit_id, out i new_pos). This
> > basically adds a cursoring mechanism to the API
> > 2) In style of 1) but lighter - add SkipHits(in s search, in i count,
> > out i new_pos)
> >
> > These options also stay within the standard streaming terminology. We
> > could make them optional by making them throw exceptions if the (new)
> > session property vendor.paging is True.
> >
> > As Jos also points out later in the thread GetHitData is actually
> > paging and the workaround he describes can actually be made very
> > efficient since we already have the hit.fields.extended session prop
> > to hint what properties we will fetch.
> >
> > Let me make it clear that I am not refusing the change to a paging
> > model if that is what the majority rules. We should just make an
> > informed decision that we are sure we agree on.
> >
>
>
>
> I'm proposing adding new API, not breaking existing ones. The existing
> stuff can easily emulate paging if it lacks native support.
>
> I would prefer new API that takes a start point param and a count/length
> param so we have full random access.
>
> And how is GetHitData not good enough for that?
>
> Cheers,
> Mikkel
>
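
The aggregated-search point in the quoted message can be illustrated with a small merge. This is only a sketch under stated assumptions: each sub-engine is modelled as a Python list of (score, uri) tuples already sorted by descending score, and aggregated_page is a hypothetical helper, not part of any Xesam implementation. It shows why a page at offset 1000 still costs offset + count hits from every sub-engine.

```python
import heapq
from itertools import islice


def aggregated_page(engines, offset, count):
    """Return (page, fetched) for hits [offset, offset + count) of the merged ranking.

    `engines` is a list of result lists, each sorted by descending score.
    """
    needed = offset + count
    # Any single engine might supply every one of the first `needed` hits,
    # so the proxy has to fetch that many from each of them.
    prefixes = [engine[:needed] for engine in engines]
    fetched = sum(len(prefix) for prefix in prefixes)
    # k-way merge of the sorted prefixes, highest score first.
    merged = heapq.merge(*prefixes, key=lambda hit: -hit[0])
    return list(islice(merged, offset, needed)), fetched


if __name__ == "__main__":
    # Three fake engines whose scores interleave in the global ranking.
    engines = [
        [(6000 - (3 * j + i), "e%d:%d" % (i, j)) for j in range(2000)]
        for i in range(3)
    ]
    page, fetched = aggregated_page(engines, 1000, 10)
    print(len(page), fetched)
```

With three sub-engines, a ten-hit page at offset 1000 pulls 3030 hits in this model, whichever API shape the client sees.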