[Xesam] Need paged search mode for xesam
Mikkel Kamstrup Erlandsen
mikkel.kamstrup at gmail.com
Tue May 6 08:26:17 PDT 2008
2008/5/6 Jamie McCracken <jamie.mccrack at googlemail.com>:
> you mean pull results over dbus and then page at client?
>
No. The signature of GetHitData is (in s search_handle, in au hit_ids, in as
fields, out aav hits).
I.e. you request which hit ids to fetch. To fetch a page, pass [n, n+1, ...,
n+page_size-1] as hit_ids.
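
To illustrate the page arithmetic, here is a rough sketch. It assumes hit ids are consecutive integers starting at 0; page_hit_ids is a hypothetical helper, not part of the Xesam spec, and the actual D-Bus call to GetHitData is left out:

```python
def page_hit_ids(page, page_size, total_hits=None):
    """Return the hit ids for a zero-based page number.

    Hypothetical helper: hit ids are assumed to be consecutive
    integers starting at 0, matching the 'au hit_ids' argument.
    """
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size must be > 0")
    start = page * page_size
    stop = start + page_size
    if total_hits is not None:
        # Do not ask for ids beyond the end of the result set.
        stop = min(stop, total_hits)
    return list(range(start, stop))
```

The returned list would then be passed as the hit_ids argument, together with the search handle and the wanted fields.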
>
> thats inefficient - pulling 10,000 hits over dbus is insanely slow (even
> just the URI)
>
Hmmm, how slow is "insanely slow"? I doubt that this is true (by my
standards of insanely slow).
>
> Paging is a must have in my book otherwise tracker api will have to be
> used a lot instead of xesam whenever paged results are desired (more
> likely we will add Paged search to xesam on top of the standard)
>
With a seekable API paging is easy to implement on the client.
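
A minimal sketch of what I mean, using an in-memory stand-in instead of a real D-Bus session. FakeSeekableSearch, seek(), get_hits() and get_page() are all hypothetical names for illustration; seek() plays the role of a cursor-positioning call and get_hits() the role of the sequential read:

```python
class FakeSeekableSearch:
    """In-memory stand-in for a search session with a read cursor.

    Hypothetical: not a real Xesam binding, only here to show the
    client-side logic.
    """

    def __init__(self, hits):
        self._hits = list(hits)
        self._pos = 0

    def seek(self, hit_id):
        # Clamp the cursor to the valid range and report where it landed.
        self._pos = max(0, min(hit_id, len(self._hits)))
        return self._pos

    def get_hits(self, count):
        # Sequential read: return up to `count` hits and advance the cursor.
        chunk = self._hits[self._pos:self._pos + count]
        self._pos += len(chunk)
        return chunk


def get_page(search, page, page_size):
    """Client-side paging: seek to the page start, then read one page."""
    search.seek(page * page_size)
    return search.get_hits(page_size)
```

With this, get_page(search, 1, 10) returns hits 10 to 19 no matter where the cursor was before the call.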
Cheers,
Mikkel
>
> On Tue, 2008-05-06 at 17:12 +0200, Mikkel Kamstrup Erlandsen wrote:
> > 2008/5/6 Jamie McCracken <jamie.mccrack at googlemail.com>:
> >
> >
> > On Tue, 2008-05-06 at 16:57 +0200, Mikkel Kamstrup Erlandsen wrote:
> > > 2008/5/2 Mikkel Kamstrup Erlandsen <mikkel.kamstrup at gmail.com>:
> > > I have a handful of comments about this (Jos also asked about the
> > > same on IRC recently).
> > > It was in fact a design decision, but I am writing this from my
> > > mobile since I'm on holiday, so I'll elaborate when I get home
> > > Tuesday.
> > >
> > > Cheers,
> > > Mikkel
> > >
> > > As promised...
> > >
> > > Let's first establish some terminology. A Paged Model is one where
> > > you can request hits with an offset and a count. A Streaming Model
> > > is one like we have now, where you specify how many hits to read on
> > > each request and then read hits sequentially (like file reading
> > > without seeking).
> > >
> > > It should be noted that the Xesam Search spec is designed for
> > > desktop search (and not generic search on a database or Google-style
> > > web search with millions of hits). Furthermore it should be feasible
> > > to implement in a host of different backends, not just full-fledged
> > > search engines.
> > >
> > > There are basically three backends where a paged model can be
> > > problematic: web services, aggregated searches, and Grep/Find-like
> > > implementations.
> > >
> > > * Web services. While Google's GData Query API does allow paging,
> > > not all web services do this. For example the OAI-PMH[1] standard
> > > does not do paging, merely sequential reading. Of course OAI-PMH is
> > > a standard for harvesting metadata, but I could imagine a "search
> > > engine" extracting metadata from the OAI-PMH result on the fly.
> > >
> > > * Aggregated search. Consider a setup where the Xesam search engine
> > > is proxying a collection of other search engines. It is a classic
> > > problem to look up hits 1000-1010 in this setup. The search engine
> > > will have to retrieve the first 1010 hits from all sub-search
> > > engines to get it right. Maybe there is a clever algorithm that
> > > does this more efficiently, but I have not heard of it. This is of
> > > course also a problem in a streaming model, but a streaming model
> > > will not trick developers into believing that GetHits(s, 1000, 1010)
> > > is a cheap call.
> > >
> > > * Grep-like backends, or more generally backends where the search
> > > results will roll in sequentially.
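
To make the cost in the aggregated case concrete, here is a minimal sketch. It assumes each sub-engine returns (score, uri) tuples sorted by descending score and that scores are comparable across engines; both are assumptions for illustration, and merged_page is a hypothetical helper:

```python
import heapq
from itertools import islice


def merged_page(engines, offset, count):
    """Return hits [offset, offset + count) of the merged ranking.

    `engines` is a list of lists of (score, uri) tuples, each already
    sorted by descending score. Also returns how many hits had to be
    pulled from the sub-engines to produce the page.
    """
    needed = offset + count
    # Any single engine might contribute all of the top `needed` hits,
    # so each one has to deliver its first `needed` hits.
    prefixes = [hits[:needed] for hits in engines]
    fetched = sum(len(p) for p in prefixes)
    merged = heapq.merge(*prefixes, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, offset, needed)), fetched
```

The second return value grows with offset + count for every sub-engine, which is the hidden cost a cheap-looking paged call would mask.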
> > >
> > > I think it is a bad time to break the API like this. It is in fact
> > > quite a big break if you ask me, since our current approach has been
> > > stream-based and what you propose is changing the paradigm to a
> > > page-based model. It is also bad because it sends the wrong signal
> > > to make such an important change at the last minute.
> > >
> > > I see a few API-stable alternatives though.
> > >
> > > 1) Add a SeekHit(in s search, in i hit_id, out i new_pos). This
> > > basically adds a cursoring mechanism to the API.
> > > 2) In the style of 1) but lighter: add SkipHits(in s search, in i
> > > count, out i new_pos).
> > >
> > > These options also stay within the standard streaming terminology.
> > > We could make them optional by making them throw exceptions unless
> > > the (new) session property vendor.paging is True.
> > >
> > > As Jos also points out later in the thread, GetHitData is actually
> > > paging, and the workaround he describes can be made very efficient
> > > since we already have the hit.fields.extended session prop to hint
> > > what properties we will fetch.
> > >
> > > Let me make it clear that I am not refusing the change to a paging
> > > model if that is what the majority decides. We should just make an
> > > informed decision that we are sure we agree on.
> > >
> >
> >
> >
> > I'm proposing adding new API, not breaking existing ones. The existing
> > stuff can easily emulate paging if it lacks native support.
> >
> > I would prefer a new API that takes a start point param and a
> > count/length param so we have full random access.
> >
> > And how is GetHitData not good enough for that?
> >
> > Cheers,
> > Mikkel
> >
>
>