[Xesam] Need paged search mode for xesam

Mikkel Kamstrup Erlandsen mikkel.kamstrup at gmail.com
Tue May 6 07:57:22 PDT 2008


2008/5/2 Mikkel Kamstrup Erlandsen <mikkel.kamstrup at gmail.com>:

> I have a handful of comments about this (Jos also asked about the same on
> IRC recently).
> It was in fact a design decision, but I am writing this from my mobile
> since I'm on holiday, so I'll elaborate when I get home Tuesday.
>
> Cheers,
> Mikkel
>

As promised...

Let's first establish some terminology. A Paged Model is one where you can
request hits with an offset and a count. A Streaming Model is the one we
have now, where you specify how many hits to read on each request and then
read hits sequentially (like reading a file without seeking).
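To make the distinction concrete, here is a small illustrative sketch. The
method names get_hits/read_hits are made up for the example and are not
spec names:

```python
class PagedSource:
    """Paged Model: any (offset, count) window on demand."""
    def __init__(self, hits):
        self._hits = hits

    def get_hits(self, offset, count):
        return self._hits[offset:offset + count]


class StreamingSource:
    """Streaming Model: sequential reads only, like a file without seek."""
    def __init__(self, hits):
        self._it = iter(hits)

    def read_hits(self, count):
        batch = []
        for _ in range(count):
            try:
                batch.append(next(self._it))
            except StopIteration:
                break
        return batch


paged = PagedSource(list(range(100)))
stream = StreamingSource(list(range(100)))

page = paged.get_hits(50, 5)    # jump straight to the window at offset 50
skipped = stream.read_hits(50)  # must read (and discard) 50 hits first...
window = stream.read_hits(5)    # ...before reaching the same window
```

The point is only the shape of the calls: the paged source can seek, the
streaming source has to consume everything before the window.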

It should be noted that the Xesam Search spec is designed for desktop search
(not generic search on a database or Google-style web search with millions
of hits). Furthermore, it should be feasible to implement in a host of
different backends, not just full-fledged search engines.

There are basically three kinds of backends where a paged model can be
problematic: web services, aggregated searches, and Grep/Find-like
implementations.

 * Web services. While Google's GData Query API does allow paging, not all
web services do. For example, the OAI-PMH[1] standard does not do paging,
merely sequential reading. Of course OAI-PMH is a standard for harvesting
metadata, but I could imagine a "search engine" extracting metadata from
the OAI-PMH result on the fly.

 * Aggregated search. Consider a setup where the Xesam search engine is
proxying a collection of other search engines. Looking up hits 1000-1010 in
this setup is a classical problem: the search engine has to retrieve the
first 1010 hits from all sub-search engines to get it right. Maybe there is
an algorithm that does this more cleverly, but I have not heard of one. This
is of course also a problem in a streaming model, but there it will not
trick developers into believing that GetHits(s, 1000, 1010) is a cheap call.

 * Grep-like backends, or more generally backends where the search results
roll in sequentially.
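The aggregated-search cost above can be sketched roughly like this. The
merge over two fake sub-engines and all the names here are made up purely
for illustration:

```python
import heapq

def merged_page(backends, offset, count):
    """backends: lists of (score, hit) pairs, each sorted by descending score."""
    needed = offset + count
    # Each backend must hand over its top `needed` hits before we can
    # know which of them fall inside the requested window of the merged list.
    tops = [b[:needed] for b in backends]
    merged = heapq.merge(*tops, key=lambda sh: sh[0], reverse=True)
    page = list(merged)[offset:offset + count]
    return page, sum(len(t) for t in tops)

# Two fake sub-engines with interleaving scores.
a = [(100.0 - i, ('a', i)) for i in range(2000)]
b = [(99.5 - i, ('b', i)) for i in range(2000)]

page, fetched = merged_page([a, b], 1000, 10)
# Serving just 10 hits required pulling 1010 hits from *each* backend,
# which is exactly why GetHits at a large offset is not a cheap call here.
```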

I think it is a bad time to break the API like this. It is in fact quite a
big break if you ask me, since our current approach has been stream-based,
and what you propose changes the paradigm to a page-based model. It also
sends the wrong signal to make such an important change at the last minute.

I see a few API-stable alternatives though.

1) Add SeekHit(in s search, in i hit_id, out i new_pos). This basically
adds a cursoring mechanism to the API.
2) In the style of 1) but lighter - add SkipHits(in s search, in i count,
out i new_pos).

These options also stay within the standard streaming terminology. We could
make them optional by making them throw exceptions if the (new) session
property vendor.paging is True.
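As a rough illustration of option 2), a client could emulate "give me page
N" on top of the stream. The SkipHits/GetHits shapes below are only a
sketch of the proposal, not final spec, and the in-process class stands in
for the D-Bus service:

```python
class StreamingSearch:
    """Stand-in for a Xesam search session with a sequential cursor."""
    def __init__(self, hits):
        self._hits = hits
        self._pos = 0

    def skip_hits(self, count):
        """SkipHits: advance the cursor without shipping hits over the bus."""
        self._pos = min(self._pos + count, len(self._hits))
        return self._pos  # new_pos, as in the proposal

    def get_hits(self, count):
        """Sequential read of the next `count` hits."""
        batch = self._hits[self._pos:self._pos + count]
        self._pos += len(batch)
        return batch


def get_page(search, page, page_size):
    # Assumes the cursor is at position 0 when called.
    search.skip_hits(page * page_size)
    return search.get_hits(page_size)


s = StreamingSearch(list(range(100)))
page2 = get_page(s, 2, 10)  # the third page of ten hits
```

The skip itself stays cheap for the transport even when the backend still
has to walk past the skipped hits internally.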

As Jos also points out later in the thread, GetHitData is actually paging,
and the workaround he describes can be made very efficient since we already
have the hit.fields.extended session property to hint which properties we
will fetch.
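A hedged sketch of that workaround: stream only lightweight hit IDs, then
use GetHitData-style random access to fetch full fields for just the page
the GUI is showing. Method names follow the spec only loosely, and the
in-memory store is a stand-in for a real engine:

```python
class Search:
    def __init__(self, store):
        self._store = store       # hit_id -> {field: value}
        self._ids = sorted(store)
        self._pos = 0

    def get_hits(self, count):
        """Streaming read that returns only hit IDs (cheap over the bus)."""
        batch = self._ids[self._pos:self._pos + count]
        self._pos += len(batch)
        return batch

    def get_hit_data(self, hit_ids, fields):
        """Random-access fetch of selected fields: paging in disguise."""
        return [[self._store[h][f] for f in fields] for h in hit_ids]


store = {i: {'uri': 'file:///doc%d' % i, 'title': 'Doc %d' % i}
         for i in range(50)}
s = Search(store)

ids = s.get_hits(50)                    # stream all the IDs up front
visible = ids[20:30]                    # the page the GUI displays
rows = s.get_hit_data(visible, ['title'])
```

Only the visible page's fields ever cross the bus, which is where the
efficiency comes from.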

Let me make it clear that I am not refusing the change to a paging model if
that is what the majority decides. We should just make an informed decision
that we all agree on.

Cheers,
Mikkel

[1]: http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm


>
> 2008/5/2, Jamie McCracken <jamie.mccrack at googlemail.com>:
> > For a search GUI it is essential to page results, using an offset and
> > limit to define the page size.
> >
> > Currently the Xesam API lacks the offset component (although it has a limit).
> >
> > there are several workarounds:
> >
> > 1) add a hit.offset property
> > 2) add new api : GetPagedHits (in string search, in int PageStart, in
> > int PageEnd, out aav results) or similar
> > 3) add a hit.pagesize property and have GetNextpage/getPrevPage methods
> >
> > Anyway, we desperately need this to make things fast; otherwise putting a
> > huge result set over D-Bus is going to be awfully slow.
> >
> > jamie
> >
> >
> >
> > _______________________________________________
> > Xesam mailing list
> > Xesam at lists.freedesktop.org
> > http://lists.freedesktop.org/mailman/listinfo/xesam
> >
>