Simple search API proposal, take 2
Magnus Bergman
magnus.bergman at observer.net
Fri Jan 19 06:24:56 PST 2007
On Thu, 18 Jan 2007 22:15:04 +0100
"Mikkel Kamstrup Erlandsen" <mikkel.kamstrup at gmail.com> wrote:
> About the Simple api:
>
> Picking up on an "old" issue. There was semi-agreement that the
> properties to be retrieved from a Query should be passed to GetHits,
> as GetHits(handle, offset, limit, props).
>
> Each time I look at this it makes me feel very weird. Are you guys
> sure this is what we want? Think about stateless find/grep
> implementations, ultra light search daemons... I would guess that at
> least some servers might benefit greatly from knowing what fields to
> extract as soon as possible.
>
> To this end I re-propose to add the props list to the Query
> constructor. Giving us an api like:
>
> Query (xml, props,handle) --> handle
> GetHits (handle, offset, limit) --> hits
For the simple API this is not a problem, since the search engine can
delay the search until GetHits() is called (which is probably directly
after Query()). But with the live API I see a potential problem. It
isn't possible to wait for a call to GetHits(), since it will not be
called until the search engine has reported a hit. One way would be to
require the search engines you mention to always include all possible
properties (at least when called with the live API). But of course it
might be desirable to optimize things a bit.
But at a theoretical level this problem gets even more delicate, since
it is possible that even the expensive properties need to be known in
advance by the search engine (requesting them later might require
doing the whole search over again). So what would be useful for the
search engine to know is which properties will never be requested and
can therefore be left out to speed up the search.
Despite the somewhat illogical approach, I think it would be more
practical to set this for the session (that is, globally), since the
same properties will likely be requested from every search. So there
could be a function (only for optimization purposes) like:
SessionLimitProperties ( in as properties )
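To illustrate what I mean, here is a rough Python sketch, not a
definitive design. The Session class, its in-memory document list and
the method names are all hypothetical; only the idea of declaring up
front which properties may ever be requested comes from the proposal.

```python
class Session:
    """Hypothetical in-memory stand-in for a search session."""

    def __init__(self, documents):
        # documents: list of dicts mapping property name -> value
        self._documents = documents
        self._allowed = None  # None means no limit has been set

    def session_limit_properties(self, properties):
        # Optimization hint: properties outside this set will never be
        # requested, so the engine may skip extracting them.
        self._allowed = set(properties)

    def get_hits(self, offset, limit, properties):
        # Only properties that are both requested and allowed are returned.
        wanted = [
            p for p in properties
            if self._allowed is None or p in self._allowed
        ]
        return [
            {p: doc[p] for p in wanted if p in doc}
            for doc in self._documents[offset:offset + limit]
        ]
```

In this sketch a property outside the limit is silently dropped from
the hits; a real engine might just as well report an error.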
> The most obvious problem is that it leaves the Snippets to be
> retrieved along side other properties - which we agreed on was a bad
> idea. Since each hit is uniquely determined in a query context by its
> sequence number, we could use that to look up additional hit
> metadata, fx. the snippet. Adding API like so:
>
> GetHitMetadata (in s handle, in ai hits, in as props) --> results
>
> What do you guys say? If you decline again I shall accept it and hold
> my peace regarding this issue. Cheers,
To sum up the problem: there are inexpensive properties and there are
expensive ones (perhaps only snippets, and only in some cases). This
might not be obvious to users of the API if the API itself doesn't
make it obvious. The question is whether there should be a separate
function for the expensive properties, something like:
GetOnlyInexpensiveHitProperties ( in s query_handle, in i offset,
    in i limit, in as properties, out a{sa{sas}} response )
GetExpensiveHitProperties ( in s query_handle, in i offset,
    in i limit, in as properties, out a{sa{sas}} response )
This would make even more sense if the set of properties to include is
specified before the search is started. The functions could then look
something like:
GetOnlyInexpensiveHitProperties ( in s query_handle, in i offset,
    in i limit, out a{sa{sas}} response )
GetExpensiveHitProperties ( in s query_handle, in i offset,
    in i limit, out a{sa{sas}} response )
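A minimal Python sketch of this variant, under several assumptions:
the Query class and its in-memory hit list are hypothetical, the outer
key of the a{sa{sas}} response is assumed to be the hit's URI, and
"snippet" is assumed to be the only expensive property.

```python
# Assumption: only snippets are treated as expensive here.
EXPENSIVE = frozenset({"snippet"})


class Query:
    """Hypothetical query whose property list is fixed at creation."""

    def __init__(self, hits, properties):
        # hits: list of (uri, {property name: [values]}) pairs
        self._hits = hits
        self._properties = list(properties)

    def _collect(self, offset, limit, want_expensive):
        # Build an a{sa{sas}}-shaped response: uri -> property -> values.
        response = {}
        for uri, props in self._hits[offset:offset + limit]:
            response[uri] = {
                name: props[name]
                for name in self._properties
                if name in props and (name in EXPENSIVE) == want_expensive
            }
        return response

    def get_only_inexpensive_hit_properties(self, offset, limit):
        return self._collect(offset, limit, False)

    def get_expensive_hit_properties(self, offset, limit):
        return self._collect(offset, limit, True)
```

The point of the split is that a caller who never asks for the
expensive half never pays for it.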
Another idea would be to have some flag for excluding the expensive
ones. Or perhaps a function which separates the expensive ones from the
inexpensive ones, like:
AppriseProperties ( in as properties, out as inexpensive_properties,
out as expensive_properties )
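As a rough sketch of how such a function might behave (the Python
name and the hard-coded set are assumptions for illustration; a real
engine would decide for itself what is expensive):

```python
# Assumption: which properties count as expensive is engine-specific;
# "snippet" is used only because this discussion singles it out.
EXPENSIVE_PROPERTIES = frozenset({"snippet"})


def apprise_properties(properties, expensive=EXPENSIVE_PROPERTIES):
    """Split properties into (inexpensive, expensive), keeping order."""
    inexpensive_properties = [p for p in properties if p not in expensive]
    expensive_properties = [p for p in properties if p in expensive]
    return inexpensive_properties, expensive_properties
```

A client could call this once and then decide whether to request the
expensive properties at all.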
Or perhaps it's enough if the API documentation mentions that snippets
are probably expensive. I think it's a bad idea to call some
properties something else (like metadata) just because they might be
expensive; I just find that confusing.
I'm not sure which idea I personally prefer; I have to think a bit
about it.