Simple search API proposal, take 2

Wed Jan 10 07:25:53 PST 2007

2007/1/4, Magnus Bergman <magnus.bergman at observer.net>:
> First some comments on the current draft[1]
> """""""""""""""""""""""""""""""""""""""""""
>
>   I think it's a bad idea to use a query-string to identify a search for
>   the following reasons:
>   * It is inefficient to send a (possibly quite long) string for every
>     call.
>   * It isn't logical for the search engine to use the query string to
>     look up the search because a query might generate a different result
>     depending on when the search is started.
>   * An application might create different searches from the same query
>     (string) with different results ("all files created this minute").
>
>   For these reasons I propose to provide a *search handle*
>   (probably just an integer value) for each search that is created.
>
>   From what I read in the discussion it seems problematic to use URIs
>   as persistent identifiers to identify a hit, both for the reasons
>   already mentioned and because a hit is not the same thing as a
>   document. Even if a URI were a persistent identifier for a document, it
>   would be illogical to use it to identify a hit. And because of this
>   and the reasons mentioned above it would be even worse to use a query
>   string and a URI to identify a hit.
>
>   Instead I support the idea of simply using sequence numbers (and a
>   search handle) to identify a hit.
>
>
>
> Highlighting, streaming and snippets
> """"""""""""""""""""""""""""""""""""
>
>   It isn't clear what a snippet is exactly. But my guess is that it is a
>   selected part or summary of the document that demonstrates especially
>   well why it matched, possibly with highlighting. And it isn't
>   stored in the index but dynamically generated. Correct?
>
>   I have brought up the question about a need for a document streaming
>   infrastructure. But now I see that highlighting is to be supported,
>   so document streaming seems to be needed anyway.
>
>   The highlighting cannot be done by the application; it must be done
>   by the search engine. Just highlighting every word from the query
>   string isn't correct. Knowledge from the search engine is needed to
>   get it right. This means that to highlight a document (or a selected
>   part of it) there is no other way than to stream the
>   document through the search engine to the application.
>
>   If snippets are going to be supported it will be easy to also support
>   delivering the whole document highlighted, and even easier to just
>   deliver the whole document.
>
>   Streaming the document means automatically converting it into a
>   requested format (something that the indexer can extract words from
>   or something that an application can show). Doing this is actually no
>   big deal; doing the highlighting is the hard part.
>
>   The benefit of being able to stream documents like this is that the
>   documents don't need to be accessible in a way an application can
>   understand (they are not required to have a URI).
>
>   I'm not saying this is a feature we can't live without. But we
>   practically get it for free if snippets are going to be supported.
>
>
>
> Properties for hits
> """""""""""""""""""
>
>   Hits are not the same thing as documents, so there are really both
>   properties of the hits and properties of the document. The properties
>   of the hits include information on why the document matched the query
>   and a link to the matching document. This link might be kept secret by
>   the search engine, but a URI might be provided as a property of the
>   document. The properties of the document are of course the usual
>   document metadata. Some of these might be stored in the search
>   engine's index, some might be extracted from the document dynamically,
>   but that doesn't matter. The properties belonging to the document (as
>   well as the document itself) can be accessed independently of a
>   search; the ones belonging to the hit cannot.
>
>
>
> The actual proposal
> """""""""""""""""""
>
> ShowConfiguration ( )
>
>     Open a graphical interface for configuring the search tool.
>
>
> NewSearch ( in s query , out i search )
>
>     Start a new search from a query string.
>     * query: The query string to execute.
>     * search: A handle that is used to uniquely identify this search.
>
>
> CountHits ( in i search , out i count )
>
>     Count the number of hits from a particular search. Used for paging
>     and suggestion popups with hit counts.
>     * search: A handle that is used to uniquely identify a search.
>     * count: The number of hits from this search.
>
>
> GetHitProperties ( in i search, in i offset, in i limit,
>                    in as properties, out a{sa{sas}} response )
>
>     Get properties for the given hits. URIs and snippets are just
>     properties.
>     * search: A handle that is used to uniquely identify a search.
>     * offset: The offset in the result list for the first returned
>               result.
>     * limit: The maximum number of results that should be returned.
>     * properties: A list of properties to return. An empty list is a
>                   request for all properties.
>     * response: A map from each hit (sequence number) to a map from
>                 property names to lists of values.
>
>
>
> [1] http://wiki.freedesktop.org/wiki/WasabiSearchSimple
>

There has been generally good feedback on Magnus's proposal, so I updated
the wiki: http://wiki.freedesktop.org/wiki/WasabiSearchSimple
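
For anyone who wants to see the handle-based flow end to end, here is a
rough Python sketch of NewSearch, CountHits and GetHitProperties. It is
only an in-memory stand-in, not a D-Bus binding and not anyone's actual
implementation. The class name SearchService, the snake_case method
names, the substring matching rule and the sample documents are all
assumptions made for illustration. Hits are keyed by their sequence
number written as a string, which is just one reading of the "s" key in
the a{sa{sas}} signature.

```python
from typing import Dict, List

# A document is modelled as a map of property name -> list of values,
# mirroring the inner a{sas} part of the proposed response type.
Document = Dict[str, List[str]]


class SearchService:
    """Hypothetical in-memory stand-in for the proposed interface."""

    def __init__(self, documents: List[Document]) -> None:
        self._documents = documents
        self._searches: Dict[int, List[Document]] = {}
        self._next_handle = 1

    def new_search(self, query: str) -> int:
        # Toy matching rule (an assumption): case-insensitive substring
        # match against any property value.
        needle = query.lower()
        hits = [
            doc for doc in self._documents
            if any(needle in value.lower()
                   for values in doc.values() for value in values)
        ]
        # Each call gets a fresh handle, even for an identical query.
        handle = self._next_handle
        self._next_handle += 1
        self._searches[handle] = hits
        return handle

    def count_hits(self, search: int) -> int:
        return len(self._searches[search])

    def get_hit_properties(self, search: int, offset: int, limit: int,
                           properties: List[str]) -> Dict[str, Document]:
        hits = self._searches[search]
        response: Dict[str, Document] = {}
        for seq in range(offset, min(offset + limit, len(hits))):
            doc = hits[seq]
            # An empty property list is a request for all properties.
            wanted = properties or list(doc)
            response[str(seq)] = {
                name: doc[name] for name in wanted if name in doc
            }
        return response


# Made-up sample data for the usage example.
docs = [
    {"uri": ["file:///a.txt"], "title": ["Search notes"]},
    {"uri": ["file:///b.txt"], "title": ["Shopping list"]},
    {"uri": ["file:///c.txt"], "title": ["Search API draft"]},
]
service = SearchService(docs)
first = service.new_search("search")
second = service.new_search("search")
total = service.count_hits(first)
page = service.get_hit_properties(first, 0, 1, ["uri"])
rest = service.get_hit_properties(first, 1, 10, [])
```

Note that the two new_search calls with the same query string return
different handles, and that a hit is addressed only by search handle plus
sequence number, never by query string or URI.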

Cheers,
Mikkel