Simple search API proposal, take 2

Thu Jan 4 06:13:58 PST 2007

First some comments on the current draft[1]
"""""""""""""""""""""""""""""""""""""""""""

  I think it's a bad idea to use a query string to identify a search,
  for the following reasons:
  * It is inefficient to send a (possibly quite long) string with every
    call.
  * It isn't logical for the search engine to use the query string to
    look up the search, because a query might generate a different
    result depending on when the search is started.
  * An application might create different searches from the same query
    (string) with different results ("all files created this minute").

  For these reasons I propose to provide a *search handle* (probably
  just an integer value) for each search that is created.

  From what I read in the discussion, it seems problematic to use URIs
  as persistent identifiers for hits, both for the reasons already
  mentioned and because a hit is not the same thing as a document. Even
  if a URI were a persistent identifier for a document, it would be
  illogical to use it to identify a hit. For the same reasons, it would
  be even worse to use a query string together with a URI to identify a
  hit.

  Instead I support the idea of simply using sequence numbers (and a
  search handle) to identify a hit.
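
  As an illustration only, the handle idea can be sketched in a few
  lines of Python. The class name SearchRegistry, its methods and the
  run_query callback are hypothetical and not part of any draft; the
  sketch only shows that two searches started from the same query
  string get distinct handles and may hold different results, and that
  a hit is addressed by (search handle, sequence number).

```python
import itertools


class SearchRegistry:
    """Toy registry handing out integer search handles (hypothetical)."""

    def __init__(self):
        self._next_handle = itertools.count(1)
        self._searches = {}  # handle -> list of hits, fixed when the search starts

    def new_search(self, query, run_query):
        # run_query stands in for the engine; it runs once per search, so the
        # same query string can produce different searches over time.
        handle = next(self._next_handle)
        self._searches[handle] = list(run_query(query))
        return handle

    def hit(self, handle, sequence_number):
        # A hit is identified by (search handle, sequence number), not by URI.
        return self._searches[handle][sequence_number]


# The "engine" returns more files the second time the same query is run.
results_over_time = iter([["a.txt"], ["a.txt", "b.txt"]])
registry = SearchRegistry()
first = registry.new_search("all files created this minute",
                            lambda q: next(results_over_time))
second = registry.new_search("all files created this minute",
                             lambda q: next(results_over_time))
```

  Here first and second are different handles even though the query
  string is identical, and only the second search has a hit with
  sequence number 1.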

Highlighting, streaming and snippets
""""""""""""""""""""""""""""""""""""

  It isn't clear exactly what a snippet is. My guess is that it is a
  selected part or summary of the document that demonstrates especially
  well why it matched, possibly with highlighting, and that it isn't
  stored in the index but generated dynamically. Correct?

  I have brought up the question of whether a document streaming
  infrastructure is needed. Now that I see highlighting is to be
  supported, document streaming seems to be needed anyway.

  The highlighting cannot be done by the application; it must be done
  by the search engine. Just highlighting every word from the query
  string isn't correct. The search engine's knowledge is needed to get
  it right. This means that to highlight a document (or a selected part
  of it) there is no other way than to stream the document through the
  search engine to the application.
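
  A small Python sketch of why literal query-word highlighting falls
  short. It assumes, purely for illustration, that the engine matches
  words by a crude stemming rule; toy_stem, naive_highlight and
  engine_highlight are made-up names, and a real engine's matching
  logic may be entirely different.

```python
import re


def toy_stem(word):
    # Deliberately crude stand-in for whatever normalisation an engine uses.
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def _mark(text, should_mark):
    # Wrap every word for which should_mark(word) is true in asterisks.
    return re.sub(
        r"\w+",
        lambda m: f"*{m.group(0)}*" if should_mark(m.group(0)) else m.group(0),
        text,
    )


def naive_highlight(text, query):
    # What an application can do alone: mark literal query words only.
    words = set(query.lower().split())
    return _mark(text, lambda w: w.lower() in words)


def engine_highlight(text, query):
    # What the engine can do: mark every word its own matching rule accepts.
    stems = {toy_stem(w) for w in query.split()}
    return _mark(text, lambda w: toy_stem(w) in stems)


text = "Indexing files and indexed documents"
naive = naive_highlight(text, "index")
smart = engine_highlight(text, "index")
```

  Under this toy rule the document matches the query "index", yet the
  naive version marks nothing, while the engine-side version marks
  "Indexing" and "indexed".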

  If snippets are going to be supported it will be easy to also support
  delivering the whole document highlighted, and even easier to just
  deliver the whole document.

  Streaming the document means automatically converting it into a
  requested format (something that the indexer can extract words from
  or something that an application can show). Doing this is actually no
  big deal; doing the highlighting is the hard part.

  The benefit of being able to stream documents like this is that the
  documents don't need to be accessible in a way an application can
  understand (they are not required to have a URI).

  I'm not saying this is a feature we can't live without, but we
  practically get it for free if snippets are going to be supported.

Properties for hits
"""""""""""""""""""

  Hits are not the same thing as documents, so there are really two
  kinds of properties: properties of the hit and properties of the
  document. The properties of the hit include information on why the
  document matched the query and a link to the matching document. This
  link might be kept secret by the search engine, but a URI might be
  provided as a property of the document. The properties of the
  document are of course the usual document metadata. Some of these
  might be stored in the search engine's index and some might be
  extracted from the document dynamically, but that doesn't matter. The
  properties belonging to the document (as well as the document itself)
  can be accessed independently of a search; the ones belonging to the
  hit cannot.
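
  The split between the two kinds of properties could be modelled
  roughly as follows. This is only a sketch: the Document and Hit
  classes and the property names ("title", "uri", "matched_terms") are
  assumptions made up for the example, not names from the draft.

```python
from dataclasses import dataclass


@dataclass
class Document:
    # Usual document metadata; a URI is just one optional property.
    properties: dict


@dataclass
class Hit:
    search: int            # search handle the hit belongs to
    sequence_number: int   # position in that search's result list
    document: Document     # the link; the engine may keep it private
    properties: dict       # hit-only data, e.g. why the document matched


doc = Document({"title": ["Notes"], "uri": ["file:///tmp/notes.txt"]})
hit = Hit(search=1, sequence_number=0, document=doc,
          properties={"matched_terms": ["notes"]})

# Document properties are reachable without any search ...
standalone_title = doc.properties["title"]
# ... while hit properties only exist in the context of a search.
why_matched = hit.properties["matched_terms"]
```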

The actual proposal
"""""""""""""""""""

ShowConfiguration ( )

    Open a graphical interface for configuring the search tool.

NewSearch ( in s query , out i search )

    Start a new search from a query string.
    * query: The query string to execute.
    * search: A handle that is used to uniquely identify this search.

CountHits ( in i search , out i count )

    Count the number of hits from a particular search. Used for paging
    and suggestion popups with hit counts.
    * search: A handle that is used to uniquely identify a search.
    * count: The number of hits from this search.

GetHitProperties ( in i search, in i offset, in i limit,
                   in as properties, out a{sa{sas}} response )

    Get properties for the given hits. URIs and snippets are just
    properties.
    * search: A handle that is used to uniquely identify a search.
    * offset: The offset in the result list for the first returned
              result.
    * limit: The maximum number of results that should be returned.
    * properties: A list of properties to return. An empty list is a
                  request for all properties.
    * response: A map from each hit (sequence number) to a map from
                property name to a list of values.
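
To make the shape of the calls concrete, here is a minimal in-memory
sketch in Python. It is not a D-Bus service and not a reference
implementation: the class ToySearchService, the substring matching
rule and the "title"/"uri" property names are assumptions for the
example. Because the response signature a{sa{sas}} has string keys,
the sketch assumes the sequence number is sent as a decimal string.

```python
class ToySearchService:
    """In-memory stand-in for the proposed interface (hypothetical)."""

    def __init__(self, corpus):
        # corpus: list of property maps, each {property name: [values]}
        self._corpus = corpus
        self._searches = []  # a search handle is an index into this list

    def new_search(self, query):
        # Toy matching rule (assumption): case-insensitive substring of a title.
        needle = query.lower()
        hits = [doc for doc in self._corpus
                if any(needle in title.lower() for title in doc.get("title", []))]
        self._searches.append(hits)
        return len(self._searches) - 1

    def count_hits(self, search):
        return len(self._searches[search])

    def get_hit_properties(self, search, offset, limit, properties):
        hits = self._searches[search]
        response = {}
        for seq in range(offset, min(offset + limit, len(hits))):
            props = hits[seq]
            if properties:  # an empty list is a request for all properties
                props = {k: v for k, v in props.items() if k in properties}
            # Keys are strings to fit a{sa{sas}} (assumption, see above).
            response[str(seq)] = {k: list(v) for k, v in props.items()}
        return response


service = ToySearchService([
    {"title": ["Search API"], "uri": ["file:///a"]},
    {"title": ["Search notes"], "uri": ["file:///b"]},
    {"title": ["Other"], "uri": ["file:///c"]},
])
search = service.new_search("search")
total = service.count_hits(search)

# Page through the result list one hit at a time, asking only for URIs.
pages = [service.get_hit_properties(search, offset, 1, ["uri"])
         for offset in range(total)]
```

Note that the application never sends the query string again after
NewSearch; paging only needs the handle, an offset and a limit.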

[1] http://wiki.freedesktop.org/wiki/WasabiSearchSimple