Simple search API proposal, take 2

Thu Jan 4 14:29:05 PST 2007

2007/1/4, Magnus Bergman <magnus.bergman at observer.net>:
> First some comments on the current draft[1]
> """""""""""""""""""""""""""""""""""""""""""
>
>   I think it's a bad idea to use a query-string to identify a search for
>   the following reasons:
>   * It is inefficient to send a (possibly quite long) string for every
>     call.
>   * It isn't logical for the search engine to use the query string to
>     look up the search, because a query might generate a different
>     result depending on when the search is started.
>   * An application might create different searches from the same query
>     (string) with different results ("all files created this minute").
>
>   Because of these reasons I propose to provide a *search handle*
>   (probably just an integer value) for each search that is created.

I think we should use strings as handles to allow search engines to
put whatever they want in the handle.

>
>   From what I read in the discussion it seems problematic to use URIs
>   as persistent identifiers to identify a hit, because of the reasons
>   already mentioned and because a hit is not the same thing as a
>   document. Even if a URI was a persistent identifier for a document, it
>   would be illogical to use it to identify a hit. And because of this
>   and the reasons mentioned above it would be even worse to use a query
>   string and a URI to identify a hit.
>   Instead I support the idea of simply using sequence numbers (and a
>   search handle) to identify a hit.

I follow you on using opaque identifiers rather than URIs. But do you
want to use other identifiers for the hits than the persistent
identifiers for the objects? Or do you mean that each hit is uniquely
determined by "persistent_id:search_handle" or something like that?
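
The two readings of "what identifies a hit" can be put side by side in
a small Python sketch. Everything here is hypothetical and only for
illustration: the names HitRef and composite_hit_id, the string
handles "search-1"/"search-2" and the persistent id "doc-42" are not
part of any draft.

```python
from typing import NamedTuple


class HitRef(NamedTuple):
    """Reading 1: a hit is a position in one search's result list."""
    search_handle: str   # opaque handle, assumed to be a string here
    sequence: int        # sequence number within that search


def composite_hit_id(persistent_id: str, search_handle: str) -> str:
    """Reading 2: a hit is "persistent_id:search_handle"."""
    return f"{persistent_id}:{search_handle}"


doc = "doc-42"  # hypothetical persistent identifier of one object

# The same object found by two searches gives two distinct hits
# under either reading.
hit_a = composite_hit_id(doc, "search-1")
hit_b = composite_hit_id(doc, "search-2")
ref_a = HitRef("search-1", 0)
ref_b = HitRef("search-2", 0)
```

Under both readings the hit only makes sense together with its search
handle. The difference is whether the other half is the object's
persistent id or just its position in the result list.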

>
>
> Highlighting, streaming and snippets
> """"""""""""""""""""""""""""""""""""
>
>   It isn't clear what a snippet is exactly. But my guess is that it is a
>   selected part or summary of the document that demonstrates especially
>   well why it matched, possibly with highlighting. And it isn't stored
>   in the index but dynamically generated. Correct?

Yes. Like Google hits, for example.

>
>   I have brought up the question about a need for a document streaming
>   infrastructure. But now I see that highlighting is to be supported,
>   so document streaming seems to be needed anyway.

I fail to see why you need full document streaming to do snippets. It
is just a matter of returning a string with a bit of markup.

>   The highlighting cannot be done by the application, it must be done
>   by the search engine. Just highlighting every word from the query
>   string isn't correct. The knowledge of the search engine is needed to
>   get it right. This means that to highlight a document (or a selected
>   part of it) there is no other way to do it than to stream the
>   document through the search engine to the application.

Again, why the whole document? The app just uses the snippet. I agree
that the snippet generation and markup belong on the server side. In
addition to what you mention, there are also things such as stemming
and whatnot.
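
A toy Python sketch of why highlighting needs engine-side knowledge
such as stemming. The suffix-stripping "stemmer" and the <b> markup
are assumptions made up for this example. A real engine would use its
own analyzer and whatever markup the API ends up specifying.

```python
import re


def toy_stem(word):
    """Crude stand-in for a stemmer: lower-case, strip a few suffixes."""
    word = word.lower()
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def highlight_naive(text, query):
    """What an app could do alone: mark exact query words only."""
    terms = {t.lower() for t in query.split()}

    def mark(match):
        word = match.group(0)
        return f"<b>{word}</b>" if word.lower() in terms else word

    return re.sub(r"\w+", mark, text)


def highlight_stemmed(text, query):
    """What the engine could do: mark words sharing a stem with the query."""
    stems = {toy_stem(t) for t in query.split()}

    def mark(match):
        word = match.group(0)
        return f"<b>{word}</b>" if toy_stem(word) in stems else word

    return re.sub(r"\w+", mark, text)


text = "Indexing files makes searching fast"
query = "index search"
naive = highlight_naive(text, query)      # nothing gets marked
stemmed = highlight_stemmed(text, query)  # "Indexing" and "searching" marked
```

The document matched because of stemming, yet the naive version marks
nothing, which is the argument for keeping markup on the server side.
Note that both functions only need the snippet text, not the whole
document.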

>   If snippets are going to be supported it will be easy to also support
>   delivering the whole document highlighted, and even easier to just
>   deliver the whole document.

Why would the app want that? Also, how can we assume that the engine
even has access to the document? A search engine might support having
third-party apps "inject" documents (and/or metadata) into them. In
that case the search engine wouldn't know how to retrieve the document.

Think, for example, of a note-taking app that stores notes on an sftp
(or my_obscure_protocol) server.

>
>   Streaming the document means to automatically convert it into a
>   requested format (something that the indexer can extract words from
>   or something that an application can show). Doing this is actually no
>   big deal, doing the highlighting is the hard part.
>
>   The benefit of being able to stream documents like this is that the
>   documents don't need to be accessible in a way an application can
>   understand (they are not required to have a URI).
>
>   I don't say this is a feature we can't live without. But we
>   practically get it for free if snippets are going to be supported.
>

Maybe I've misunderstood, in my remarks above, what you consider
streaming... Do you just want to "stream" the raw, filtered text only?
For example a stripped HTML document (without any tags, just the text
elements it contains).

>
>
> Properties for hits
> """""""""""""""""""
>
>   Hits are not the same thing as documents, so these are really both
>   properties of the hits and properties of the document. The properties
>   of the hits include information on why the document matched the query
>   and a link to the matching document. This link might be kept secret by
>   the search engine, but a URI might be provided as a property of the
>   document. The properties of the document are of course the usual
>   document metadata. Some of these might be stored in the search
>   engine's index, some might be extracted from the document dynamically,
>   but that doesn't matter. The properties belonging to the document (as
>   well as the document itself) can be accessed independently of a
>   search, the ones belonging to the hit can not.
>

Right, agreed. Though apps wanting to play around with only document
properties (and not hit properties) would typically want an interface
to a metadata server instead of a search engine, no? For example, they
might want to be able to set metadata properties as well.

>
>
> The actual proposal
> """""""""""""""""""
>
> ShowConfiguration ( )
>
>     Open a graphical interface for configuring the search tool.
>
>
> NewSearch ( in s query , out i search )
>
>     Start a new search from a query string.
>     * query: The query string to execute.
>     * search: A handle that is used to uniquely identify this search.
>
>
> CountHits ( in i search , out i count )
>
>     Count the number of hits from a particular search. Used for paging
>     and suggestion popups with hit counts.
>     * search: A handle that is used to uniquely identify a search.
>     * count: The number of hits from this search.
>
>
> GetHitProperties ( in i search, in i offset, in i limit,
>                    in as properties, out a{sa{sas}} response )
>
>     Get properties for the given hits. URIs and snippets are just
>     properties.
>     * search: A handle that is used to uniquely identify a search.
>     * offset: The offset in the result list for the first returned
>               result.
>     * limit: The maximum number of results that should be returned.
>     * properties: A list of properties to return. An empty list is a
>                   request for all properties.
>     * response: A map from each hit (sequence number) to a map from
>                 property name to a list of values.
>
> [1] http://wiki.freedesktop.org/wiki/WasabiSearchSimple
>

I like your suggestion.
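
As a rough illustration of the three search methods as proposed, here
is a minimal in-memory sketch in Python. It is not a D-Bus service:
the class name SimpleSearch, the toy document index, the substring
matching and the "length" property are all assumptions invented for
this example. Only the method shapes follow the proposal, with the
response modelled as a{sa{sas}} (string keys, lists of string values).

```python
import itertools


class SimpleSearch:
    """Toy stand-in for the proposed interface (no D-Bus involved)."""

    def __init__(self, documents):
        # documents: {uri: text}, a made-up index for illustration only.
        self._documents = documents
        self._searches = {}                 # handle -> ordered list of URIs
        self._handles = itertools.count(1)  # integer handles, as proposed

    def new_search(self, query):
        """NewSearch ( in s query , out i search )"""
        needle = query.lower()
        hits = [uri for uri, text in self._documents.items()
                if needle in text.lower()]
        handle = next(self._handles)
        self._searches[handle] = hits
        return handle

    def count_hits(self, search):
        """CountHits ( in i search , out i count )"""
        return len(self._searches[search])

    def get_hit_properties(self, search, offset, limit, properties):
        """GetHitProperties; an empty property list means "all"."""
        hits = self._searches[search]
        response = {}
        for seq in range(offset, min(offset + limit, len(hits))):
            uri = hits[seq]
            available = {
                "uri": [uri],
                "length": [str(len(self._documents[uri]))],
            }
            wanted = properties or list(available)
            # Hits are keyed by sequence number, as a string to fit "s".
            response[str(seq)] = {name: available[name]
                                  for name in wanted if name in available}
        return response


engine = SimpleSearch({
    "file:///a.txt": "Wasabi search notes",
    "file:///b.txt": "shopping list",
    "file:///c.txt": "Search API draft",
})
first = engine.new_search("search")
second = engine.new_search("search")   # same query, separate handle
count = engine.count_hits(first)
page = engine.get_hit_properties(first, 0, 1, ["uri"])
rest = engine.get_hit_properties(first, 1, 10, [])
```

Two searches created from the same query string get different
handles, which is the point made against using the query string as
the identifier.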

Did you leave out the snippet part because you consider it a hit
property? It should be noted that the various search engine devs
consider snippet extraction an expensive operation. An app could issue
two GetHitProperties calls of course, but having a separate method for
this might serve as a warning...
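
The two-call pattern could look like the Python sketch below. It is a
toy under stated assumptions: the Hits class, the property names
"uri", "title" and "snippet", and the counter standing in for
"expensive work" are all hypothetical and not taken from any draft.

```python
class Hits:
    """Toy result list: "snippet" is built on demand, the rest is stored."""

    CHEAP = ("uri", "title")

    def __init__(self, query, docs):
        self.query = query
        self.docs = docs          # list of dicts: "uri", "title", "text"
        self.snippets_built = 0   # stands in for the expensive work

    def _snippet(self, doc, width=20):
        # Pretend this is costly: cut a window around the first match.
        self.snippets_built += 1
        text = doc["text"]
        pos = max(text.lower().find(self.query.lower()), 0)
        start = max(pos - width, 0)
        return text[start:pos + len(self.query) + width]

    def get_hit_properties(self, offset, limit, properties):
        # An empty list is a request for all properties, snippet included.
        wanted = properties or list(self.CHEAP) + ["snippet"]
        response = {}
        page = self.docs[offset:offset + limit]
        for seq, doc in enumerate(page, start=offset):
            props = {}
            for name in wanted:
                if name == "snippet":
                    props[name] = [self._snippet(doc)]
                elif name in self.CHEAP:
                    props[name] = [doc[name]]
            response[str(seq)] = props
        return response


hits = Hits("wasabi", [
    {"uri": "file:///1", "title": "One", "text": "All about wasabi search"},
    {"uri": "file:///2", "title": "Two", "text": "Wasabi live API notes"},
    {"uri": "file:///3", "title": "Three", "text": "More wasabi ideas"},
])

# Call 1: cheap properties for the whole page, no snippets generated.
listing = hits.get_hit_properties(0, 3, ["uri", "title"])
after_listing = hits.snippets_built

# Call 2: the snippet only for the hit that is actually displayed.
shown = hits.get_hit_properties(0, 1, ["snippet"])
after_shown = hits.snippets_built

# With snippets as ordinary properties, "all properties" pays for them too.
everything = hits.get_hit_properties(0, 3, [])
after_everything = hits.snippets_built
```

The last call shows the possible pitfall: if snippets are ordinary
properties, an empty property list silently triggers snippet
extraction for every hit in the range.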

My main concern about your suggestion is that it is leaning towards
the live API proposal (with opaque ids instead of URIs as identifiers,
http://wiki.freedesktop.org/wiki/WasabiSearchLive). Then again, that
might also be a good idea. If the two interfaces become very alike, we
might even reconsider whether we should have a simple API at all.

I do consider your suggestion to be so much simpler that it does
warrant its own API. It is very sync in nature after all.

Cheers,
Mikkel