Simple search API proposal, take 2

Fri Jan 5 10:38:54 PST 2007

On Thu, 4 Jan 2007 23:29:05 +0100
"Mikkel Kamstrup Erlandsen" <mikkel.kamstrup at gmail.com> wrote:

> 2007/1/4, Magnus Bergman <magnus.bergman at observer.net>:
> > First some comments on the current draft[1]
> > """""""""""""""""""""""""""""""""""""""""""
> >
> >   I think it's a bad idea to use a query-string to identify a
> > search for the following reasons:
> >   * It is inefficient to send a (possibly quite long) string for
> > every call.
> >   * It isn't logical for the search engine to use the query string
> > to look up the search because a query might generate a different
> > result depending on when the search is started.
> >   * An application might create different searches from the same
> > query (string) with different results ("all files created this
> > minute").
> >
> >   Because of these reasons I propose to provide a *search handle*
> >   (probably just an integer value) for each search that is created.
> 
> I think we should use strings as handles to allow search engines to
> put whatever they want in the handle.

I assume the search engine has some complex internal structure for each
search. Looking up the right one could as easily be done using just a
number, right? I also assume that each search is tied to a session. So
that one application cannot snoop on what another application is
searching for (by figuring out or guessing search-ids used by other
sessions).
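
To illustrate the idea, here is a minimal sketch of integer search
handles that are only valid within the session that created them. The
class, the method names and the session identifiers are hypothetical
and not part of any draft; a real engine would keep far more state per
search.

```python
import itertools


class SearchEngine:
    """Toy engine that hands out integer search handles per session."""

    def __init__(self):
        # session id -> {handle -> internal search state}
        self._searches = {}
        self._next_handle = itertools.count(1)

    def new_search(self, session, query):
        # Every call creates a new search, even for an identical query.
        handle = next(self._next_handle)
        self._searches.setdefault(session, {})[handle] = {"query": query}
        return handle

    def lookup(self, session, handle):
        # A handle is only valid inside the session that created it.
        try:
            return self._searches[session][handle]
        except KeyError:
            raise LookupError("unknown search handle for this session") from None


engine = SearchEngine()
a = engine.new_search("session-a", "all files created this minute")
b = engine.new_search("session-a", "all files created this minute")
```

The same query string yields two distinct handles here, and looking up
a handle from another session fails even if the number is guessed
correctly.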

> > From what I read in the discussion it seems problematic to use
> > URIs as persistent identifiers to identify a hit. Because of the
> > reasons already mentioned and because a hit is not the same thing
> > as a document. Even if a URI was a persistent identifier for a
> > document, it would be illogical to use it to identify a hit. And
> > because of this and the reasons mentioned above it would be even
> > worse to use a query string and a URI to identify a hit.
> > Instead I support the idea of simply using sequence numbers (and a
> > search handle) to identify a hit.
> 
> 
> I follow you on using opaque identifiers as to using opaque
> identifiers. But do you want to use other identifiers for the hits
> than the persistent identifiers for the objects? Or do you mean that
> each hit is uniquely determined by "persistent_id:search_handle" or
> something like that?

Do you by objects mean documents (aka items)? In that case, yes. A hit
would be identified by "search_id:hit_number" (and implicitly also the
session).

> > Highlighting, streaming and snippets
> > """"""""""""""""""""""""""""""""""""
> >
> >   It isn't clear what a snippet is exactly. But my guess is that it
> > is a selected part or summary of the document that especially well
> >   demonstrates why it matched, possibly with highlighting. And it
> > isn't stored in the index but dynamically generated. Correct?
> 
> Yes. Like Google hits fx.
> 
> >
> >   I have brought up the question about a need for a document
> > streaming infrastructure. But now I see that highlighting is to be
> > supported, so document streaming seems to be needed anyway.
> 
> I fail to see why you need full document streaming to do snippets. It
> is just a matter of returning a string with a bit of markup.

If the snippet is stored in the index it isn't needed (which is how
google does it, I think). But that would create huge indexes (typically
10%-20% of the size of the documents being indexed). It will also lower
the quality of the snippets (since there is no possibility to extract
the most relevant part of the document for a specific query). But I
assume that this is not what you intended since you wrote that this is
assumed to be slow (and getting snippets from the index isn't slow).
The other way to do it is to retrieve and scan the source document
(probably converting it if it's a word processor document, for example) and
generate the snippet dynamically. This is the way it's usually done if
the documents are locally stored and easily accessible. So, to get a
string to return you have to read the document, which might include
constructing it (for example extracting an e-mail attachment) and
converting it to something that can be displayed.

If the snippets are stored in the index I think they should rather be
called summaries (which might be created by the indexer or just extracted
from a document which already has one). So if an application requests a
summary (possibly with highlighting) it is guaranteed to be fast. And
if a snippet is requested it might be slow (but it might also be a more
relevant piece of text). The summary is of course a document property
(unlike the snippet which is a hit property) since it can be accessed
independently of a search.

> 
> 
> > The highlighting can not be done by the application, it must be
> > done by the search engine. Just highlighting every word from the
> > query string isn't correct. The knowledge from the search engine is
> > needed to get it right. This means that to highlight a document (or
> > a selected part of it) there is no other way to do it than to
> > stream the document through the search engine to the application.
> 
>  Again, why the whole document? The app just uses the snippet. I agree
> that the snippet generation and markup belong on the server side - in
> addition to what you mention there are also things such as stemming
> and what not.

I don't say that streaming the whole document is necessary, what I say
is that you get it for free if you have a document streaming framework
anyway. And the benefit of streaming the whole document is that you can
access documents which have no URI (they are not ordinary files, but
perhaps e-mail attachments). You can also get automatic document
conversion.

> 
> 
> >   If snippets are going to be supported it will be easy to also
> > support delivering the whole document highlighted, and even easier
> > to just deliver the whole document.
> 
> Why would the app want that? Also, how can we assume that the engine
> even has access to the document? A search engine might support having
> third party apps "inject" documents (and/or metadata) into them. This
> way the search engine wouldn't know how to retrieve the document.
> 
> Think for example a note taking app that stores notes on an sftp (or
> my_obscure_protocol) server.

It is true that I did assume that at least the search engine has access
to (and knows how to access) the documents. But are there also
scenarios where specific applications inject documents into the search
engine and are also the only application which can display them?

> 
> >
> >   Streaming the document means to automatically convert it into a
> >   requested format (something that the indexer can extract words
> > from or something that an application can show). Doing this is
> > actually no big deal, doing the highlighting is the hard part.
> >
> >   The benefit of being able to stream documents like this is that
> > the documents don't need to be accessible in a way an application
> > can understand (they are not required to have a URI).
> >
> >   I don't say this is a feature we can't live without. But we
> >   practically get it for free if snippets are going to be supported.
> >
> 
> Maybe I've misunderstood in my above remarks about what you consider
> streaming... Do you just want to "stream" the raw filtered text-only?
> Like fx. stripped html document (without any tags, just the text
> elements contained).

No, I want the stream to be whatever mimetype the application (and the
indexer) requests. The indexer might request raw text, and some
application might request html (in order to get highlighting).
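
A rough sketch of what "streaming in a requested mimetype" could look
like. The function name, the chunk size and the two supported types
are assumptions for illustration only; real conversion (and
highlighting in particular) would be far more involved than escaping
text.

```python
import html


def stream_document(text, mimetype, chunk_size=16):
    """Yield the document in chunks, converted to the requested type."""
    if mimetype == "text/plain":
        convert = str          # the indexer's case: raw text, unchanged
    elif mimetype == "text/html":
        convert = html.escape  # stand-in for a real HTML conversion
    else:
        raise ValueError("unsupported mimetype: " + mimetype)
    for start in range(0, len(text), chunk_size):
        yield convert(text[start:start + chunk_size])


plain = "".join(stream_document("a < b", "text/plain", chunk_size=2))
as_html = "".join(stream_document("a < b", "text/html", chunk_size=2))
```

The caller never needs a URI for the document; it only reads chunks
from whatever the engine hands back.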

> 
> >
> >
> > Properties for hits
> > """""""""""""""""""
> >
> > Hits are not the same thing as documents, so these are really both
> > properties of the hits and properties of the document. The
> > properties of the hits include information on why the document
> > matched the query and link to the matching document. This link
> > might be kept secret by the search engine, but a URI might be
> > provided as a property of the document. The properties of the
> > document are of course the usual document meta data. Some of these
> > might be stored in the search engine's index, some might be
> > extracted from the document dynamically, but that doesn't matter.
> > The properties belonging to the document (as well as the document
> > itself) can be accessed independently of a search, the ones
> > belonging to the hit can not.
> >
> 
> Right, agreed. While apps wanting to play around with only document
> properties (and not hit properties), would typically want an interface
> to a metadata server instead of a search engine, no? Fx. they might
> want to be able to set metadata properties as well.

My idea was to throw in the document properties (known by the search
engine) together with the hit properties for convenience (all of them
read only). And not provide a way to change the document properties
using this API. Is your suggestion to separate the hit and document
objects and end up with something like this?

snippet = hit::get_property(hit_id,"snippet")

document_id = hit::get_property(hit_id,"document")

title = document::get_property(document_id,"title")

(And the last call might belong to a whole different API.)
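
The two alternatives can be sketched side by side. Everything below is
hypothetical: the in-memory tables, the function names and the
document id "doc-7" are made up for illustration, and a hit is keyed
by (search handle, hit number) as suggested earlier.

```python
# Hypothetical in-memory data; a real engine would consult its index.
DOCUMENTS = {
    "doc-7": {"title": ["Meeting notes"],
              "uri": ["file:///home/user/notes.txt"]},
}
HITS = {
    (1, 0): {"document": ["doc-7"],
             "snippet": ["... <b>meeting</b> at noon ..."]},
}


def hit_get_property(search, hit_number, name):
    """Separated API: only properties that belong to the hit itself."""
    return HITS[(search, hit_number)].get(name, [])


def document_get_property(document_id, name):
    """Separated API: document metadata, usable without any search."""
    return DOCUMENTS[document_id].get(name, [])


def hit_get_merged_property(search, hit_number, name):
    """Convenience API: fall back to the document's properties."""
    values = hit_get_property(search, hit_number, name)
    if values:
        return values
    (document_id,) = hit_get_property(search, hit_number, "document")
    return document_get_property(document_id, name)


title = hit_get_merged_property(1, 0, "title")
```

With the merged variant the application needs one call to get the
title; with the separated variant it needs the hit's "document"
property first.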

> 
> >
> >
> > The actual proposal
> > """""""""""""""""""
> >
> > ShowConfiguration ( )
> >
> >     Open a graphical interface for configuring the search tool.
> >
> >
> > NewSearch ( in s query , out i search )
> >
> >     Start a new search from a query string.
> >     * query: The query string to execute.
> >     * search: A handle that is used to uniquely identify this
> > search.
> >
> >
> > CountHits ( in i search , out i count )
> >
> >     Count the number of hits from a particular search. Used for
> > paging and suggestion popups with hit counts.
> >     * search: A handle that is used to uniquely identify a search.
> >     * count: The number of hits from this search.
> >
> >
> > GetHitProperties ( in i search, in i offset, in i limit,
> >                    in as properties, out a{sa{sas}} response )
> >
> >     Get properties for the given hits. URIs and snippets are just
> >     properties.
> >     * search: A handle that is used to uniquely identify a search.
> >     * offset: The offset in the result list for the first returned
> >               result.
> >     * limit: The maximum number of results that should be returned.
> >     * properties: A list of properties to return. An empty list is a
> >                   request for all properties.
> >     * response: A map mapping each hit (sequence number) to a map of
> >                 property-list of values pairs.
> >
> > [1] http://wiki.freedesktop.org/wiki/WasabiSearchSimple
> >
> 
> I like your suggestion.
> 
> Did you leave out the snippet part because you consider it a hit
> property? It should be noted that the various search engine devs
> consider snippet extraction an expensive operation. An app could issue
> two GetHitProperties calls of course, but having a separate method for
> this might serve as a warning...

Yes, I considered it a hit property. And yes, my idea was that apps could
issue two GetHitProperties calls. Is it really a common case that
applications are interested in just a few of the snippets? I imagine
that applications show snippets for all entries or none at all.
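
A minimal sketch of the two-call pattern, assuming a plain Python
function standing in for GetHitProperties. The hit data is made up,
and keying the response by the sequence number as a string is an
assumption based on the a{sa{sas}} signature.

```python
def get_hit_properties(hits, offset, limit, properties):
    """hits: list of {property: [values]} dicts for one search."""
    response = {}
    window = hits[offset:offset + limit]
    for number, hit in enumerate(window, start=offset):
        if properties:
            selected = {name: hit[name] for name in properties if name in hit}
        else:
            selected = dict(hit)  # an empty list requests all properties
        response[str(number)] = selected
    return response


hits = [
    {"uri": ["file:///a.txt"], "snippet": ["first <b>match</b>"]},
    {"uri": ["file:///b.txt"], "snippet": ["second <b>match</b>"]},
    {"uri": ["file:///c.txt"], "snippet": ["third <b>match</b>"]},
]

# First call: cheap properties only, so the list can be shown quickly.
cheap = get_hit_properties(hits, 1, 2, ["uri"])
# Second call: the (possibly slow) snippets for the same page of hits.
snippets = get_hit_properties(hits, 1, 2, ["snippet"])
```

An application that shows no snippets simply never makes the second
call.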

> 
> My main concern about your suggestion is that it is leaning towards
> the live api proposal (with opaque ids instead of uri as identifiers
> http://wiki.freedesktop.org/wiki/WasabiSearchLive). That might again
> also be a good idea. If the two interfaces become very alike we
> might even reconsider if we should have a simple api at all.
> 
> I do consider your suggestion to be so much simpler that it does
> warrant its own api. It is very sync in nature after all.

I like the idea of having a simple (synchronous) API. But I think the
ideal situation would be if the simple API was just a subset of the
full one. I suggested the possibility to register callbacks
(subscribe to signals) and at the same time make all calls
non-blocking in my first proposal. (You got that, right? There have been
some problems with mail delivery here.) That would be one way to have
things both ways without duplicating any methods. I will sketch a new
proposal for a unified API next week.