simple search api (was Re: mimetype standardisation by testsets)

Tue Jan 2 08:28:26 PST 2007

(Resending with CC to xdg list. Now I have been bugged out on diseases
for a couple of weeks and have to apologise for the late reply.)

> Sorry for the late reply I've been totally bugged out on diseases and
> work! Here goes :-)
> 
> Because of the long nature of this mail I summarize some important
> questions in the bottom of the mail...
> 
> 2006/11/30, Magnus Bergman <magnus.bergman at observer.net>:
> >
> >
> > After reading everything in this thread and considering all the
> > concerns mentioned, I think it's time I came up with something
> > concrete myself, instead of just criticizing others. From the
> > requirements and suggestions mentioned on this list I tried to come
> > up with a proposal with these things in mind:
> > * The possibility for application authors to do searches very
> >   easily.
> > * The possibility to do both synchronous and asynchronous searches
> >   without having two different APIs.
> > * Not ruling out the possibility to use a dbus interface directly.
> > * Not ruling out the possibility to have a library.
> > * Not causing the search engine to do unnecessary work (like
> >   repeating searches if the hits need to be retrieved again)
> >
> > One problem that I choose to leave out (for now) is the need to be
> > able to stream documents which have a URL the applications cannot
> > understand (documents which are not files). This includes files
> > inside other files and virtual documents that are constructed on
> > demand. But at least I have had this in mind so it's not impossible
> > to add it later.
> 
> Well, I don't think this belongs in a search API as such. This
> functionality sounds more like a metadata storage to me... Which is
> planned for standardization later - so, yeah, let's keep the options
> open, but punt the issue for now.
> 
> 
> > Disclaimer: The names of the functions are not part of the proposal,
> > they are just chosen to illustrate what the functions do. And this
> > proposal does not suggest a library API over a dbus interface, the
> > exact same idea applies to both cases. (It is also agnostic to
> > whatever query language is used.)
> >
> > First a set of three basic functions that alone do most things and
> > are probably sufficient for everybody who wants a simple API:
> >
> >   session_handle = session_new()
> >
> >     Creates a new session and returns a new session handle.
> >     Creating a new session might involve finding an appropriate
> >     search engine and getting it ready (exactly what happens here is
> >     not important). This might just be to open a dbus connection. I
> >     think it's OK if this call is blocking(?). Applications would
> >     probably want to call it during startup and it should not take
> >     that long to do whatever needs to be done here (which of course
> >     depends on the search engine).
> >
> >   search_handle = search_new(session_handle,query_string)
> >
> >     Starts a new search and returns a new search handle. By default
> >     this function blocks until the search has been performed and the
> >     number of hits is known (see below).
> >
> >   hits = search_get_hits(search_handle,max_number_of_hits)
> >
> >     Fetches a number of hits from the search. Each hit is a set of
> >     attributes for the hit (by default it might be URL, score and
> >     perhaps something else important). It can be called several
> >     times to retrieve more hits (much like read(2)). The hits are
> >     sorted, by default by their score.
> 
> 
> 
> Ok, more minimalistic than the current simple interface, but I guess
> it could work.
> 
> When I compared this interface to the current live one proposed at
> http://wiki.freedesktop.org/wiki/WasabiSearchLive, my first thought
> was that your session object was equivalent to the dbus connection
> made by the application. In my proposal the app then uses the
> connection/session to obtain a Query object with the NewQuery()
> method.

If dbus is used (which I'm not against in any way) the session will
probably map directly to a dbus-connection, yes. The term "query" seems
to be a little ambiguous. It often refers to a question which hasn't
yet been sent to the search engines (but might be a compiled binary
object). And sometimes it refers to an object which maps to a
search-event in the search-engine. There seems to be some
misunderstanding related to this in the thread. So I propose the
following terminology to avoid confusion:

. Query: A question or a fragment of a question (sub-query), possibly
         compiled, which is used to initiate a search event in the
         search engine. (This object will probably not be visible in a
         simple search API, searches can be created from a query-string
         directly.)

. Search: Created from a query (and possibly other data) and refers to
          a specific search event.

. Document: A logical unit of information (text), usually a file.

. Hit: The connection between a search and a document.

. Index: (aka collection) A set of documents which can be searched.
         (Abstracting this concept is probably not relevant at the
         moment; it's more important when several search engines are
         used simultaneously.)

Note that a query doesn't necessarily map to the same search (for the
same set of documents). If a query says something like "all documents
created in the last four hours" it will result in a different search
each time it's executed. This is one of the reasons I think it's a bad idea
to use the query-string to identify searches.
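
To make the terminology a bit more concrete, here is a minimal Python
sketch. The class names follow the terms above, but the fields, the
run_search() helper and the toy index are assumptions made up purely
for illustration; they are not part of the proposal.

```python
from dataclasses import dataclass
from itertools import count

_search_ids = count(1)  # hypothetical source of unique search ids


@dataclass(frozen=True)
class Query:
    text: str  # a query string; could just as well be a compiled form


@dataclass(frozen=True)
class Document:
    url: str


@dataclass(frozen=True)
class Hit:
    search_id: int      # the search this hit belongs to
    document: Document  # the document it points at
    score: float


@dataclass
class Search:
    query: Query
    search_id: int
    hits: list


def run_search(query, index, now):
    """Toy engine: 'index' maps URL -> creation time in seconds.

    It ignores the query text and always matches documents created in
    the four hours before 'now', to mimic a time-relative query.
    """
    sid = next(_search_ids)
    cutoff = now - 4 * 3600
    hits = [Hit(sid, Document(url), 1.0)
            for url, created in sorted(index.items())
            if created >= cutoff]
    return Search(query, sid, hits)


index = {"file:///a.txt": 0, "file:///b.txt": 10_000}
q = Query("all documents created in the last four hours")
first = run_search(q, index, now=12_000)
later = run_search(q, index, now=20_000)
```

The same Query object gives two distinct searches with different hits,
which is why a search needs its own handle rather than being looked up
by its query-string.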

> From an application developer's point of view this might be correct. I
> just forgot to look at this through my search engine developer's
> glasses :-) From the search engine's perspective it might actually be
> nice to have a parent session for each query. This is actually
> something we have to ask the search engine developers about. See the
> bottom of this mail.
> 
> 
> > For slightly less simple use there are some more functions:
> >
> >   session_free(session_handle)
> >
> >     Frees all resources related to the session. This includes all
> >     searches created from the session.
> 
> 
> 
> I think this should be available in a simple api if we use Session
> objects.
> - I do realise that you only want one api though :-)
> 
> 
> >   session_set_search_finished_signal(session_handle,signal_handler)
> >
> >     Sets a signal handler which is invoked when a search has
> >     finished. The signal handler gets the search handle back so
> >     different searches can be told apart. If this is set the
> >     function search_new() will not block.
> 
> 
> 
> It seems simpler to me that the application simply asks "are you
> done?" each time it receives a new batch of hits. A bit more dbus
> traffic, but not much. Thus having a
> search_is_finished(search_handle) instead (which you actually define
> below).

That doesn't work because the search engine might not know whether a hit
is the last one or not. It could wait until it has at least one more hit
than the client will ask for (or until the search is finished), but that
doesn't work either (at least not with my proposal) since it doesn't
know how many hits the client will ask for. Another solution is that
the engine sends the same "hits available" signal as when there are
hits available, but with the number of available hits set to zero.
But this is pretty much the same thing as having two different signals
(callbacks).
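
A small Python sketch of why I consider the two variants equivalent.
The adapt() helper and the "zero means finished" convention are
assumptions for illustration only; nothing here is an existing API.

```python
def adapt(on_progress, on_finished):
    """Turn one 'hits available' signal into two separate callbacks.

    Assumed convention: a count of zero means the search is finished
    and no more hits will arrive.
    """
    def on_hits_available(count):
        if count == 0:
            on_finished()
        else:
            on_progress(count)
    return on_hits_available


events = []
handler = adapt(lambda n: events.append(("progress", n)),
                lambda: events.append(("finished",)))

# The engine announces 3 hits, then 2 more, then signals the end.
for announced in (3, 2, 0):
    handler(announced)
```

The wrapper is only a few lines, so which of the two forms goes on the
wire is mostly a matter of taste.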

> >   session_set_search_progress_signal(session_handle,signal_handler)
> >
> >     Sets a signal handler which is invoked when there are new hits
> >     available (hits which haven't been retrieved with
> >     search_get_hits()). The signal handler gets the search handle,
> >     maybe some approximation of the percentage of progress and
> >     maybe the number of new hits for convenience.
> 
> 
> 
> Why not return the hits with the signal? I see something cool in not
> returning the results until the application specifically requests them
> though. It reminds me somewhat of the way spotlight does it, and it is
> also closer to what libbeagle does. This is an important point - I
> added it to the bottom of this mail.

I think it's easier to have just one function for getting the hits.
Besides, an application might want to know about new hits without
actually getting them (perhaps it displays just a few hits at a time).
And perhaps it wants to update a statusbar telling about the hits quite
frequently but get the hits in chunks less frequently. You might say
this doesn't belong in a simple API, but I think the API is still quite
simple even if it's possible to do some less simple things with it.
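
As a rough illustration of the statusbar case, here is a Python sketch
in which the progress signal only carries a count and the application
decides when to fetch. ToySearch, engine_found() and the chunk size of
four are hypothetical stand-ins, not part of any real engine.

```python
class ToySearch:
    """Stand-in for an engine-side search object."""

    def __init__(self):
        self._hits = []          # everything the engine has found so far
        self._cursor = 0         # number of hits handed to the client
        self._on_progress = None

    def set_progress_handler(self, handler):
        self._on_progress = handler

    def engine_found(self, urls):
        # The engine announces new hits but does not push them.
        self._hits.extend(urls)
        if self._on_progress:
            self._on_progress(self, len(urls))

    def get_hits(self, max_hits):
        chunk = self._hits[self._cursor:self._cursor + max_hits]
        self._cursor += len(chunk)
        return chunk


status = {"announced": 0}  # what a statusbar would display
shown = []                 # hits the application has actually fetched


def on_progress(search, new_hits):
    status["announced"] += new_hits  # cheap, done on every signal
    if status["announced"] - len(shown) >= 4:
        shown.extend(search.get_hits(4))  # fetch only in chunks of four


search = ToySearch()
search.set_progress_handler(on_progress)
for batch in (["u1"], ["u2", "u3"], ["u4", "u5"]):
    search.engine_found(batch)
```

The counter is updated on all three signals, while search_get_hits() is
called only once.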

> >   session_set_property(session_handle,property_name,value)
> >
> >     Sets a property of the session. This might include default sort
> >     order, maximum number of hits (mostly as a hint to the search
> >     engine), minimum score, default set of attributes for hits in
> >     new searches, whether searches should live on (never considered
> >     finished but continue to generate new hits if new matching
> >     documents show up) and probably some other stuff.
> 
> 
> 
> The properties you mention sound more like properties of the query
> if you ask me...

Yes, you're right. It is probably possible to come up with better
examples of sensible properties for the session. (Something I use it for
is encryption keys, but that sure isn't needed in a simple API). The
reason I think it's a good idea to set these things (which are really
related to the query) in the session is that most applications will very
likely use the same values for each search. Therefore it's handy (but
less logical) to set them on the session.
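
One way this could look, sketched in Python. Copying the session
properties into each new search at creation time is my assumption
here (a live lookup would also be possible), and the property names
"sort_order" and "max_hits" are invented for the example.

```python
class Session:
    def __init__(self):
        self._props = {}

    def set_property(self, name, value):
        self._props[name] = value

    def get_property(self, name):
        return self._props.get(name)

    def new_search(self, query_string):
        return Search(self, query_string)


class Search:
    def __init__(self, session, query_string):
        self.query_string = query_string
        # Session properties act as defaults, copied when the search
        # is created; later changes to the session do not affect it.
        self._props = dict(session._props)

    def set_property(self, name, value):
        self._props[name] = value  # overrides the default for this search

    def get_property(self, name):
        return self._props.get(name)


session = Session()
session.set_property("sort_order", "score")
session.set_property("max_hits", 50)

a = session.new_search("foo")
b = session.new_search("bar")
b.set_property("sort_order", "date")  # only this search is affected
```

An application that always wants the same settings sets them once on
the session and never touches search_set_property().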

> >   value = session_get_property(session_handle,property_name)
> >
> >     Does the expected.
> >
> >   search_free(search_handle)
> >
> >     Frees all resources related to the search. The search handle
> >     becomes invalid afterwards.
> >
> >   search_is_finished(search_handle)
> >
> >     Checks if the search is finished yet.
> 
> 
> 
> Check, check, and check on those methods.
> 
> 
> >   search_get_number_of_total_hits_so_far(search_handle)
> >
> >     Gets the total number of hits this search resulted in (minus the
> >     ones discarded because of too low score of course). If the
> >     search finished signal handler has been set the search might not
> >     yet be finished and the number of hits so far is returned.
> >
> >   search_get_number_of_new_hits_so_far(search_handle)
> >
> >     Identical to the one above, but minus the number of hits already
> >     retrieved using search_get_hits() (or skipped using
> >     search_seek(), see below).
> >
> >   search_tell()
> >
> >     Tells how many hits have been retrieved so far.
> 
> 
> 
> The above three methods don't feel right... There seems to be some
> book keeping that could be done on the client side just as well.

With any two of them you could calculate the result of the third. And
the result of search_tell() could be calculated by counting the number
of hits retrieved. So strictly speaking only one of them is needed. But
I thought they could be provided for convenience. Perhaps they make
more sense in a library than in a protocol.
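
The bookkeeping a client (or a thin library) would have to do is
small. A Python sketch, where HitCounter and its method names are
hypothetical and only mirror the three functions discussed above:

```python
class HitCounter:
    """Client-side bookkeeping for one search."""

    def __init__(self):
        self.total = 0      # search_get_number_of_total_hits_so_far()
        self.retrieved = 0  # search_tell(), counted locally

    def engine_reported_total(self, total):
        # The only number that has to come from the engine.
        self.total = total

    def fetched(self, n):
        # Call this with the number of hits each fetch returned.
        self.retrieved += n

    @property
    def new(self):
        # search_get_number_of_new_hits_so_far() follows from the others.
        return self.total - self.retrieved


c = HitCounter()
c.engine_reported_total(10)
c.fetched(3)
c.fetched(4)
```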

> >   search_seek()
> >
> >     Moves the cursor in the search to either skip hits or go back
> >     to read them again (much like lseek(2)). Yes, the name is bad,
> >     I know (see disclaimer above). (Perhaps search_tell() and
> >     search_seek() can be replaced by a property.)
> 
> 
> 
> Is this method actually useful? I think it needs real good
> justification since it will introduce quite some work on the search
> engine side to support (correct me if I'm wrong).

The search engines I use keep the hits until the search-object is
destroyed. Most search engines are designed to work with
near-state-less applications like web-pages. But perhaps this shouldn't
be supported if there are search engines that toss the hits as soon as
they are retrieved.
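
For engines that do keep the hits around, the cursor semantics could
be as simple as the following Python sketch. HitCursor is a made-up
name, and borrowing os.SEEK_SET and os.SEEK_CUR for the "whence"
argument is just my assumption, by analogy with lseek(2).

```python
import os


class HitCursor:
    """Read-style access to hits that the engine keeps until freed."""

    def __init__(self, hits):
        self._hits = list(hits)
        self._pos = 0

    def get_hits(self, max_hits):
        chunk = self._hits[self._pos:self._pos + max_hits]
        self._pos += len(chunk)
        return chunk

    def tell(self):
        return self._pos

    def seek(self, offset, whence=os.SEEK_SET):
        base = 0 if whence == os.SEEK_SET else self._pos
        # Clamp so the cursor never leaves the list of known hits.
        self._pos = max(0, min(len(self._hits), base + offset))
        return self._pos


cur = HitCursor(["a", "b", "c", "d", "e"])
first_two = cur.get_hits(2)       # read two hits
cur.seek(2, os.SEEK_CUR)          # skip the next two
rest = cur.get_hits(2)            # only one hit is left
cur.seek(0)                       # go back to the start
again = cur.get_hits(1)
```

An engine that tosses hits once they are retrieved could still support
skipping forward, but not seeking backwards.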

> >   search_set_property(search_handle,property_name,value)
> >
> >     Sets a property of the search. This might include sort order
> >     (for remaining hits if some have already been retrieved), set of
> >     attributes for hits and probably some other stuff.
> >
> >   value = search_get_property(search_handle,property_name)
> >
> >     Does the expected.
> >
> 
> 
> Question 1 : Will it benefit the search engine to have a Session
> object for each connection? Then Query objects are spawned by a call
> like Magnus suggests; Query = NewQuery(Session, query_string)? Is it
> correct that applications don't need to care about sessions - just
> gimme the goddam query! ? :-)

If dbus is used then the connection IS the session (I guess without
knowing too much about dbus). If something else is used (for example a
library) then the session is an abstraction of something else.

> Question 2 : Should the results be returned with the HitsAdded
> signal? The Query object then has a Query.GetResults method to
> retrieve the results. This is closer to libbeagle and spotlight and
> the application only spends time retrieving hits when it really wants
> to. It does introduce some extra method calls though...

According to my proposal the HitsAdded signal is optional to use. The
application can choose to poll for hits too. As I explained above I
think it's best to not let the HitsAdded signal push the hits into the
application since it might not want all hits right away.

> In the http://wiki.freedesktop.org/wiki/WasabiSearchLive proposal the
> session and the query object are somewhat merged (since you can change
> a running query (restarting it)). I personally think it is rather
> elegant, but perhaps it is really just a mess.

According to my terminology it is the query-object and the search
object that are merged (the query object gets connected to a
search-object when it's started). By the way, is the query considered
to be running if it's finished? Couldn't the following methods
be combined into one for creating the search directly?

method NewQuery () 
method SetQueryXML (in s query_xml)
method Start () 

Could be just:

method NewSearch (in s query_xml)

* returns a new search object (equivalent to what you call a running
  query).
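
Sketched in Python, the combined call is just a thin wrapper around
the three steps. The QueryObject class below is a toy stand-in that I
made up; it is not the actual WasabiSearchLive interface, and the XML
string is only a placeholder.

```python
class QueryObject:
    """Toy model of the three-step style: NewQuery, SetQueryXML, Start."""

    def __init__(self):
        self.query_xml = None
        self.running = False

    def set_query_xml(self, query_xml):
        self.query_xml = query_xml

    def start(self):
        if self.query_xml is None:
            raise ValueError("no query set")
        self.running = True


def new_search(query_xml):
    """One-step style: returns an already running query, i.e. a search."""
    q = QueryObject()
    q.set_query_xml(query_xml)
    q.start()
    return q


search = new_search("<query>placeholder</query>")
```

With the one-step form there is never an object in the "created but
has no query yet" state, so start() cannot fail for that reason.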

And perhaps a snippet could be just another property of a hit, as
well as the URI. In that case your function GetProperties() could be
called GetHits() and be equivalent to my search_get_hits().

I will look closer at your updated proposal and post an update of
mine next week.