Simple search API proposal, take 2

Wed Jan 10 01:57:49 PST 2007

2007/1/10, Stefan.Kost at nokia.com <Stefan.Kost at nokia.com>:
> hi,
>
> I'd like to share one use-case where I am not sure if this is covered by
> the proposal already. Some document-centric apps save documents as
> archives containing some structured data (e.g. an XML file) and some
> binaries. They might also leave out the binaries and just keep external
> references. When loading such a file later, some external references
> might be missing. Now it would rock if the application could just ask
> the desktop search engine whether the files might simply have been
> renamed or moved somewhere else.
> In my app I thought about storing stuff like filesize, md5sum and
> mimetype together with the name. This could serve as an additional hint for
> the search. I believe filename + mimetype is quite good. If the external
> file has been updated in the meantime, both size and md5sum are likely
> to be different.

For the sake of concreteness let's consider an example to make sure I
understand correctly. Will a drawing program using svg's with refs to
external jpgs match your example?

If I understand correctly, what you want to do is track file movement
or changes to the jpgs between different sessions of the drawing
app.

If a jpg is moved while your app isn't running you want to detect the
new location of the jpg when you start up/load image, right? Sounds
like a neat idea anyway :-)

If the search engine indexes md5+filesize+mime, which I guess 99.99% of
the available ones do, you could, as you say, search these fields to
find the file again.
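
To make the idea concrete, here is a minimal Python sketch of the
client side of this. It is only an illustration: the function names
(fingerprint, rank_candidates) and the scoring weights are made up for
the example and are not part of any proposal, and the candidates are
assumed to come back from the search engine as plain property
dictionaries.

```python
import hashlib
import mimetypes
import os


def fingerprint(path):
    """Collect the hints an app could store next to an external reference."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large binaries are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            md5.update(chunk)
    return {
        "name": os.path.basename(path),
        "mime": mimetypes.guess_type(path)[0],
        "size": os.path.getsize(path),
        "md5": md5.hexdigest(),
    }


def rank_candidates(stored, candidates):
    """Return the candidates that match at least one hint, best match first.

    Hypothetical weights: identical content (md5 + size) counts most, since
    it indicates a moved or renamed file; name and mimetype still match when
    the file was updated in the meantime.
    """
    def score(candidate):
        points = 0
        if (candidate["md5"] == stored["md5"]
                and candidate["size"] == stored["size"]):
            points += 4
        if candidate["name"] == stored["name"]:
            points += 2
        if candidate["mime"] == stored["mime"]:
            points += 1
        return points

    scored = [(score(candidate), candidate) for candidate in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for points, candidate in scored if points > 0]
```

The app would call fingerprint() when saving the document, and
rank_candidates() on whatever the search engine returns when a
reference turns out to be missing at load time.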

> The proposal below looks like the search can be done without any ui
> popping up, which is good. Main open points I see are:
> * what will a query look like

A query will be an XML block in a yet-to-be-defined language. I have been
looking into a proposal for this, but I've yet to come up with
anything I actually like. Suggestions are welcome.

The only idea I have for something to base this on is RDF Query, which
can be found here: http://www.w3.org/TandS/QL/QL98/pp/rdfquery.html. It
is not exactly spot on for what we need though. It appears overly
complicated for desktop needs and it is oriented primarily towards
metadata retrieval, not actual free text search (with e.g. fuzzy and
proximity searches).

There is talk about defining a simple end-user language with a
Google/Beagle-like syntax too. The underlying system will use the XML
language though.

> * what will the results look like (uri grouped as exact matches and fuzzy
> matches)

Well. There will be support for setting a sort mechanism somehow. These
details aren't fleshed out either yet. Rest assured they are not
forgotten though.

Grouping of matches has not received much attention yet, and is
actually not exposed at all in the current API proposal. Perhaps it
should be - I don't know if the currently available search engines
support this at all though. I think grouping is done client-side in
most apps right now.
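
As a rough illustration of what client-side grouping could look like,
here is a small Python sketch. Everything in it is an assumption made
for the example: the function name group_hits, the "uri" key, and the
rule that a hit is "exact" when all wanted properties match and
"fuzzy" when only some do.

```python
def group_hits(hits, wanted):
    """Split hits into exact and fuzzy groups on the client side.

    hits   -- list of property dictionaries, each with a "uri" entry
    wanted -- dictionary of property values the application is looking for
    """
    groups = {"exact": [], "fuzzy": []}
    for hit in hits:
        matched = [key for key, value in wanted.items()
                   if hit.get(key) == value]
        if len(matched) == len(wanted):
            # Every requested property agrees.
            groups["exact"].append(hit["uri"])
        elif matched:
            # Only some of the requested properties agree.
            groups["fuzzy"].append(hit["uri"])
    return groups
```

For the missing-jpg case, wanted would hold the stored name, mimetype,
size and md5sum, so a renamed file lands in the fuzzy group and an
untouched copy in the exact group.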

Constructive input and proposals are most welcome!

Cheers,
Mikkel

> >-----Original Message-----
> >From: xdg-bounces at lists.freedesktop.org
> >[mailto:xdg-bounces at lists.freedesktop.org] On Behalf Of ext
> >Magnus Bergman
> >Sent: 04 January, 2007 16:14
> >To: xdg at lists.freedesktop.org
> >Subject: Simple search API proposal, take 2
> >
> >First some comments on the current draft[1]
> >"""""""""""""""""""""""""""""""""""""""""""
> >
> >  I think it's a bad idea to use a query-string to identify a search
> >  for the following reasons:
> >  * It is inefficient to send a (possibly quite long) string for every
> >    call.
> >  * It isn't logical for the search engine to use the query string to
> >    look up the search because a query might generate a different result
> >    depending on when the search is started.
> >  * An application might create different searches from the same query
> >    (string) with different results ("all files created this minute").
> >
> >  Because of these reasons I propose to provide a *search handle*
> >  (probably just an integer value) for each search that is created.
> >
> >  From what I read in the discussion it seems problematic to use URIs
> >  as persistent identifiers to identify a hit, both for the reasons
> >  already mentioned and because a hit is not the same thing as a
> >  document. Even if a URI was a persistent identifier for a document,
> >  it would be illogical to use it to identify a hit. And because of
> >  this and the reasons mentioned above it would be even worse to use a
> >  query string and a URI to identify a hit.
> >
> >  Instead I support the idea of simply using sequence numbers (and a
> >  search handle) to identify a hit.
> >
> >
> >
> >Highlighting, streaming and snippets
> >""""""""""""""""""""""""""""""""""""
> >
> >  It isn't clear what a snippet is exactly. But my guess is that it is
> >  a selected part or summary of the document that demonstrates
> >  especially well why it matched, possibly with highlighting. And it
> >  isn't stored in the index but dynamically generated. Correct?
> >
> >  I have brought up the question about a need for a document streaming
> >  infrastructure. But now I see that highlighting is to be supported,
> >  so document streaming seems to be needed anyway.
> >
> >  The highlighting cannot be done by the application, it must be done
> >  by the search engine. Just highlighting every word from the query
> >  string isn't correct. The knowledge from the search engine is needed
> >  to get it right. This means that to highlight a document (or a
> >  selected part of it) there is no other way than to stream the
> >  document through the search engine to the application.
> >
> >  If snippets are going to be supported it will be easy to also support
> >  delivering the whole document highlighted, and even easier to just
> >  deliver the whole document.
> >
> >  Streaming the document means to automatically convert it into a
> >  requested format (something that the indexer can extract words from
> >  or something that an application can show). Doing this is actually no
> >  big deal, doing the highlighting is the hard part.
> >
> >  The benefit of being able to stream documents like this is that the
> >  documents don't need to be accessible in a way an application can
> >  understand (they are not required to have a URI).
> >
> >  I don't say this is a feature we can't live without. But we
> >  practically get it for free if snippets are going to be supported.
> >
> >
> >
> >Properties for hits
> >"""""""""""""""""""
> >
> >  Hits are not the same thing as documents, so there are really both
> >  properties of the hits and properties of the document. The properties
> >  of the hits include information on why the document matched the query
> >  and a link to the matching document. This link might be kept secret by
> >  the search engine, but a URI might be provided as a property of the
> >  document. The properties of the document are of course the usual
> >  document metadata. Some of these might be stored in the search
> >  engine's index, some might be extracted from the document dynamically,
> >  but that doesn't matter. The properties belonging to the document (as
> >  well as the document itself) can be accessed independently of a
> >  search, the ones belonging to the hit can not.
> >
> >
> >
> >The actual proposal
> >"""""""""""""""""""
> >
> >ShowConfiguration ( )
> >
> >    Open a graphical interface for configuring the search tool.
> >
> >
> >NewSearch ( in s query , out i search )
> >
> >    Start a new search from a query string.
> >    * query: The query string to execute.
> >    * search: A handle that is used to uniquely identify this search.
> >
> >
> >CountHits ( in i search , out i count )
> >
> >    Count the number of hits from a particular search. Used for paging
> >    and suggestion popups with hit counts.
> >    * search: A handle that is used to uniquely identify a search.
> >    * count: The number of hits from this search.
> >
> >
> >GetHitProperties ( in i search, in i offset, in i limit,
> >                   in as properties, out a{sa{sas}} response )
> >
> >    Get properties for the given hits. URIs and snippets are just
> >    properties.
> >    * search: A handle that is used to uniquely identify a search.
> >    * offset: The offset in the result list for the first returned
> >              result.
> >    * limit: The maximum number of results that should be returned.
> >    * properties: A list of properties to return. An empty list is a
> >                  request for all properties.
> >    * response: A map from each hit (sequence number) to a map of
> >                property name to list of values.
> >
> >
> >
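
To check my reading of the three search calls above, here is a toy
in-memory version in Python. It is a sketch under my own assumptions,
not an implementation of the proposal: the class name ToySearchEngine,
the snake_case method names, and the substring match on a "title"
property are all invented for the example. Hit keys are strings
because the a{sa{sas}} signature uses string keys, so the sequence
number is stringified here.

```python
import itertools


class ToySearchEngine:
    """In-memory stand-in for NewSearch, CountHits and GetHitProperties."""

    def __init__(self, documents):
        # Each document is a dict mapping property name -> list of strings.
        self._documents = documents
        self._searches = {}
        self._handles = itertools.count(1)

    def new_search(self, query):
        """Run the query now and return an integer handle for this search."""
        hits = [doc for doc in self._documents
                if query.lower() in doc["title"][0].lower()]
        handle = next(self._handles)
        self._searches[handle] = hits
        return handle

    def count_hits(self, search):
        """Number of hits belonging to the search behind this handle."""
        return len(self._searches[search])

    def get_hit_properties(self, search, offset, limit, properties):
        """Map sequence numbers (as strings) to the requested properties.

        An empty properties list is a request for all properties.
        """
        hits = self._searches[search]
        response = {}
        for seq in range(offset, min(offset + limit, len(hits))):
            doc = hits[seq]
            names = properties or list(doc)
            response[str(seq)] = {name: doc[name]
                                  for name in names if name in doc}
        return response
```

Two calls to new_search() with the same query string return different
handles, and each handle keeps the hits found at the time it was
created. That is exactly the case where a query string alone could not
tell the two searches apart.
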
> >[1] http://wiki.freedesktop.org/wiki/WasabiSearchSimple
> >_______________________________________________
> >xdg mailing list
> >xdg at lists.freedesktop.org
> >http://lists.freedesktop.org/mailman/listinfo/xdg
> >