simple search api (was Re: mimetype standardisation by testsets)

Mon Nov 20 19:24:32 EET 2006

2006/11/20, Mikkel Kamstrup Erlandsen <mikkel.kamstrup at gmail.com>:
> 2006/11/19, Jos van den Oever <jvdoever at gmail.com>:
> > 2006/11/19, Mikkel Kamstrup Erlandsen <mikkel.kamstrup at gmail.com>:
> >
> > > org.freedesktop.search.simple.countHits ( in s query, out i count ):
> > >  Why is this necessary? Is it to accommodate use cases such as suggestion
> > > popups like below[1], and reduce the dbus wire traffic for multiple calls?
> > Not just that, it is also required for paging ( page 1, page 2, etc)
> > and lines like 'Found X hits.'
>
> Ah, of course :-)
>
> > > Type text in entry: net
> > > Get a popup with:
> > >             net   (7801)
> > >             network (6578)
> > >             netto (17)
> > No, this is another case. Here you see numbers for each keyword, but
> > where do you get the keywords from? For something like that you'd need
> > a call like:
> > expandWord(in s word, out a(si) wordlist)
>
> Well, it could be based on an expandQuery() method, but applications could
> also list old searches with hit counts. I.e., in the above example I would
> have searched for net, network and netto at some point in the past. There
> are numerous ways to obtain suggestions...
Agreed, so this is not important for this discussion.

> > > org.freedesktop.search.simple.query ( in s query, in i
> > > offset, in i limit, out as hits ):
> > >  What is the general consumer of this method? I don't see many. Only stuff
> > > like deskbar-applet or a general search tool would use it. Maybe add a
> > > parameter to specify a list of groups the hits should match (or maybe
> > > specify mimetypes). This argument could be "*" or something to get all
> > > kinds of results. I suggest changing the signature to:
> > > query ( in s query, in as groups, in i offset, in i limit, out as hits )
> > Interesting suggestion. It does make things quite a bit more
> > complicated, because you'd need to define the groups. We've not talked
> > about the query language yet (we need to, but I'm assuming we're
> > going to use something similar to what Beagle and Strigi already use,
> > which is almost the same), but you could also just expand the query like
> > this: "holiday" -> "holiday mimetype:video/*" before sending it to the
> > search-engine. That seems much better defined than a list of vaguely
> > termed groups. I do not object to having such names for the user to
> > see though.
>
>
> Yeah, there are a few decisions to make here: how much to put in the query
> language and how much to put in the API. I think an expressive query
> language is a good idea. However, your example above doesn't fit all cases
> well. What if I want to search for all "Documents" containing the word
> "parser"? The Documents group could, e.g., be files of mime types:
>
> application/msword
> application/pdf
> application/postscript
> application/vnd.ms-excel
> application/vnd.oasis.opendocument.text
> application/vnd.sun.xml.writer
> text/plain
> text/html
>
> Calling "parser mimetype:application/*" would probably not yield the desired
> results. Maybe also having an option to search like "parser group:documents"
> would be good? The Spotlight API has the notion of groups like this for
> example.
This notion of groups is very valuable for a nice user interface. It
is, however, not relevant for the simplest form of search engine. The
group designation of a file is usually not stored directly in the
database, but inferred from the mimetype. For complex groups the query
might look something like (application/msword OR application/pdf OR
...). Making such a list part of a search API makes it hard to agree
on the mimetypes. I do not oppose a wrapper API that knows about the
groups and expands a group-enabled query, but I don't think we should
put this in the simple API. The group(s) to which a file belongs is
just another type of (inferred) metadata and I don't think we should
treat it specially.
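
To make the wrapper idea a bit more concrete, here is a rough sketch in
Python of how a group-enabled query could be expanded before it is handed
to the simple API. Everything in it is an assumption for illustration: the
"group:" prefix, the OR syntax, the GROUPS table and the expand_groups
name. None of this has been agreed on, and the mimetypes are just the ones
listed above.

```python
import re

# Hypothetical group table; the mimetypes are the ones listed in this thread.
GROUPS = {
    "documents": [
        "application/msword",
        "application/pdf",
        "application/postscript",
        "application/vnd.ms-excel",
        "application/vnd.oasis.opendocument.text",
        "application/vnd.sun.xml.writer",
        "text/plain",
        "text/html",
    ],
}

# Matches terms such as "group:documents" (assumed syntax).
_GROUP_TERM = re.compile(r"\bgroup:(\w+)")


def expand_groups(query, groups=GROUPS):
    """Replace each group:NAME term with an OR-list of mimetype terms."""

    def replace(match):
        name = match.group(1)
        if name not in groups:
            # Unknown group: leave the term as it is for the engine to handle.
            return match.group(0)
        terms = " OR ".join("mimetype:" + m for m in groups[name])
        return "(" + terms + ")"

    return _GROUP_TERM.sub(replace, query)
```

A wrapper would call expand_groups("parser group:documents") and pass the
result on as the plain query string, so the simple API itself never needs
to know which groups exist.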

Cheers,
Jos