shared wasabi implementation

Sun Feb 18 12:15:49 PST 2007

2007/2/18, Joe Shaw <joeshaw at novell.com>:
>
> Hi,
>
> Mikkel Kamstrup Erlandsen wrote:
> > 2007/2/16, Joe Shaw <joeshaw at novell.com>:
> >     For people who want to index their data externally we provide
> >     an indexing service.  Apps can do one of two things: they can make an
> >     RPC call and pass in a document and metadata to be indexed, or they can
> >     drop the file into ~/.beagle/ToIndex with a control file that describes
> >     its metadata and Beagle will automatically index it.  (This latter
> >     method is how the Beagle Firefox extension works.)
> >
> > What kind of rpc is available?
>
> It's just the standard Beagle RPC mechanism (basically XML over a Unix
> domain socket).
>
> C#:
>
> http://svn.gnome.org/viewcvs/beagle/trunk/beagle/BeagleClient/IndexingService.cs?view=markup
>
> C:
>
> http://svn.gnome.org/viewcvs/beagle/trunk/beagle/libbeagle/beagle/beagle-indexing-service-request.h?view=markup

OK. If we are to standardize something like this, I assume we would use
D-Bus for RPC; as far as I can tell that shouldn't be a problem. For
example, a D-Bus API like:

 - AddFile (in as metadata, in s input_file)
 - AddText (in as metadata, in s text)

where the metadata argument contains things such as URI, MIME type, and hit
type in some specified order (and maybe some filtering/stemming info).
The AddFile method more or less replaces the "drop-in-special-dir" approach;
that approach could still be allowed for apps that don't talk D-Bus.
The AddText method should encapsulate the functionality of Beagle's current
IndexServiceRequest/Indexable duo.
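
To make the proposal a bit more concrete, here is a minimal sketch of the two
methods in Python. It is an in-memory stand-in, not a real D-Bus service: the
class name, the METADATA_FIELDS order (uri, mime, hit type) and the queue are
assumptions made for illustration only, and nothing here is taken from Beagle.

```python
from dataclasses import dataclass, field

# Assumed order of the "as" metadata argument; the proposal only says
# "some specified order", so this particular order is hypothetical.
METADATA_FIELDS = ("uri", "mime", "hit_type")


@dataclass
class IndexingServiceSketch:
    """In-memory stand-in for the proposed interface (not a D-Bus object)."""

    queue: list = field(default_factory=list)

    def _parse(self, metadata):
        # Reject metadata arrays that are too short to hold every field.
        if len(metadata) < len(METADATA_FIELDS):
            raise ValueError("metadata needs uri, mime and hit type, in order")
        return dict(zip(METADATA_FIELDS, metadata))

    def add_file(self, metadata, input_file):
        """AddFile (in as metadata, in s input_file)."""
        entry = self._parse(metadata)
        entry["source"] = ("file", input_file)
        self.queue.append(entry)

    def add_text(self, metadata, text):
        """AddText (in as metadata, in s text)."""
        entry = self._parse(metadata)
        entry["source"] = ("text", text)
        self.queue.append(entry)
```

A mail client would then call something like
add_text(["mailto:someone", "text/plain", "MailMessage"], body) for each
message, while a browser could write a page to disk and pass the path to
add_file instead.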

> > Dropping files in a special directory sounds like a thing that most
> > indexers could support. Perhaps this can be standardized. Is there a
> > place where I can find documentation/examples/code for this?
>
> Sure, the format is described in the comment at the top of the
> IndexingService backend file:
>
>
> http://svn.gnome.org/viewcvs/beagle/trunk/beagle/beagled/IndexingServiceQueryable/IndexingServiceQueryable.cs?view=markup

It seems like a compact form of what Jamie described a few mails back. If
the data flow is heavy, compactness is good, but I honestly have no clue
how much traffic there is for stuff like this...

> And an example implementation is in the (sorry, ugly) Firefox extension:
>
>
> http://svn.gnome.org/viewcvs/beagle/trunk/beagle/mozilla-extension/content/beagleOverlay.js?view=markup
>
> (the beagleWriteContent() and beagleWriteMetadata() methods)
>
> >     We could maybe create an external data source backend, but since the
> >     sources are so specific, all it would amount to would be calling some
> >     sort of script that did the crawling and used one of the two methods
> >     above to signal Beagle.  Unlike the external filters, there hasn't been
> >     any demand for it, and fitting it in to the scheduler so that it didn't
> >     peg the indexer or fill up the disk would be tough to do externally.
> >
> >
> > I'm not sure I understand what you are saying. Is it that polling many
> > external data source "handles" would be too heavy?
>
> Sorry, I didn't describe it very well.  The issue here is that in
> Beagle, we extract only a subset of the data at a time.  For example, on
> a 300,000 message mailbox we obviously can't process them all at the
> same time.  You'd want to have some similar sort of throttling for
> external data sources, but doing that means that you basically have to
> write code to do it.  If you have to write code anyway, why not just
> make it a proper data source?
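
Just to check that I understand the throttling part, here is a rough Python
sketch of "extract only a subset at a time" as an external crawler might do
it. The function names, the batch size and the pause are all made up for
illustration; this is not how Beagle's scheduler works.

```python
import time
from itertools import islice


def batches(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def crawl_throttled(items, submit, batch_size=100, pause=0.0):
    """Hand items to `submit` one batch at a time, pausing between batches.

    `submit` is a hypothetical callback standing in for whatever call
    passes one item to the indexer. Returns the number of items submitted.
    """
    total = 0
    for batch in batches(items, batch_size):
        for item in batch:
            submit(item)
        total += len(batch)
        # Give the indexer (and the disk) some breathing room.
        time.sleep(pause)
    return total
```

Even this toy version needs real code for pacing, which I guess is the point
being made above.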

By a data source you mean something that uses IndexServiceRequests and
Indexables?

> The other thing is that data sources often have to maintain state.  Our
> file system backend, for instance, has to have inotify watches for every
> directory it watches.  It has to maintain state to know which
> directories it has crawled.  It has to know about the directory tree so
> that it can handle moves correctly, etc.  To move all of these out of
> process, you'd essentially have to create yet another daemon.

OK, I get the picture now; thanks for clearing that up.

In many cases the "daemon" would be the browser or a mail client, which
keeps a lot of state anyway, so I don't see that as a big problem. In an
email client, for example, you can be pretty confident that other apps
don't mess with your data while you are not running, so there is no need to
"watch" the mailbox.

Cheers,
Mikkel