shared wasabi implementation

Joe Shaw joeshaw at
Sun Feb 18 08:56:03 PST 2007


Mikkel Kamstrup Erlandsen wrote:
> 2007/2/16, Joe Shaw <joeshaw at>:
>     For people who want to index their data externally we provide
>     an indexing service.  Apps can do one of two things: they can make an
>     RPC call and pass in a document and metadata to be indexed, or they can
>     drop the file into ~/.beagle/ToIndex with a control file that describes
>     its metadata and Beagle will automatically index it.  (This latter
>     method is how the Beagle Firefox extension works.)
> What kind of rpc is available?

It's just the standard Beagle RPC mechanism (basically XML over a Unix 
domain socket).
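
As a rough illustration of what "XML over a Unix domain socket" looks like from a client, here is a minimal Python sketch. It is not Beagle's actual wire format: the element names (IndexRequest, Content), the uri attribute, and the framing (closing the write side to mark the end of the message) are all assumptions made up for the example, and the socket path passed to connect() would be whatever the daemon really listens on.

```python
import socket
import xml.etree.ElementTree as ET


def build_request(uri: str, text: str) -> bytes:
    """Serialize a request as XML. Element names are hypothetical."""
    root = ET.Element("IndexRequest", {"uri": uri})
    ET.SubElement(root, "Content").text = text
    return ET.tostring(root, encoding="utf-8")


def connect(path: str) -> socket.socket:
    """Open a stream connection to a Unix domain socket at `path`."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(path)
    return sock


def send_request(sock: socket.socket, payload: bytes) -> None:
    """Send the whole payload, then half-close to signal end of message.

    The half-close framing is an assumption for this sketch only.
    """
    sock.sendall(payload)
    sock.shutdown(socket.SHUT_WR)
```

A caller would do something like send_request(connect(path), build_request(uri, text)) and then read the reply from the same socket.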

> Dropping files in a special directory sounds like a thing that most 
> indexers could support. Perhaps this can be standardized. Is there a 
> place where I can find documentation/examples/code for this?

Sure, the format is described in the comment at the top of the 
IndexingService backend file:

And an example implementation is in the (sorry, ugly) Firefox extension:

(the beagleWriteContent() and beagleWriteMetadata() methods)

>     We could maybe create an external data source backend, but since the
>     sources are so specific, all it would amount to would be calling some
>     sort of script that did the crawling and used one of the two methods
>     above to signal Beagle.  Unlike the external filters, there hasn't been
>     any demand for it, and fitting it in to the scheduler so that it didn't
>     peg the indexer or fill up the disk would be tough to do externally.
> I'm not sure I understand what you are saying. Is it that polling many 
> external data source "handles" would be too heavy?

Sorry, I didn't describe it very well.  The issue here is that in 
Beagle, we extract only a subset of the data at a time.  For example, on 
a 300,000 message mailbox we obviously can't process them all at the 
same time.  You'd want some similar sort of throttling for external 
data sources, but that basically means writing code to do it.  If you 
have to write code anyway, why not just make it a proper data source?
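
To make the throttling point concrete, here is a minimal sketch in Python (Beagle itself is not written in Python, and none of these names come from its code). It pulls items from a source in fixed-size batches and pauses between batches, so a huge mailbox is never processed all at once. The batch size and pause are arbitrary assumptions; a real scheduler would decide them based on system load.

```python
import time
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of at most `size` items, reading the source lazily."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def index_throttled(
    items: Iterable[T],
    index_one: Callable[[T], None],
    batch_size: int = 100,
    pause: float = 0.0,
) -> int:
    """Index items one batch at a time, sleeping `pause` seconds between
    batches. Returns the number of items indexed."""
    total = 0
    for chunk in batches(items, batch_size):
        for item in chunk:
            index_one(item)
            total += 1
        if pause:
            time.sleep(pause)  # give the rest of the system some room
    return total
```

Even this toy version needs code on the data source side to produce items lazily, which is the point above.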

The other thing is that data sources often have to maintain state.  Our 
file system backend, for instance, has to keep an inotify watch on every 
directory it monitors.  It has to maintain state to know which 
directories it has crawled.  It has to know about the directory tree so 
that it can handle moves correctly, etc.  To move all of this out of 
process, you'd essentially have to create yet another daemon.

