shared wasabi implementation
joeshaw at novell.com
Sun Feb 18 08:56:03 PST 2007
Mikkel Kamstrup Erlandsen wrote:
> 2007/2/16, Joe Shaw <joeshaw at novell.com>:
> > For people who want to index their data externally we provide
> > an indexing service. Apps can do one of two things: they can make an
> > RPC call and pass in a document and metadata to be indexed, or they can
> > drop the file into ~/.beagle/ToIndex with a control file that describes
> > its metadata and Beagle will automatically index it. (This latter
> > method is how the Beagle Firefox extension works.)
> What kind of rpc is available?
It's just the standard Beagle RPC mechanism (basically XML over a Unix
> Dropping files in a special directory sounds like a thing that most
> indexers could support. Perhaps this can be standardized. Is there a
> place where I can find documentation/examples/code for this?
Sure, the format is described in the comment at the top of the
IndexingService backend file:
And an example implementation is in the (sorry, ugly) Firefox extension:
(the beagleWriteContent() and beagleWriteMetadata() methods)
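
To make the drop-directory idea concrete, here is a rough Python sketch of what a client might do. It is not the Beagle format: the sidecar layout (a hidden ".<name>" file of key=value lines), the write order, and the function name drop_for_indexing are all placeholders I made up for illustration. The real control-file layout is the one described in the IndexingService comment mentioned above.

```python
import os
import tempfile
from pathlib import Path


def drop_for_indexing(drop_dir, name, content, metadata):
    """Write `content` plus a sidecar control file into `drop_dir`.

    ASSUMPTION: the sidecar layout (hidden ".<name>" file holding
    key=value lines) is a made-up placeholder, not Beagle's real format.
    """
    drop_dir = Path(drop_dir)
    drop_dir.mkdir(parents=True, exist_ok=True)
    control_text = "".join(f"{key}={value}\n" for key, value in metadata.items())

    def write_atomically(target, text):
        # Write to a temp file in the same directory, then rename it into
        # place, so a directory watcher never sees a half-written file.
        fd, tmp_path = tempfile.mkstemp(dir=str(drop_dir), prefix=".tmp-")
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        os.replace(tmp_path, target)

    control_path = drop_dir / f".{name}"
    content_path = drop_dir / name
    # ASSUMPTION: control file first, so the metadata is already present
    # when the content file appears.
    write_atomically(control_path, control_text)
    write_atomically(content_path, content)
    return content_path, control_path
```

The rename-into-place step is the part most likely to carry over to any indexer that watches a directory; everything about the metadata layout would need to be replaced with the real format.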
> > We could maybe create an external data source backend, but since the
> > sources are so specific, all it would amount to would be calling some
> > sort of script that did the crawling and used one of the two methods
> > above to signal Beagle. Unlike the external filters, there hasn't been
> > any demand for it, and fitting it into the scheduler so that it didn't
> > peg the indexer or fill up the disk would be tough to do externally.
> I'm not sure I understand what you are saying. Is it that polling many
> external data source "handles" would be too heavy?
Sorry, I didn't describe it very well. The issue here is that in
Beagle, we extract only a subset of the data at a time. For example, on
a 300,000 message mailbox we obviously can't process them all at the
same time. You'd want to have some similar sort of throttling for
external data sources, but doing that means that you basically have to
write code to do it. If you have to write code anyway, why not just
make it a proper data source?
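
As an illustration of the kind of throttling code I mean, here is a minimal Python sketch that pulls items from a source in fixed-size batches and pauses between them. It is not Beagle's scheduler; the names in_batches and index_throttled, the batch size, and the pause are all hypothetical.

```python
import time
from itertools import islice


def in_batches(items, batch_size):
    """Yield successive lists of at most `batch_size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def index_throttled(items, index_one, batch_size=100, pause_seconds=0.0):
    """Feed `items` to `index_one` a batch at a time, pausing in between.

    `index_one` stands in for whatever hands a single item to the
    indexer; it is a hypothetical callback, not a Beagle API.
    """
    processed = 0
    for batch in in_batches(items, batch_size):
        for item in batch:
            index_one(item)
            processed += 1
        # Give the rest of the system a chance before the next batch.
        time.sleep(pause_seconds)
    return processed
```

Because in_batches consumes an iterator lazily, a 300,000 message mailbox never has to be held in memory at once; only the current batch is.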
The other thing is that data sources often have to maintain state. Our
file system backend, for instance, has to have inotify watches for every
directory it watches. It has to maintain state to know which
directories it has crawled. It has to know about the directory tree so
that it can handle moves correctly, etc. To move all of these out of
process, you'd essentially have to create yet another daemon.
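
To give a feel for the state involved, here is a small Python sketch that remembers which directories have been crawled and rewrites that record when a directory is moved. It is only an illustration: the CrawlState class is hypothetical, it does no inotify handling at all, and it is not how the Beagle file system backend is written.

```python
from pathlib import PurePosixPath


class CrawlState:
    """Remember crawled directories and keep the record valid across moves."""

    def __init__(self):
        self.crawled = set()

    def mark_crawled(self, directory):
        self.crawled.add(PurePosixPath(directory))

    def is_crawled(self, directory):
        return PurePosixPath(directory) in self.crawled

    def handle_move(self, old, new):
        """Rewrite every recorded path at or below `old` to live under `new`."""
        old, new = PurePosixPath(old), PurePosixPath(new)
        updated = set()
        for path in self.crawled:
            if path == old or old in path.parents:
                updated.add(new / path.relative_to(old))
            else:
                updated.add(path)
        self.crawled = updated
```

Even this toy version has to live for as long as the crawl does, which is the reason an out-of-process data source ends up looking like another daemon.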