2007/2/18, Joe Shaw <<a href="mailto:joeshaw@novell.com">joeshaw@novell.com</a>>:<div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"> Hi, Mikkel Kamstrup Erlandsen wrote: > 2007/2/16, Joe Shaw <<a href="mailto:joeshaw@novell.com">joeshaw@novell.com</a>>: >     For people who want to index their data externally we provide >     an indexing service.  Apps can do one of two things: they can make an >     RPC call and pass in a document and metadata to be indexed, or they can >     drop the file into ~/.beagle/ToIndex with a control file that describes >     its metadata and Beagle will automatically index it.  (This latter >     method is how the Beagle Firefox extension works.) > > What kind of rpc is available? It's just the standard Beagle RPC mechanism (basically XML over a Unix domain socket). C#: <a href="http://svn.gnome.org/viewcvs/beagle/trunk/beagle/BeagleClient/IndexingService.cs?view=markup">http://svn.gnome.org/viewcvs/beagle/trunk/beagle/BeagleClient/IndexingService.cs?view=markup </a> C: <a href="http://svn.gnome.org/viewcvs/beagle/trunk/beagle/libbeagle/beagle/beagle-indexing-service-request.h?view=markup">http://svn.gnome.org/viewcvs/beagle/trunk/beagle/libbeagle/beagle/beagle-indexing-service-request.h?view=markup </a></blockquote><div> OK. If we are to standardize something like this, I would assume that we use D-Bus for RPC; as far as I can tell that doesn't seem to be a problem? For example, a D-Bus API like:  - AddFile (in as metadata, in s input_file)  - AddText (in as metadata, in s text) where the metadata argument contains things such as URI, MIME type, and hit type in some specified order (and maybe some filtering/stemming/whatnot info). The AddFile method more or less replaces the "drop-in-special-dir" approach; the drop-in-special-dir method could still be allowed for apps not talking D-Bus. 
The AddText method should encapsulate the functionality of Beagle's current IndexServiceRequest/Indexable duo. </div> <blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">> Dropping files in a special directory sounds like a thing that most > indexers could support. Perhaps this can be standardized. Is there a > place where I can find documentation/examples/code for this? Sure, the format is described in the comment at the top of the IndexingService backend file: <a href="http://svn.gnome.org/viewcvs/beagle/trunk/beagle/beagled/IndexingServiceQueryable/IndexingServiceQueryable.cs?view=markup">http://svn.gnome.org/viewcvs/beagle/trunk/beagle/beagled/IndexingServiceQueryable/IndexingServiceQueryable.cs?view=markup </a></blockquote><div> It seems like a compact form of what Jamie described a few mails back. If there is a heavy data flow, compactness is good, but I honestly have no clue how much traffic there is for stuff like this... </div> <blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">And an example implementation is in the (sorry, ugly) Firefox extension: <a href="http://svn.gnome.org/viewcvs/beagle/trunk/beagle/mozilla-extension/content/beagleOverlay.js?view=markup">http://svn.gnome.org/viewcvs/beagle/trunk/beagle/mozilla-extension/content/beagleOverlay.js?view=markup </a> (the beagleWriteContent() and beagleWriteMetadata() methods) >     We could maybe create an external data source backend, but since the >     sources are so specific, all it would amount to would be calling some >     sort of script that did the crawling and used one of the two methods >     above to signal Beagle.  Unlike the external filters, there hasn't been >     any demand for it, and fitting it into the scheduler so that it didn't >     peg the indexer or fill up the disk would be tough to do externally. 
> > > I'm not sure I understand what you are saying. Is it that polling many > external data source "handles" would be too heavy? Sorry, I didn't describe it very well.  The issue here is that in Beagle, we extract only a subset of the data at a time.  For example, on a 300,000 message mailbox we obviously can't process them all at the same time.  You'd want to have some similar sort of throttling for external data sources, but doing that means that you basically have to write code to do it.  If you have to write code anyway, why not just make it a proper data source?</blockquote><div> By a data source you mean something that uses IndexServiceRequests and Indexables?  </div> <blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"> The other thing is that data sources often have to maintain state.  Our file system backend, for instance, has to have inotify watches for every directory it watches.  It has to maintain state to know which directories it has crawled.  It has to know about the directory tree so that it can handle moves correctly, etc.  To move all of these out of process, you'd essentially have to create yet another daemon.</blockquote><div> OK, I get the picture now, thanks for clearing it up. In many cases the "daemon" would be the browser or a mail client, which keeps a lot of state anyway, so I don't see that as a big problem. In an email client, for example, you can be pretty confident that other apps don't mess with your data while you are not running, so there is no need to "watch" the mailbox. Cheers, Mikkel </div> </div>