[Aperture-devel] Proposal for extending the DataSource interface

Thu Jun 14 06:39:08 PDT 2007

I take the liberty of sending this answer to the aperture-devel, 
freedesktop.org and Nepomuk Taskforce Ontologies mailing lists. These 
are the three groups currently working on ontologies for the semantic 
desktop. All feedback is invaluable.

Leo Sauermann wrote:
> Hi Chris,
> 
> for short: crawlers are not the only way to get the data, we want to be 
> open to synchronisers that watch the datasource for changes and do 
> notifications.
> 
> the described way was the solution we had in gnowsis, where we used HTTP 
> uris to identify DataObjects that do not come from a web-server.
> 
> A constructive solution to the problem would involve a way to use HTTP 
> uris for dataobjects that are not necessarily http accessible,
> and for URNs. (urn:isbn:123123-123123-12312)
> 
> no clue how to solve it otherway antoni said, any ideas welcome.
> 

And this is another thing we're investigating at the moment. The fifth 
draft of NIE (due to appear this week) revolves around the separation 
between data and information. Basically, each DataObject is expected to 
have TWO types. The first one is a representation type, one of the 
subclasses of DataObject (FileEntity, Attachment, ArchiveItem, 
ContactListItem, MailboxItem...); the second one is the interpretation 
type (Folder, Archive, FilesystemImage, Message, Contact etc.).
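
A minimal sketch of the two-type idea, in Python for brevity. The 
DataObject class below is a hypothetical stand-in, not the Aperture 
class, the type names are simply copied from the lists above, and the 
URIs are made up for illustration.

```python
from dataclasses import dataclass

# Illustrative type names copied from the lists above; the real NIE
# draft may define more of them or name them differently.
REPRESENTATION_TYPES = {
    "FileEntity", "Attachment", "ArchiveItem", "ContactListItem", "MailboxItem",
}
INTERPRETATION_TYPES = {
    "Folder", "Archive", "FilesystemImage", "Message", "Contact",
}


@dataclass(frozen=True)
class DataObject:
    """Hypothetical stand-in for a DataObject; not the Aperture class."""
    uri: str
    representation: str  # how the bytes are stored
    interpretation: str  # what the content means

    def __post_init__(self):
        # Reject type names that are not in the illustrative lists.
        if self.representation not in REPRESENTATION_TYPES:
            raise ValueError(f"unknown representation type: {self.representation}")
        if self.interpretation not in INTERPRETATION_TYPES:
            raise ValueError(f"unknown interpretation type: {self.interpretation}")


# Made-up URIs: the same interpretation reached via two representations.
eml_file = DataObject("file:///home/user/mail/hello.eml", "FileEntity", "Message")
attachment = DataObject("file:///home/user/mail/inbox.eml#1", "Attachment", "Message")
```

Both objects are Messages, so anything that consumes messages can treat 
them alike, whichever representation they came from.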

This design has been inspired by the work currently done by the XESAM 
project [1]. They want to unify the metadata used by major open-source 
desktop search systems (Strigi, Beagle, Tracker, Pinot, Recoll).

It also allows for much flexibility. A file may be interpreted as a 
Mailbox (like the Thunderbird one), a Message (.eml) or a Contact 
(.vcf). An Attachment can have the same interpretation as a file, and 
it is possible to have a CD image file and interpret it as a 
Filesystem. It will make the description of data orthogonal to the 
design of Aperture (e.g. messages extracted from a file-based mailbox 
will look the same regardless of whether we have a mailbox extractor 
extending the filesystem crawler, or a dedicated file mailbox crawler 
backed by a FileMailboxDataSource). A more detailed description will 
come with the NIE draft 5 specification. The way I see it, it would fit 
into Aperture quite well, without any architecture changes.

The problem Leo speaks about would take this separation even further, 
so that basically each file can yield TWO resources: the representation 
(whose URI will begin with file://) and the interpretation (with 
urn:isbn, or urn:doi, or urn:messageId ...). We are aware that this idea 
would hardly fit into the current Aperture architecture. The XESAM 
people will have their doubts too.

We nevertheless think that it would be 'right' from the semantic point 
of view. You could annotate a doi:// item regardless of its 
representation (file, http), and the annotations would not need to 
reflect changes in the representation (e.g. a file being moved or 
copied). What's more, if I send the annotations to someone else, they 
will remain valid on his/her computer even if he/she has the same file 
somewhere else, or doesn't have it at all.
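
To make the argument concrete, here is a rough sketch (in Python, for 
brevity) of annotations attached to the interpretation URI rather than 
to the file. AnnotationStore and all of its methods are hypothetical 
names invented for this example, and the URIs are made up; nothing 
here is part of Aperture or NIE.

```python
from collections import defaultdict


class AnnotationStore:
    """Hypothetical store: annotations hang off the interpretation URI."""

    def __init__(self):
        # representation URI (file://, http://) -> interpretation URI (urn:...)
        self._interpretation_of = {}
        # interpretation URI -> list of annotation strings
        self._annotations = defaultdict(list)

    def register(self, representation_uri, interpretation_uri):
        self._interpretation_of[representation_uri] = interpretation_uri

    def move(self, old_uri, new_uri):
        # Only the representation mapping changes; annotations are untouched.
        self._interpretation_of[new_uri] = self._interpretation_of.pop(old_uri)

    def annotate(self, interpretation_uri, text):
        self._annotations[interpretation_uri].append(text)

    def annotations_for(self, representation_uri):
        interpretation = self._interpretation_of.get(representation_uri)
        return list(self._annotations.get(interpretation, []))


store = AnnotationStore()
book = "urn:isbn:0306406152"
store.register("file:///home/anna/books/book.pdf", book)
store.register("http://example.org/books/book.pdf", book)
store.annotate(book, "chapter 3 is relevant")

# Moving the file does not invalidate the annotation.
store.move("file:///home/anna/books/book.pdf", "file:///home/anna/archive/book.pdf")
moved = store.annotations_for("file:///home/anna/archive/book.pdf")
remote = store.annotations_for("http://example.org/books/book.pdf")
```

The annotation is found through both the moved file and the http copy, 
and it would still be meaningful on a machine that holds neither.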

It's a research topic:
- how many types of content have such unique identifiers (doi, isbn, 
uuid, messageId)
- what percentage of all desktop resources has those identifiers
- what use cases would be made possible by such a separation
- do any real people actually need those use cases
- how difficult would it be to implement it (e.g. is it possible to 
extract the ISBN from a PDF without any complicated NLP heuristics)
- etc...
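
On the implementation question, a rough sketch of one non-NLP 
heuristic, assuming the plain text has already been extracted from the 
PDF by some other means: look for a number labelled "ISBN" and keep it 
only if its check digit is valid. The function names are invented for 
this example, only hyphen-separated or unseparated numbers are handled, 
and stripping the hyphens in the resulting URN is a choice made here 
for simplicity.

```python
import re

# An "ISBN" label (optionally "ISBN-10"/"ISBN-13"), then digits and hyphens.
_ISBN_CANDIDATE = re.compile(r"ISBN(?:-1[03])?:?\s*([0-9][0-9Xx\-]{8,15}[0-9Xx])")


def _valid_isbn10(digits):
    # Weights 10..1, 'X' stands for 10 in the last position, sum divisible by 11.
    if len(digits) != 10 or not digits[:9].isdigit():
        return False
    if not (digits[9].isdigit() or digits[9] == "X"):
        return False
    total = sum((10 - i) * (10 if c == "X" else int(c)) for i, c in enumerate(digits))
    return total % 11 == 0


def _valid_isbn13(digits):
    # Alternating weights 1 and 3, sum divisible by 10.
    if len(digits) != 13 or not digits.isdigit():
        return False
    total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(digits))
    return total % 10 == 0


def extract_isbn_urns(text):
    """Return urn:isbn: URIs for every labelled ISBN with a valid check digit."""
    found = []
    for match in _ISBN_CANDIDATE.finditer(text):
        digits = match.group(1).replace("-", "").upper()
        if _valid_isbn10(digits) or _valid_isbn13(digits):
            urn = "urn:isbn:" + digits
            if urn not in found:
                found.append(urn)
    return found


sample = (
    "First edition, ISBN 0-306-40615-2, printed 1980. "
    "Also available as ISBN-13: 978-0-306-40615-7. "
    "Misprinted elsewhere as ISBN 0-306-40615-3."
)
urns = extract_isbn_urns(sample)
```

This says nothing about how often such a label actually appears in 
real documents, which is exactly the kind of number the questions 
above ask for.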

Coming up with a solution that will satisfy everyone requires 
discussion. All feedback is invaluable. Please write what you think 
about it.

Antoni Mylka
antoni.mylka at dfki.de

[1] http://freedesktop.org/wiki/XesamAbout