[Aperture-devel] Proposal for extending the DataSource interface
antoni.mylka at dfki.uni-kl.de
Thu Jun 14 06:39:08 PDT 2007
I take the liberty of sending this answer to aperture-devel,
freedesktop.org and Nepomuk Taskforce Ontologies mailing lists. These
are three groups currently working on ontologies for the semantic
desktop. All feedback is invaluable.
Leo Sauermann wrote:
> Hi Chris,
> for short: crawlers are not the only way to get the data, we want to be
> open to synchronisers that watch the datasource for changes and do
> the described way was the solution we had in gnowsis, where we used HTTP
> uris to identify DataObjects that do not come from a web-server.
> A constructive solution to the problem would involve a way to use HTTP
> uris for dataobjects that are not necessarily http accessible,
> and for URNs. (urn:isbn:123123-123123-12312)
> no clue how to solve it another way, Antoni said; any ideas welcome.
And this is another thing we're investigating at the moment. The fifth
draft of NIE (due to appear this week) revolves around the separation
between data and information. Basically, each DataObject is expected to
have TWO types. The first is a representation type, one of the
subclasses of DataObject (FileEntity, Attachment, ArchiveItem,
ContactListItem, MailboxItem...); the second is the interpretation
type (Folder, Archive, FilesystemImage, Message, Contact etc.).
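To make the two-type idea concrete, here is a minimal Java sketch. It is
only an illustration: TypedDataObject and both enums are hypothetical
names, not part of NIE or of the Aperture API. Only the type names
themselves come from the lists above.

```java
import java.util.Objects;

/** Hypothetical sketch of a DataObject carrying TWO types; not the NIE or Aperture API. */
final class TypedDataObject {
    /** How the data is stored (type names taken from the NIE examples in this mail). */
    enum Representation { FILE_ENTITY, ATTACHMENT, ARCHIVE_ITEM, CONTACT_LIST_ITEM, MAILBOX_ITEM }

    /** What the data means once interpreted. */
    enum Interpretation { FOLDER, ARCHIVE, FILESYSTEM_IMAGE, MESSAGE, CONTACT }

    final String uri;
    final Representation representation;
    final Interpretation interpretation;

    TypedDataObject(String uri, Representation representation, Interpretation interpretation) {
        this.uri = Objects.requireNonNull(uri);
        this.representation = Objects.requireNonNull(representation);
        this.interpretation = Objects.requireNonNull(interpretation);
    }

    public static void main(String[] args) {
        // An .eml file on disk and a forwarded message inside another mail:
        // different representations, same interpretation. Both URIs are made up.
        TypedDataObject emlFile = new TypedDataObject(
                "file:///home/user/mail/hello.eml",
                Representation.FILE_ENTITY, Interpretation.MESSAGE);
        TypedDataObject forwarded = new TypedDataObject(
                "example:attachment-1",
                Representation.ATTACHMENT, Interpretation.MESSAGE);

        System.out.println(emlFile.interpretation == forwarded.interpretation);
        System.out.println(emlFile.representation == forwarded.representation);
    }
}
```

The point of the sketch is that code consuming Messages only needs to look
at the interpretation field and can ignore where the bytes came from.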
This design has been inspired by the work currently done by the XESAM
project. They want to unify the metadata used by major open-source
desktop search systems (Strigi, Beagle, Tracker, Pinot, Recoll).
It also allows for a lot of flexibility. A file may be interpreted as a
Mailbox (like the Thunderbird one), a Message (.eml) or a Contact
(.vcf). An Attachment can have the same interpretation as a file, and it
is possible to take a CD image file and interpret it as a Filesystem. It
will make the description of data orthogonal to the design of Aperture
(e.g. messages extracted from a file-based mailbox will look the same
regardless of whether we have a mailbox extractor extending the
filesystem crawler, or a dedicated file mailbox crawler backed by a
FileMailboxDataSource). A more detailed description will come with the
NIE draft 5 specification. The way I see it, it would fit into Aperture
quite well, without any architecture changes.
The problem Leo speaks about is how to take this separation even
further, so that each file can yield TWO resources: the representation
(whose uri will begin with file://) and the interpretation (with
urn:isbn, or urn:doi, or urn:messageId ...). We are aware that this idea
would hardly fit into the current Aperture architecture. The XESAM
people will have their doubts too.
Nevertheless, we think that it would be 'right' from the semantic point
of view. You could annotate a doi:// item regardless of its
representation (file, http), and the annotations would not need to
reflect changes in the representation (e.g. a file has been moved or
copied). What's more, if I send the annotations to someone else, they
will remain valid on his/her computer even if he/she has the same file
somewhere else, or doesn't have it at all.
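A rough Java sketch of why annotations on the interpretation survive
changes to the representation. Everything here is an assumption made for
illustration: AnnotationStoreSketch, its methods and the example URIs are
hypothetical and do not exist in Aperture.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: annotations hang on the interpretation URI, not on the file URI. */
final class AnnotationStoreSketch {
    // representation URI (e.g. file://...) -> interpretation URI (e.g. urn:isbn:...)
    private final Map<String, String> interpretationOf = new HashMap<>();
    // interpretation URI -> annotations
    private final Map<String, List<String>> annotations = new HashMap<>();

    void register(String representationUri, String interpretationUri) {
        interpretationOf.put(representationUri, interpretationUri);
    }

    /** A move only touches the representation mapping; annotations stay where they are. */
    void moveRepresentation(String oldUri, String newUri) {
        String interpretation = interpretationOf.remove(oldUri);
        if (interpretation != null) {
            interpretationOf.put(newUri, interpretation);
        }
    }

    void annotate(String interpretationUri, String note) {
        annotations.computeIfAbsent(interpretationUri, k -> new ArrayList<>()).add(note);
    }

    List<String> annotationsFor(String representationUri) {
        String interpretation = interpretationOf.get(representationUri);
        if (interpretation == null) {
            return Collections.emptyList();
        }
        return annotations.getOrDefault(interpretation, Collections.emptyList());
    }

    public static void main(String[] args) {
        AnnotationStoreSketch store = new AnnotationStoreSketch();
        store.register("file:///home/alice/book.pdf", "urn:isbn:0306406152");
        store.annotate("urn:isbn:0306406152", "worth reading");
        store.moveRepresentation("file:///home/alice/book.pdf", "file:///home/alice/read/book.pdf");
        System.out.println(store.annotationsFor("file:///home/alice/read/book.pdf"));
    }
}
```

Sending the annotations to someone else would then amount to sending the
second map only; the recipient's own file:// mapping, if any, is separate.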
It's a research topic:
- how many types of content have such unique identifiers (doi, isbn,
- what percentage of all desktop resources has those identifiers
- what use cases would be made possible by such a separation
- do any real people actually need those use cases
- how difficult would it be to implement it (e.g. is it possible to
extract the ISBN from a PDF without any complicated NLP heuristics).
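On the last question, a simple heuristic can already be sketched without
any NLP: look for the literal "ISBN" label in the text and keep only
candidates whose check digit is valid. The sketch below assumes the text
has already been extracted from the PDF by some other component;
IsbnScanSketch and its methods are hypothetical names, and the pattern is
a guess at common print conventions, not a complete solution.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: find labelled ISBNs in plain text and verify their check digits. */
final class IsbnScanSketch {
    // "ISBN", optional "-10"/"-13", optional colon, then digits mixed with hyphens or spaces.
    private static final Pattern CANDIDATE = Pattern.compile(
            "ISBN(?:-1[03])?:?\\s*([0-9][0-9 -]{8,15}[0-9Xx])");

    static List<String> findIsbns(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            // Strip separators, then accept the candidate only if a checksum matches.
            String digits = m.group(1).replaceAll("[ -]", "").toUpperCase();
            if (isValidIsbn10(digits) || isValidIsbn13(digits)) {
                found.add(digits);
            }
        }
        return found;
    }

    /** ISBN-10: weights 10..1, sum divisible by 11; 'X' stands for 10 in the last position. */
    static boolean isValidIsbn10(String s) {
        if (!s.matches("[0-9]{9}[0-9X]")) return false;
        int sum = 0;
        for (int i = 0; i < 10; i++) {
            char c = s.charAt(i);
            int value = (c == 'X') ? 10 : c - '0';
            sum += (10 - i) * value;
        }
        return sum % 11 == 0;
    }

    /** ISBN-13: alternating weights 1 and 3, sum divisible by 10. */
    static boolean isValidIsbn13(String s) {
        if (!s.matches("[0-9]{13}")) return false;
        int sum = 0;
        for (int i = 0; i < 13; i++) {
            int value = s.charAt(i) - '0';
            sum += (i % 2 == 0) ? value : 3 * value;
        }
        return sum % 10 == 0;
    }

    public static void main(String[] args) {
        String text = "ISBN 0-306-40615-2, also ISBN-13: 978-0-306-40615-7. Bogus: ISBN 1-234-56789-0.";
        for (String isbn : findIsbns(text)) {
            System.out.println("urn:isbn:" + isbn);
        }
    }
}
```

This only catches ISBNs that are printed with a label and says nothing
about how often that is the case, which is exactly the open question.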
Coming up with a solution that will satisfy everyone requires
discussion. All feedback is invaluable. Please write what you think.
antoni.mylka at dfki.de