[XESAM] Ontology sketch. Feedback needed. This time with attachment.
antoni.mylka at dfki.uni-kl.de
Thu May 31 02:50:24 PDT 2007
For those of you who don't know me: I'm currently working on a desktop
ontology for the Nepomuk project (Nepomuk Information Element
Ontology). The current draft is available at .
Overall: Mikkel Kamstrup has already noticed that the notation used is
not typical. The "Classes" are not actually RDFS classes but "property
categories". Otherwise, the distinction you made between a File and
Content means that these are two separate entities. Could you elaborate
a bit more?
Evgeny Egorochkin wrote:
> Hi all,
> I'd like you to take a look at the ontology sketch
> It's not complete. Some fields/classes are dropped intentionally.
> I'd like to hear some feedback first.
> Points of interest:
> *** Sources
> *Source hierarchy
> *Which properties belong to content and which to source?
Your understanding of a source seems different from mine. In Nepomuk a
source is something where data appears, is modified and disappears.
Sources are monitored for data. DataObjects are extracted from sources.
Each data source is associated with a listener that responds to the
following events: a new DataObject appeared, a DataObject has been
modified, a DataObject has been deleted. That's why a Filesystem is a
source, because files appear and disappear from it. A mailbox is a
source because emails may appear or disappear from it. Likewise with an
Addressbook and Contacts, or Calendar and Events. These distinctions are
arbitrary and depend on the intended usage (e.g. an addressbook file
may just as well be treated as an ordinary file in a filesystem).
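
To make the source/listener idea concrete, here is a minimal Python
sketch. It is not the Nepomuk or Aperture API: the class names, the
sync() method and the event strings "new", "modified" and "deleted" are
all assumptions chosen for illustration. A source is modelled as
something that is diffed against its last known state, and every
difference is reported to the registered listeners.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DataObject:
    """One item extracted from a source (a file, an email, a table row)."""
    uri: str
    properties: Dict[str, str] = field(default_factory=dict)


# Hypothetical listener signature: (event name, affected object).
Listener = Callable[[str, DataObject], None]


class DataSource:
    """A place where DataObjects appear, are modified and disappear."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._known: Dict[str, Dict[str, str]] = {}
        self._listeners: List[Listener] = []

    def add_listener(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def _notify(self, event: str, obj: DataObject) -> None:
        for listener in self._listeners:
            listener(event, obj)

    def sync(self, snapshot: Dict[str, Dict[str, str]]) -> None:
        """Diff a fresh snapshot against the last known state."""
        for uri, props in snapshot.items():
            if uri not in self._known:
                self._notify("new", DataObject(uri, dict(props)))
            elif self._known[uri] != props:
                self._notify("modified", DataObject(uri, dict(props)))
        for uri, props in self._known.items():
            if uri not in snapshot:
                self._notify("deleted", DataObject(uri, dict(props)))
        # Remember the snapshot for the next comparison.
        self._known = {uri: dict(props) for uri, props in snapshot.items()}


# Usage: a mailbox treated as a source, emails as data objects.
events = []
mailbox = DataSource("mailbox")
mailbox.add_listener(lambda event, obj: events.append((event, obj.uri)))
mailbox.sync({"msg:1": {"subject": "hello"}, "msg:2": {"subject": "lunch?"}})
mailbox.sync({"msg:1": {"subject": "hello again"}})
print(events)
```

The same class could stand for a filesystem, an addressbook or a
database table; only the code producing the snapshot would differ.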
You seem to treat "source" as a physical representation of the content.
I think it's very different. Setting the Content as the central concept
in this ontology raises the level of abstraction considerably. In
Nepomuk I tried to remain on the lower level. The content is an
attribute of a file; a file is not merely a representation of the
Content. It's possible that the design decisions made by the founding
fathers of the Aperture project (also exemplified in products like
AutoFocus and Aduna Metadata server) have influenced this
design, but that's a conscious decision. Staying on the lower level may
limit the expressivity (e.g. it's more difficult if not impossible to
express things like the content of an image hidden inside an archive
compressed with tar, encrypted with pgp, attached to an email stored on
an IMAP server) but it makes the task of writing extraction utilities
easier. The extracted knowledge has proven to be useful (in , and
) despite the limitations.
> *** Multimedia ontology
Do you mean to treat frameCount as frames when applied to videos and as
samples when applied to audio data? What about the vector images?
> *** Contact ontology
Not included in the image.
> *** Corner cases:
> * Complex file formats like databases, mailboxes.
> * Problematic classes like Source code.
I would treat a database file as a DataSource, or even each database
table as a data source. Single table rows would be treated as data
objects since they appear, are modified and disappear. The same goes
for a mailbox. For a filesystem adapter it would be a plain file, but for
a mailbox adapter it would be a data source and emails would be treated
as data objects because they appear inside, are modified and later deleted.
> *** DataObject properties
> These are the most generic ones. We need to decide whether DataObject
> implements DC or DC is placed one level lower.
I implemented the following properties directly, as generic properties of a
DataObject: nie:contributor, nie:creator, nie:description,
nie:identifier, nie:language, nie:publisher, nie:subject, nie:title.
The properties from the DC element set I didn't include directly are:
dc:coverage - I think that spatial and temporal coverage are somehow
beyond the scope of a simple desktop ontology
dc:date - It's too generic in my opinion. I included nie:created and
nie:contentCreated. Both of them are subproperties of dc:date, but there
is no nie:date (at least at the moment...)
dc:format - this one seems out of place here. NIE is all about
describing various formats. There is plenty of detailed vocabulary to
express format. Such a single property is too vague in my opinion.
dc:relation - the same case as with dc:date. There are various relations
like nie:isPartOf, nie:hasPart, but a generic nie:relation has not been
dc:rights - this seemed too abstract at first sight. I didn't
include it. It seems plausible though, since various pieces of copyright
and licensing information are often included in file metadata.
dc:type - the same as with format. NIE is all about types. There is
plenty of vocabulary to express it. dc:type is BTW explicitly meant to be
used with DC type classes, which we don't use at all. (See the entire
 document, especially the lower part, with encoding schemes).
I also used some properties from the extended DC Terms set like
nie:created, nie:hasPart, nie:isPartOf. Alignment with dc terms may be
explored further though (e.g. dcterms:accessRights, dcterms:license,
dcterms:requires, dcterms:isRequiredBy, and many more...)
> *** Property inheritance:
> As you may have noticed, there's no sent/recv date for messages and other
> obvious fields are missing.
> The idea here is that it's impractical to mirror all inherited fields in
> leaf-level classes. I.e. we could have
> contentAuthor<-documentAuthor<-textDocumentAuthor<-sourceCodeAuthor, or we
> could use contentAuthor everywhere.
> That is property renaming is not a sufficient reason to make a subproperty of
> it. All classes/file formats tend to name things quite differently. i.e.
> Author can be: composer, coder, sender whatever. But the meaning is the same.
> A rule of thumb is that parent and child properties must be essentially
> Child must provide some useful and meaningful implications/limitations as
> compared to parent e.g.:
> * controlled-vocabulary/string format/range limitations
> * provide value grouping(generic recipient vs to/cc/bcc in email)
> * record provenance(user-assigned keywords vs author's content-embedded
Well, this understanding seems to be against the RDF 'spirit' (at least
the way I understand it :-). RDF applications can use inference, at
least the simplest kind. One of the basic rules states that:
if a prop b AND prop isASubpropertyOf prop2 THEN a prop2 b...
That's why whenever you state that file hasComposer Beethoven and
hasComposer is a subproperty of hasAuthor, then you'll automatically get
file hasAuthor Beethoven. That's why the subproperty relation is between
generic and specific. We have taken the approach of modelling everything
as specifically as possible, and expressing those common meanings with
subPropertyOf relations. You lose information if you use generic
properties everywhere. E.g. a detailed classical-music-oriented MP3
library application may want to distinguish between composers,
conductors, performers, soloists and orchestras, while a generic media
library is quite content with a "creator" field. With RDF inference you
get this for free.
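
The rule above can be sketched in a few lines of Python. This is only
an illustration of the subproperty rule, not a real RDF store: triples
are plain tuples, and the extra link from hasAuthor to nie:creator is an
assumption added to show that the rule chains through several levels.

```python
def infer_superproperties(triples, sub_property_of):
    """Apply: (a p b) and (p subPropertyOf q) => (a q b), until nothing new."""
    inferred = set(triples)
    frontier = list(triples)
    while frontier:
        subject, prop, obj = frontier.pop()
        for parent in sub_property_of.get(prop, ()):
            derived = (subject, parent, obj)
            if derived not in inferred:
                inferred.add(derived)
                # Derived triples may themselves have superproperties.
                frontier.append(derived)
    return inferred


# hasComposer -> hasAuthor comes from the text above;
# hasAuthor -> nie:creator is a hypothetical extra level.
sub_property_of = {
    "hasComposer": {"hasAuthor"},
    "hasAuthor": {"nie:creator"},
}

facts = {("file", "hasComposer", "Beethoven")}
closure = infer_superproperties(facts, sub_property_of)

for triple in sorted(closure):
    print(triple)
```

The specific statement is kept next to the derived ones, so the
classical-music application still sees hasComposer while the generic
media library can query nie:creator.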
Of course it's all a design decision. Jamie advocates simplicity and
will probably not be interested in using RDF inference in a generic way.
In my opinion though with the above rule it might not be that hard.
Applications using tracker can use a limited subset of the most generic
properties, while some simple translation tool could make use of the
subPropertyOf relations to translate the detailed information into
a form they understand. That would require dividing the ontology into at
least two layers - basic (DC-based) and detailed (domain-specific, audio,
message, video, exif etc.). The developers could then choose if they
want to understand only the basic layer (and make use of the
subPropertyOf relations), and to go into details only in those domains
they find interesting (e.g. all ID3 tags in an MP3 library application).
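
Such a translation tool might look roughly like the following Python
sketch. Everything here is hypothetical: the audio:* property names,
their placement under nie:creator and nie:contributor, and the function
name are assumptions made for the example. It walks up the
subPropertyOf links of each detailed statement and keeps only what
lands in the basic layer.

```python
def to_basic_layer(triples, sub_property_of, basic_properties):
    """Project detailed triples onto the basic layer via subPropertyOf."""
    view = set()
    for subject, prop, obj in triples:
        seen = set()
        stack = [prop]
        while stack:
            current = stack.pop()
            if current in seen:
                continue  # guard against cycles in the hierarchy
            seen.add(current)
            if current in basic_properties:
                view.add((subject, current, obj))
            stack.extend(sub_property_of.get(current, ()))
    return view


# Hypothetical detailed (audio) layer on top of a basic NIE layer.
sub_property_of = {
    "audio:composer": {"nie:creator"},
    "audio:conductor": {"nie:contributor"},
}
basic_properties = {"nie:creator", "nie:contributor", "nie:title"}

detailed = {
    ("track1", "audio:composer", "Beethoven"),
    ("track1", "audio:conductor", "Karajan"),
    ("track1", "nie:title", "Symphony No. 5"),
    ("track1", "audio:bitrate", "192"),  # no basic equivalent, dropped
}

basic_view = to_basic_layer(detailed, sub_property_of, basic_properties)
for triple in sorted(basic_view):
    print(triple)
```

An application that only understands the basic layer would consume
basic_view, while an MP3 library could keep working with the detailed
triples directly.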
I think it would be easier to reach an agreement if the solution
allowed for different levels of detail, both during the creation of
knowledge and during understanding. RDF has been created exactly for
antoni.mylka at dfki.de