[XESAM] Ontology sketch. Feedback needed. This time with attachment.

Antoni Mylka antoni.mylka at dfki.uni-kl.de
Thu May 31 02:50:24 PDT 2007


Hello phreedom,

For those of you who don't know me, I'm currently working on a desktop
ontology for the Nepomuk project [1] (the Nepomuk Information Element
Ontology, NIE). The current draft is available at [2].

Overall: Mikkel Kamstrup has already noticed that the notation used is
not typical. The "Classes" are not actually RDFS classes but "property
categories". Otherwise, the distinction you made between a File and
Content would mean that these are two separate entities. Could you
elaborate a bit more?

Evgeny Egorochkin wrote:
> 
> Hi all,
> 
> I'd like you to take a look at the ontology sketch
> http://www.freedesktop.org/wiki/PhreedomDraft?action=AttachFile&do=view&target=viz.png
> 
> It's not complete. Some fields/classes are dropped intentionally.
> I'd like to hear some feedback first.
> 
> Points of interest:
> *** Sources
> 	*Source hierarchy
> 	*Which properties belong to content and which to source?

Your understanding of a source seems different from mine. In Nepomuk a
source is something where data appears, is modified and disappears.
Sources are monitored for data. DataObjects are extracted from sources.
Each data source is associated with a listener that responds to the
following events: a new DataObject has appeared, a DataObject has been
modified, a DataObject has been deleted. That's why a Filesystem is a
source: files appear in it and disappear from it. A mailbox is a
source because emails may appear in it or disappear from it. Likewise
with an Addressbook and Contacts, or a Calendar and Events. These
distinctions are arbitrary and depend on the intended usage (e.g. an
address book file may just as well be treated as an ordinary file in a
filesystem).
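
A minimal sketch of this listener model in Python. All class and method
names here (DataObject, RecordingListener, DataSource, scan) are
hypothetical and chosen for illustration only; they are not taken from
Nepomuk or Aperture. The source is modeled as a snapshot that is compared
against the previous scan, which is only one possible way to detect the
three events.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DataObject:
    """Something extracted from a source (a file, an email, a contact)."""
    uri: str
    properties: Dict[str, str] = field(default_factory=dict)


class RecordingListener:
    """Hypothetical listener that just records the three kinds of events."""

    def __init__(self) -> None:
        self.events: List[Tuple[str, str]] = []

    def object_new(self, obj: DataObject) -> None:
        self.events.append(("new", obj.uri))

    def object_modified(self, obj: DataObject) -> None:
        self.events.append(("modified", obj.uri))

    def object_deleted(self, uri: str) -> None:
        self.events.append(("deleted", uri))


class DataSource:
    """A place where DataObjects appear, are modified and disappear."""

    def __init__(self, name: str, listener: RecordingListener) -> None:
        self.name = name
        self.listener = listener
        self._known: Dict[str, Dict[str, str]] = {}

    def scan(self, current: Dict[str, Dict[str, str]]) -> None:
        """Compare a fresh snapshot with the previous one and notify the listener."""
        for uri, props in current.items():
            if uri not in self._known:
                self.listener.object_new(DataObject(uri, dict(props)))
            elif self._known[uri] != props:
                self.listener.object_modified(DataObject(uri, dict(props)))
        for uri in sorted(self._known.keys() - current.keys()):
            self.listener.object_deleted(uri)
        # Remember this snapshot for the next scan.
        self._known = {uri: dict(props) for uri, props in current.items()}


# A mailbox treated as a source: emails appear, change and disappear.
listener = RecordingListener()
mailbox = DataSource("mailbox", listener)
mailbox.scan({"imap://inbox/1": {"subject": "Hello"}})
mailbox.scan({"imap://inbox/1": {"subject": "Hello again"},
              "imap://inbox/2": {"subject": "Second mail"}})
mailbox.scan({"imap://inbox/2": {"subject": "Second mail"}})
```

The same DataSource class could wrap a directory, an address book or a
calendar; only the code that produces the snapshot would differ.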

You seem to treat a "source" as a physical representation of the content.
I think that is a very different notion. Setting the Content as the
central concept in this ontology raises the level of abstraction
considerably. In Nepomuk I tried to remain on the lower level: the
content is an attribute of a file, rather than the file being merely a
representation of the Content. It's possible that the design decisions
made by the founding
fathers of the Aperture project [3] (also exemplified in products like 
AutoFocus [4] and Aduna Metadata server [5]) have influenced this 
design, but that's a conscious decision. Staying on the lower level may 
limit the expressivity (e.g. it's more difficult if not impossible to 
express things like the content of an image hidden inside an archive 
compressed with tar, encrypted with pgp, attached to an email stored on 
an IMAP server) but it makes the task of writing extraction utilities 
easier. The extracted knowledge has proven to be useful (in [4],[5] and 
[6]) despite the limitations.

> *** Multimedia ontology

Do you mean to treat frameCount as frames when applied to videos and as
samples when applied to audio data? What about vector images?

> *** Contact ontology

Not included in the image.

> *** Corner cases:
> 	* Complex file formats like databases, mailboxes.
> 	* Problematic classes like Source code.

I would treat a database file as a DataSource, or even each database 
table as a data source. Single table rows would be treated as data 
objects since they appear, are modified and disappear. The same goes
for a mailbox. For a filesystem adapter it would be a plain file, but for
a mailbox adapter it would be a data source and emails would be treated 
as data objects because they appear inside, are modified and later deleted.

> *** DataObject properties
> 	These are the most generic ones. We need to decide whether DataObject 
> implements DC or DC is placed one level lower.

I implemented the following properties directly, as generic properties of a
DataObject: nie:contributor, nie:creator, nie:description, 
nie:identifier, nie:language, nie:publisher, nie:subject, nie:title.

The properties from the DC element set that I didn't include directly are:

dc:coverage - I think that spatial and temporal coverage are somewhat
beyond the scope of a simple desktop ontology

dc:date - It's too generic in my opinion. I included nie:created and 
nie:contentCreated. Both of them are subproperties of dc:date, but there 
is no nie:date (at least at the moment...)

dc:format - this one seems out of place here. NIE is all about
describing various formats. There is plenty of detailed vocabulary to 
express format. Such a single property is too vague in my opinion.

dc:relation - the same case as with dc:date. There are various relations 
like nie:isPartOf, nie:hasPart, but a generic nie:relation has not been 
included.

dc:rights - this seemed too abstract at first sight. I didn't
include it. It seems plausible though, since various pieces of copyright 
and licensing information are often included in file metadata.

dc:type - the same as with dc:format. NIE is all about types. There is
plenty of vocabulary to express it. dc:type is, BTW, explicitly meant to
be used with the DC type classes, which we don't use at all. (See the
entire [7] document, especially the lower part, with encoding schemes.)

I also used some properties from the extended DC Terms set, like
nie:created, nie:hasPart, nie:isPartOf. Alignment with DC Terms may be
explored further though (e.g. dcterms:accessRights, dcterms:license,
dcterms:requires, dcterms:isRequiredBy, and many more...)

  	
> *** Property inheritance:
> 	As you may have noticed, there's no sent/recv date for messages and other 
> obvious fields are missing.
> 
> 	The idea here is that it's impractical to mirror all inherited fields in 
> leaf-level classes. I.e. we could have 
> contentAuthor<-documentAuthor<-textDocumentAuthor<-sourceCodeAuthor, or we 
> could use contentAuthor everywhere.
> 
> 	That is property renaming is not a sufficient reason to make a subproperty of 
> it. All classes/file formats tend to name things quite differently. i.e. 
> Author can be: composer, coder, sender whatever. But the meaning is the same.
> 
> A rule of thumb is that parent and child properties must be essentially 
> different. 
> Child must provide some useful and meaningful implications/limitations as 
> compared to parent e.g.:
> * controlled-vocabulary/string format/range limitations
> * provide value grouping(generic recipient vs to/cc/bcc in email)
> * record provenance(user-assigned keywords vs author's content-embedded 
> keywords)
> 

Well, this understanding seems to be against the RDF 'spirit' (at least
the way I understand it :-). RDF applications can use inference, at
least the simplest kind. One of the basic rules states that:

if a prop b AND prop isASubpropertyOf prop2 THEN a prop2 b...

So whenever you state that file hasComposer Beethoven and hasComposer
is a subproperty of hasAuthor, you automatically get file hasAuthor
Beethoven. The subproperty relation thus links the specific to the
generic. We have taken the approach of modeling everything as
specifically as possible and expressing the common meanings with
subPropertyOf relations. You lose information if you use generic
properties everywhere. E.g. a detailed classical-music-oriented MP3
library application may want to distinguish between composers,
conductors, performers, soloists and orchestras, while a generic media
library is quite content with a "creator" field. With RDF inference you
get this for free.
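
A minimal sketch of that single rule in Python, with triples held as
plain tuples rather than in a real RDF store. The function name
infer_subproperties is made up for this example, and the second
subproperty statement (hasAuthor under creator) is an assumption added
only to show that the rule chains.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]
SUBPROPERTY_OF = "subPropertyOf"


def infer_subproperties(triples: Set[Triple]) -> Set[Triple]:
    """Apply 'a prop b AND prop subPropertyOf prop2 => a prop2 b' until nothing new appears."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        # All (child, parent) property pairs currently known.
        pairs = {(s, o) for s, p, o in inferred if p == SUBPROPERTY_OF}
        for s, p, o in list(inferred):
            for child, parent in pairs:
                if p == child and (s, parent, o) not in inferred:
                    inferred.add((s, parent, o))
                    changed = True
    return inferred


facts = {
    ("file", "hasComposer", "Beethoven"),
    ("hasComposer", SUBPROPERTY_OF, "hasAuthor"),
    # Assumed extra level, for illustration only.
    ("hasAuthor", SUBPROPERTY_OF, "creator"),
}
closure = infer_subproperties(facts)
```

The specific statement is kept alongside the derived ones, so a
classical-music application can still ask for hasComposer while a
generic one only looks at creator.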

Of course it's all a design decision. Jamie advocates simplicity and
will probably not be interested in using RDF inference in a generic way.
In my opinion, though, with the above rule it might not be that hard.
Applications using Tracker can use a limited subset of the most generic
properties, while some simple translation tool could make use of the
subPropertyOf relations to translate the detailed information into a
form they understand. That would require dividing the ontology into at
least two layers - basic (DC-based) and detailed (domain-specific: audio,
message, video, exif etc.). The developers could then choose whether they
want to understand only the basic layer (and make use of the
subPropertyOf relations), or to go into details only in those domains
they find interesting (e.g. all ID3 tags in an MP3 library application).
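
Such a translation tool could be sketched roughly as follows, in
Python. This is only an illustration under assumptions: the id3: and
music: property names and the helper names (basic_ancestor,
translate_to_basic) are invented, each property is assumed to have at
most one parent, and only nie:creator, nie:contributor and nie:title are
placed in the basic layer here.

```python
from typing import Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Basic (DC-based) layer that a simple application understands.
BASIC_LAYER: Set[str] = {"nie:creator", "nie:contributor", "nie:title"}

# Detailed layer: hypothetical properties mapped to their parent property.
SUBPROPERTY_OF: Dict[str, str] = {
    "id3:composer": "music:composer",
    "music:composer": "nie:creator",
    "music:conductor": "nie:contributor",
}


def basic_ancestor(prop: str) -> Optional[str]:
    """Walk up the subPropertyOf chain until a basic-layer property is found."""
    seen: Set[str] = set()
    current: Optional[str] = prop
    while current is not None and current not in BASIC_LAYER:
        if current in seen:  # guard against cycles in the mapping
            return None
        seen.add(current)
        current = SUBPROPERTY_OF.get(current)
    return current


def translate_to_basic(triples: List[Triple]) -> List[Triple]:
    """Rewrite detailed triples into basic-layer ones, dropping those with no basic parent."""
    result: List[Triple] = []
    for subject, prop, value in triples:
        basic = basic_ancestor(prop)
        if basic is not None:
            result.append((subject, basic, value))
    return result


detailed = [
    ("file:ninth.mp3", "id3:composer", "Beethoven"),
    ("file:ninth.mp3", "music:conductor", "Karajan"),
    ("file:ninth.mp3", "id3:bpm", "120"),  # no basic-layer parent
]
basic = translate_to_basic(detailed)
```

An application that cares about details would read the detailed triples
directly and skip the translation step for its own domain.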

I think it would be easier to reach an agreement if the solution
allowed for different levels of detail, both during the creation of
knowledge and during understanding. RDF has been created exactly for
this purpose.

Antoni Mylka
antoni.mylka at dfki.de

[1]
<http://nepomuk.semanticdesktop.org>
[2]
<http://www.dfki.uni-kl.de/~mylka/>
[3]
<http://aperture.sourceforge.net>
[4]
<http://www.aduna-software.com:80/technologies/autofocus/overview.view>
[5]
<http://www.aduna-software.com:80/technologies/autofocus_server/overview.view>
[6]
<http://www.gnowsis.org/>
[7]
<http://www.dublincore.org/documents/dcmi-terms/>

