[XESAM] Ontology sketch. Feedback needed. This time with attachment.

Thu May 31 04:50:02 PDT 2007

On Thursday 31 May 2007 12:50:24 Antoni Mylka wrote:
> Hello phreedom,
>
> For those of you who don't know me I'm currently working on a desktop
> ontology for the Nepomuk project [1] (Nepomuk Information Element
> Ontology). The current draft is available at [2].
>
> Overall: Mikkel Kamstrup has already noticed that the notation used is
> not typical. The "Classes" are not actually RDFS classes but "property
> categories". Otherwise the distinction you made between a File and
> Content means that these are two separate entities. Could you elaborate
> a bit more?

This is a result of the limitation that only one resource can be used to
describe a file. There are two major class trees: content and source. For
now they are subclasses of DataObject, but this may change, e.g. in favor of
DC. Each file is assigned one content class and one source class. There are
no conflicting deviations from RDFS, just a subset. It might be more accurate
to rename the Source branch to SourcedFromXXX, but I don't think that's
appropriate here and/or will be accepted.

Current limitations:
1) One resource per file or its equivalent, such as a message attachment or
archive content.
2) No multiple inheritance for classes/properties.
3) The RDF object is always a literal. Resources can't be referenced directly
(there are workarounds).

> Evgeny Egorochkin pisze:
> > Hi all,
> >
> > I'd like you to take a look at the ontology sketch
> > http://www.freedesktop.org/wiki/PhreedomDraft?action=AttachFile&do=view&target=viz.png
> >
> > It's not complete. Some fields/classes are dropped intentionally.
> > I'd like to hear some feedback first.
> >
> > Points of interest:
> > *** Sources
> > 	*Source hierarchy
> > 	*Which properties belong to content and which to source?

> Your understanding of a source seems different from mine. In Nepomuk a
> source is something, where data appears, is modified and disappears.
> Sources are monitored for data. DataObjects are extracted from sources.
> Each data source is associated with a listener, that responds to
> following events: new DataObject appeared, a DataObject has been
> modified, a DataObject has been deleted. That's why a Filesystem is a
> source, because files appear and disappear from it. A mailbox is a
> source because emails may appear or disappear from it. Likewise with an
> Addressbook and Contacts, or Calendar and Events. These distinctions are
> arbitrary and depend on the intended usage (e.g. an addressbook file
> may just as well be treated as an ordinary file in a filesystem).
>
> You seem to treat "source" as a physical representation of the content.
> I think it's very different. Setting the Content as the central concept
> in this ontology raises the level of abstraction considerably. In
> Nepomuk I tried to remain on the lower level. The content is an
> attribute of a file, not that a file is only a representation of the
> Content. It's possible that the design decisions made by the founding
> fathers of the Aperture project [3] (also exemplified in products like
> AutoFocus [4] and Aduna Metadata server [5]) have influenced this
> design, but that's a conscious decision. Staying on the lower level may
> limit the expressivity (e.g. it's more difficult if not impossible to
> express things like the content of an image hidden inside an archive
> compressed with tar, encrypted with pgp, attached to an email stored on
> an IMAP server) but it makes the task of writing extraction utilities
> easier. The extracted knowledge has proven to be useful (in [4],[5] and
> [6]) despite the limitations.
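
A minimal sketch of the listener model described in the quoted text above: a
source is monitored, and a listener is told when a data object appeared, was
modified or was deleted. The DataSource class, the scan() method and the event
names are hypothetical, chosen only for illustration; this is not the Nepomuk
or Aperture API.

```python
from typing import Callable, Dict

# A listener receives an event name and the id of the affected data object.
Listener = Callable[[str, str], None]


class DataSource:
    """Hypothetical monitored source (filesystem, mailbox, addressbook...)."""

    def __init__(self, name: str, listener: Listener) -> None:
        self.name = name
        self._listener = listener
        self._snapshot: Dict[str, str] = {}  # object id -> version marker

    def scan(self, current: Dict[str, str]) -> None:
        """Compare the current state with the last snapshot and emit events."""
        for object_id, version in current.items():
            if object_id not in self._snapshot:
                self._listener("appeared", object_id)
            elif self._snapshot[object_id] != version:
                self._listener("modified", object_id)
        for object_id in self._snapshot:
            if object_id not in current:
                self._listener("deleted", object_id)
        self._snapshot = dict(current)


events = []
mailbox = DataSource("mailbox", lambda event, oid: events.append((event, oid)))
mailbox.scan({"msg1": "v1", "msg2": "v1"})  # two messages show up
mailbox.scan({"msg1": "v2"})                # msg1 changed, msg2 is gone
```

The same file could be handed to a filesystem source as a plain object or to a
mailbox source as a container, which is the "depends on intended usage" point.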

The file content is a sequence of bytes in a standard/well-known format. This
is what the content branch describes. All information in the Source branch
comes from file storage/access mechanisms.

A file can be a source and content at once. E.g. we analyze a zip archive: it
is assigned various content properties. Next, its contents are analyzed, and
their source is the archive. The benefit of this approach is the ability to
assign source-specific properties (like compressed size for archives). This
decoupling also serves remote mailboxes and similar cases well.
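
The zip archive case can be sketched roughly as follows. This is only an
illustration under assumptions: the property and class names (sourceClass,
ArchiveItem, compressedSize, contentClass, size) are placeholders and not
agreed Xesam names. Each member gets one flat description with literal values
only, in line with the limitations listed earlier.

```python
import io
import zipfile
from typing import Dict, List, Union

Literal = Union[str, int]


def describe_archive(data: bytes, archive_name: str) -> List[Dict[str, Literal]]:
    """Return one flat description (one resource) per archive member."""
    descriptions = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            descriptions.append({
                "name": info.filename,
                # Source branch: facts supplied by the storage mechanism.
                "sourceClass": "ArchiveItem",        # placeholder class name
                "source": archive_name,              # a literal, not a resource
                "compressedSize": info.compress_size,
                # Content branch: facts about the bytes themselves.
                "contentClass": "Content",           # placeholder class name
                "size": info.file_size,
            })
    return descriptions


# Build a small archive in memory so the example needs no files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("notes.txt", "hello " * 100)

members = describe_archive(buffer.getvalue(), "example.zip")
```

The archive itself would get its own description, with a filesystem source and
an archive content class; only its members carry compressedSize.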

In NIE there's a similar distinction between content-embedded and
source-provided data like creation times, but the RDFS arsenal wasn't used in
full to draw the line.

Another useful aspect of this (if all else fails) is knowing the exact data
source, which gives apps the ability to add custom source-specific
properties etc.

This approach doesn't seem to introduce any significant overhead (if I'm
missing something, I'd like to know). It is already present in a rudimentary
form in some of the participating projects.
For now it will apply to the most obvious cases like archives and email, but 
the sky is the limit.

I don't think that xesam apps will use this source/content framework to
describe an mp3 as a source and container of raw uncompressed data, but who
knows.

> > *** Multimedia ontology
>
> Do you mean to treat frameCount as frames when applied to videos and as
> samples when applied to audio data? What about the vector images?

My idea is as follows:
We describe Samples, Frames and frameCount.
Samples are atomic. A Frame is an ordered set of samples.

For video/image:
A Sample has a bit depth/type (int/float) and a color space (or a color count
for palettes, as in GIF).
A Frame is width x height samples.

For audio:

A Sample has a bit depth/type (int/float).

Channel count can go into the sample definition, in which case the frame size
is assumed to be 1, or it can go into samples per frame. I don't know which
is better at the moment.
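
To make the two audio options concrete, here is a small sketch. The class and
field names (SampleFormat, Media, samples_per_frame and so on) are made up for
illustration and are not proposed ontology names; the numbers are arbitrary
examples.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SampleFormat:
    bit_depth: int                     # bits per sample (per channel value)
    numeric_type: str                  # "int" or "float"
    color_space: Optional[str] = None  # video/image only
    channels: int = 1                  # audio option A: channels in the sample


@dataclass(frozen=True)
class Media:
    sample: SampleFormat
    samples_per_frame: int             # width x height for images and video
    frame_count: int

    def raw_bits(self) -> int:
        """Size of the uncompressed data implied by the description."""
        return (self.sample.bit_depth * self.sample.channels
                * self.samples_per_frame * self.frame_count)


# A single 640x480 image: one frame of width x height samples.
image = Media(SampleFormat(24, "int", color_space="RGB"),
              samples_per_frame=640 * 480, frame_count=1)

# One second of 16-bit stereo audio at 44100 Hz, described both ways.
# Option A: channel count is part of the sample, frame size is 1.
audio_a = Media(SampleFormat(16, "int", channels=2),
                samples_per_frame=1, frame_count=44100)
# Option B: channel count becomes samples per frame.
audio_b = Media(SampleFormat(16, "int"),
                samples_per_frame=2, frame_count=44100)
```

Both audio descriptions imply the same amount of raw data, so the choice is
about which one reads more naturally, not about what can be expressed.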

> > *** Contact ontology
>
> Not included in the image.
>
> > *** Corner cases:
> > 	* Complex file formats like databases, mailboxes.
> > 	* Problematic classes like Source code.
>
> I would treat a database file as a DataSource, or even each database
> table as a data source. Single table rows would be treated as data
> objects since they appear, are modified and disappear. The same goes
> with mailbox. For a filesystem adapter it would be a plain file, but for
> a mailbox adapter it would be a data source and emails would be treated
> as data objects because they appear inside, are modified and later deleted.
>
> > *** DataObject properties
> > 	These are the most generic ones. We need to decide whether DataObject
> > implements DC or DC is placed one level lower.
>
> I implemented following properties directly, as generic properties of a
> DataObject: nie:contributor, nie:creator, nie:description,
> nie:identifier, nie:language, nie:publisher, nie:subject, nie:title.
>
> The properties from the DC element set I didn't include directly are:
>
> dc:coverage - I think that spatial and temporal coverage are somehow
> beyond the scope of a simple desktop ontology
>
> dc:date - It's too generic in my opinion. I included nie:created and
> nie:contentCreated. Both of them are subproperties of dc:date, but there
> is no nie:date (at least at the moment...)
>
> dc:format - this one seems out of place here. NIE is all about
> describing various formats. There is plenty of detailed vocabulary to
> express format. Such a single property is too vague in my opinion.
>
> dc:relation - the same case as with dc:date. There are various relations
> like nie:isPartOf, nie:hasPart, but a generic nie:relation has not been
> included.
>
> dc:rights - this seemed too abstract at the first sight. I didn't
> include it. It seems plausible though, since various pieces of copyright
> and licensing information are often included in file metadata.
>
> dc:type - the same as with format. NIE is all about types. There is
> plenty of vocabulary to express it. dc:type is BTW explicitly meant to be
> used with DC type classes, which we don't use at all. (See the entire
> [7] document, especially the lower part, with encoding schemes).
>
> I also used some properties from the extended DC Terms set like
> nie:created, nie:hasPart, nie:isPartOf. Alignment with dc terms may be
> explored further though (eg. dcterms:accessRights, dcterms:license,
> dcterms:requires, dcterms:isRequiredBy, and many more...)

I'll take a look at DCTerms.

> > *** Property inheritance:
> > 	As you may have noticed, there's no sent/recv date for messages and
> > other obvious fields are missing.
> >
> > 	The idea here is that it's impractical to mirror all inherited fields in
> > leaf-level classes. I.e. we could have
> > contentAuthor<-documentAuthor<-textDocumentAuthor<-sourceCodeAuthor, or
> > we could use contentAuthor everywhere.
> >
> > 	That is, property renaming is not a sufficient reason to make a
> > subproperty of it. All classes/file formats tend to name things quite
> > differently, e.g. Author can be composer, coder, sender, whatever. But
> > the meaning is the same.
> >
> > A rule of thumb is that parent and child properties must be essentially
> > different.
> > Child must provide some useful and meaningful implications/limitations as
> > compared to parent e.g.:
> > * controlled-vocabulary/string format/range limitations
> > * provide value grouping(generic recipient vs to/cc/bcc in email)
> > * record provenance(user-assigned keywords vs author's content-embedded
> > keywords)
>
> Well this understanding seems to be against the RDF 'spirit' (at least
> the way I understand it :-). RDF applications can use inference, at
> least the simplest one. One of the basic rules states that:
>
> if a prop b AND prop isASubpropertyOf prop2 THEN a prop2 b...
>
> That's why whenever you state that file hasComposer Beethoven and
> hasComposer is a subproperty of hasAuthor, then you'll automatically get
> file hasAuthor Beethoven. That's why the subproperty relation is between
> generic and specific. We have taken an approach to model everything as
> specific as possible, and express those common meanings with
> subpropertyOf relations. You lose information if you use generic
> properties everywhere. E.g. a detailed classical-music-oriented MP3
> library application may want to distinguish between composers,
> conductors, performers, soloists and orchestras, while a generic media
> library is quite content with a "creator" field. With RDF inference you
> get this for free.

This is exactly covered by the value grouping criterion. If there are several
kinds of content-specific authors, they get appropriate subproperties. If
there's only one author, be it composer/programmer or whatever else, without
any other implications, there's no point in creating a subproperty.

Subproperties behave exactly as proposed by RDFS.
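
For reference, the rule in question (if a prop b and prop is a subproperty of
prop2, then a prop2 b) can be sketched in a few lines. This is a toy,
single-parent illustration and not a real RDF store; hasComposer and hasAuthor
come from the example above, while the hasAuthor -> dc:creator link is an
assumption added only to show that the rule chains.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]

# child property -> parent property (single parent, no multiple inheritance)
SUBPROPERTY_OF: Dict[str, str] = {
    "hasComposer": "hasAuthor",
    "hasAuthor": "dc:creator",  # assumed link, for illustration only
}


def infer(triples: Set[Triple], subproperty_of: Dict[str, str]) -> Set[Triple]:
    """Add (a, parent, b) for every (a, prop, b) and every ancestor of prop."""
    inferred = set(triples)
    for subject, prop, obj in triples:
        seen = {prop}
        parent = subproperty_of.get(prop)
        # Walk up the chain; 'seen' guards against accidental cycles.
        while parent is not None and parent not in seen:
            inferred.add((subject, parent, obj))
            seen.add(parent)
            parent = subproperty_of.get(parent)
    return inferred


facts = {("file", "hasComposer", "Beethoven")}
result = infer(facts, SUBPROPERTY_OF)
```

A generic app can then query only the most generic property, while a
specialised one still sees hasComposer.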

> Of course it's all a design decision. Jamie advocates simplicity and
> will probably not be interested in using RDF inference in a generic way.
> In my opinion though with the above rule it might not be that hard.
> Applications using tracker can use a limited subset of the most generic
> properties, while some simple translation tool could make use of the
> subPropertyRelations to translate the detailed information into
> understandable one. That would require dividing the ontology into at
> least two layers - basic (DC-based) and detailed (domain-specific, audio,
> message, video, exif etc.). The developers could then choose if they
> want to understand only the basic layer (and make use of the
> subPropertyOf relations), and to go into details only in those domains
> they find interesting (e.g. all ID3 tags in an MP3 library application).
>
> I think it would be easier to reach an agreement if the solution would
> allow for different levels of detail, both during the creation of
> knowledge and during understanding. RDF has been created exactly for
> this purpose.

I expect to have DC + possibly DC Terms, the Xesam core ontology, and tag
mappings for ID3, EXIF and others where appropriate.

> [1]
> <http://nepomuk.semanticdesktop.org>
> [2]
> <http://www.dfki.uni-kl.de/~mylka/>
> [3]
> <http://aperture.sourceforge.net>
> [4]
> <http://www.aduna-software.com:80/technologies/autofocus/overview.view>
> [5]
> <http://www.aduna-software.com:80/technologies/autofocus_server/overview.view>
> [6]
> <http://www.gnowsis.org/>
> [7]
> <http://www.dublincore.org/documents/dcmi-terms/>