[Xesam] Rewriting the build tools

Tue Jul 21 08:06:29 PDT 2009

Xesamies,

Abstract:
Just wanted you to know what the build tools do, before we all embark on
the rewriting quest. There are two potentially useful ones. The first requires
RDFS inference, the second one proper NRL inference. I doubt it's worth
the hassle of rewriting them. Don't want to force anything on anyone,
it's just that dropping it all just because java is 'bad' doesn't seem
like a good idea to me.

Full version:

There are five programs in NIEOntologyUtils. They are all plain CLI
applications with a main method that gets all input via command line
arguments. The ant script doesn't use anything java specific, just the
<java> task to run a program and some basic fs-related stuff like
<copy>, <mkdir> etc. It could have been a shell script but we chose ant
because it works everywhere and we didn't have to maintain both bash and
.bat scripts to run it.

1. RDFS2Tex - the doc generator. It has two "renderers": one that
outputs LaTeX and one that outputs HTML.

Input:
  - a directory containing rdf files,
  - the namespace to document
  - a list of "local" namespaces (inside NIE, like NFO, NMO, NCO...)
Output:
  - an HTML file with the documentation of classes and properties
    from the given namespace,
  - all references to entities from "local" namespaces are turned to
    links, external entities are not turned to links
Algorithm:
  - take triples from all files into a single model
  - apply RDFS inference to get all sub/superclasses and
    sub/superproperties
  - iterate over all classes from the given namespace and generate
    output
  - iterate over all properties from the given namespace and generate
    output
Rewriting:
  - could probably be much clearer with a reasonable object model and
    a template engine, i.e. with ruby/redland and erb, or with
    java/velocity
  - if in java, it could be ported to sesame to drop the number of jars
    in the 'lib' folder from 18 to 3 (sesame and two slf4j jars)
  - but requires RDFS inference to work properly, otherwise you won't
    get those full lists of (sub|super)(classes|properties), AFAIK
    redland doesn't have this and using virtuoso seems like overkill
Quality:
  - mediocre, it cries for a template engine, but it seems to work
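
For illustration, the part of RDFS inference that RDFS2Tex depends on (the
full lists of sub/superclasses and sub/superproperties) can be sketched
without an RDF library. This is only a rough sketch, not the actual
RDFS2Tex code: triples are assumed to be plain (subject, predicate, object)
tuples, the ex: names are made up, and only the transitivity of
rdfs:subClassOf and rdfs:subPropertyOf is covered, not the rest of RDFS
entailment.

```python
# Hypothetical sketch: triples are plain (subject, predicate, object) tuples
# with prefixed names; a real tool would load them from the rdf files.

def transitive_closure(triples, predicate):
    """Map each node to the set of all nodes reachable via `predicate`."""
    direct = {}
    for s, p, o in triples:
        if p == predicate:
            direct.setdefault(s, set()).add(o)

    def ancestors(node, seen):
        # Depth-first walk; `seen` also guards against cyclic hierarchies.
        for parent in direct.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                ancestors(parent, seen)
        return seen

    return {node: ancestors(node, set()) for node in direct}


def invert(closure):
    """Turn a {child: ancestors} map into a {parent: descendants} map."""
    descendants = {}
    for child, parents in closure.items():
        for parent in parents:
            descendants.setdefault(parent, set()).add(child)
    return descendants


# Made-up example ontology (the ex: names are assumptions).
triples = [
    ("ex:C", "rdfs:subClassOf", "ex:B"),
    ("ex:B", "rdfs:subClassOf", "ex:A"),
    ("ex:q", "rdfs:subPropertyOf", "ex:p"),
]

superclasses = transitive_closure(triples, "rdfs:subClassOf")
subclasses = invert(superclasses)
superproperties = transitive_closure(triples, "rdfs:subPropertyOf")
```

The doc generator would then look up each class or property in these maps
while rendering its entry. Whether a hand-rolled closure like this is enough,
or a real RDFS reasoner is needed, depends on which other entailments the
output relies on.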

2. RDFS2NRL - an ugly hack, seems obsolete now

Input
   - the RDF/XML files containing the ontology expressed the way Protege
     3.0 used to work, i.e. an .rdfs file with the classes, an .rdf file
     with instances and a -nrl.ttl file, a hack for PIMO
   - metadata options (author, namespace, etc. all you see in the
     metadatagraphs)
Output
   - trig file with a NRL version of the ontology and the metadata graph
   - two rdf/xml files, one with ontology, second with metadata
   - the ontology is in NRL, the input files contained constructs
     specific to the old protege, like (max|min)Cardinality constraints
     or Inverse/InverseFunctional properties. all of them have been
     converted from the protege namespace to their NRL equivalents
Rewriting
   - I think it is a much better idea to write a 50-line script, now
     that we don't need those conversions from protege
Quality
   - works, but is obsolete in the XESAM/OSCAF setup

3, 4, 5. The converters are completely obsolete and should be deleted.

6. Another tool we developed in NEPOMUK is the nrlvalidator (which
requires unionsail and infsail). There is some CLI interface, but it was
meant to be accessed with a java API. They currently live within the
aperture-tools package in the aperture sourceforge repository.

Input
   - a set of ontologies
   - a piece of data
Output: a report of:
   - ontology errors like:
      - class is a subclass of an undefined class
      - a subproperty has a domain which is not a subclass of the
        domain of the superproperty
      - the same for range
   - data errors like
      - if a property occurs in data, the resources on both sides
        must have a type that is a subclass of the domain/range of
        that property
      - the above also works for datatype properties
      - basic datatype syntax validation (e.g. if a property has a
        xsd:dateTime range, then the value must be a literal, with the
        rdf datatype xsd:dateTime, and its lexical form must match
        the xsd:dateTime regex)
      - cardinality constraint violations
Rewriting:
    - very difficult, as it would require a rule-based inference engine.
      There is one in Jena, and infsail is a port of the Jena engine
      to sesame made by Gunnar Grimnes from DFKI. IMHO nearly impossible
      with redland, dunno about virtuoso or CWM
    - it's IMHO important because such a tool would be the only way
      to ensure we deliver on the backward-compatibility promise
Quality:
    - seems to work pretty well, we used it extensively in Aperture unit
      tests and it allowed us to spot a fair amount of bugs related to
      Aperture generating invalid data
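
To give a feel for the data-level checks listed above, here is a rough
sketch of a domain check, a datatype check and a max-cardinality check.
It is not how nrlvalidator is implemented: it assumes the superclass
closure has already been inferred (which is exactly the hard part), uses
plain dicts instead of an RDF store, uses a simplified stand-in for the
xsd:dateTime regex, and all ex: names and the validate() helper are made up.

```python
import re
from collections import Counter

# Simplified xsd:dateTime pattern (assumption: not the full XSD lexical rules).
DATETIME_RE = re.compile(
    r"-?\d{4,}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})?"
)


def is_instance_of(resource, cls, types, superclasses):
    """True if any asserted type of `resource` is `cls` or a subclass of it."""
    return any(
        t == cls or cls in superclasses.get(t, ())
        for t in types.get(resource, ())
    )


def validate(types, statements, onto):
    """Return a list of error strings for the given literal-valued statements."""
    errors = []
    counts = Counter()
    for s, p, (lexical, datatype) in statements:
        counts[(s, p)] += 1
        domain = onto["domain"].get(p)
        if domain and not is_instance_of(s, domain, types, onto["superclasses"]):
            errors.append(f"{s}: not in domain {domain} of {p}")
        rng = onto["range"].get(p)
        if rng and datatype != rng:
            errors.append(f"{s} {p}: expected datatype {rng}, got {datatype}")
        elif rng == "xsd:dateTime" and not DATETIME_RE.fullmatch(lexical):
            errors.append(f"{s} {p}: malformed xsd:dateTime {lexical!r}")
    for (s, p), n in counts.items():
        limit = onto["max_cardinality"].get(p)
        if limit is not None and n > limit:
            errors.append(f"{s} {p}: {n} values exceed maxCardinality {limit}")
    return errors


# Made-up ontology and data (all ex: names are assumptions).
onto = {
    "superclasses": {"ex:Email": {"ex:Message"}},  # pre-inferred closure
    "domain": {"ex:sentDate": "ex:Message"},
    "range": {"ex:sentDate": "xsd:dateTime"},
    "max_cardinality": {"ex:sentDate": 1},
}
types = {"ex:mail1": {"ex:Email"}, "ex:note1": {"ex:Note"}}
statements = [
    ("ex:mail1", "ex:sentDate", ("2009-07-21T08:06:29Z", "xsd:dateTime")),
    ("ex:mail1", "ex:sentDate", ("yesterday", "xsd:dateTime")),
    ("ex:note1", "ex:sentDate", ("2009-07-21T08:06:29Z", "xsd:string")),
]

errors = validate(types, statements, onto)
```

With this sample data the second statement trips the lexical check and the
cardinality limit, and the third trips the domain and datatype checks. The
ontology-level checks (undefined superclasses, domain/range of
subproperties) are not covered here.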

All kinds of comments welcome.

Antoni Mylka
antoni.mylka at gmail.com