[RFC] Metadata access and storage

Wed Sep 7 14:17:53 PDT 2011

Hi,

On Tue, 2011-09-06 at 17:28 +0200, Anders Feder wrote:
> On 06-09-2011 16:03, Michael Pyne wrote:
> > I mean let's face it, the reason the job hasn't been done yet is because the 
> > job is enormous, not simply because the correct library hasn't been invented 
> > yet. This is all not helped by the fact that most developers have zero 
> > inclination to do the extra work to describe ontologies and use semantic 
> > layers (similar in my mind to the choice between using plain text files for 
> > simple config or using a full-blown SQL database). Simply making up a 
> > different backend/semantic interface is not going to help matters unless that 
> > new interface is /significantly easier/ to develop against (and then why not 
> > just port that interface over to the existing frameworks?)
> What makes you think that the developers are willing to use the
> existing frameworks if only they were easier to use? The concerns I've
> heard over using e.g. Tracker as a backend have mainly been related to
> performance.

In my experience from helping application developers use Tracker
efficiently, ease of use and performance are both relevant concerns. I
have some ideas on where I'd like things to go to improve the experience
for application developers (and users).

One major issue is that almost nobody knows SPARQL (yet), and while the
query language fits well into the world of RDF, it takes some time to
get your head around it. Tracker uses SPARQL as the lowest-level
interface for all queries and updates. However, even if you know SPARQL,
it's generally impossible to predict or optimize the performance of a
query unless you're familiar with the performance characteristics of
the particular SPARQL implementation. In contrast to SQL, there is no
standardized way of managing indices.
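
To make the SQL side of that contrast concrete, here is a small sketch
using Python's sqlite3 module. The table and index names are made up for
illustration. The application creates an index explicitly and can then
ask the engine whether a given query will use it:

```python
import sqlite3

# Hypothetical application table; all names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE song (id INTEGER PRIMARY KEY, title TEXT, artist TEXT)"
)

# With SQL the application manages its indices explicitly...
conn.execute("CREATE INDEX idx_song_artist ON song (artist)")

# ...and can inspect how the engine plans to execute a query.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT title FROM song WHERE artist = ?",
    ("Someone",),
).fetchall()

# The last column of each plan row is the human-readable detail text.
plan_text = " ".join(str(row[-1]) for row in plan)
print(plan_text)
```

The exact wording of the plan differs between SQLite versions, but it
names the index that will be used for the lookup.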

I'm currently favoring a more modular approach where we define a core
storage API that is based on the RDF model but is kept much simpler.
That is, I would no longer use SPARQL (or any other query language) at
the lowest level and instead provide a simple CRUD API at the level of
RDF resources. Applications could then start sharing their data by
porting just their update operations to the storage API while keeping
their optimized SQL queries as they are. With a first-class sync API it
would be easy to keep the RDF store and the application database in sync
at all times. The application database would then merely act as a cache
and could be rebuilt at any time.
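
As a rough illustration of this idea (a sketch only, not an existing
Tracker API), the code below assumes a hypothetical ResourceStore with
create/read/update/delete on resources and a simple change callback
standing in for the sync API. SongCache plays the role of the
application's SQL database. Property names are plain strings here,
where a real RDF store would use predicate URIs.

```python
import sqlite3
from typing import Callable, Dict, List, Optional

Properties = Dict[str, str]
# Called with (uri, properties); properties is None when deleted.
Listener = Callable[[str, Optional[Properties]], None]


class ResourceStore:
    """Hypothetical core store: CRUD on resources, no query language."""

    def __init__(self) -> None:
        self._resources: Dict[str, Properties] = {}
        self._listeners: List[Listener] = []

    def subscribe(self, listener: Listener) -> None:
        # Stand-in for a sync API: listeners see every later change.
        self._listeners.append(listener)

    def _notify(self, uri: str, props: Optional[Properties]) -> None:
        for listener in self._listeners:
            listener(uri, props)

    def create(self, uri: str, props: Properties) -> None:
        if uri in self._resources:
            raise KeyError(f"resource already exists: {uri}")
        self._resources[uri] = dict(props)
        self._notify(uri, dict(props))

    def read(self, uri: str) -> Properties:
        return dict(self._resources[uri])

    def update(self, uri: str, props: Properties) -> None:
        self._resources[uri].update(props)
        self._notify(uri, dict(self._resources[uri]))

    def delete(self, uri: str) -> None:
        del self._resources[uri]
        self._notify(uri, None)

    def snapshot(self) -> Dict[str, Properties]:
        return {uri: dict(p) for uri, p in self._resources.items()}


class SongCache:
    """Application database: a rebuildable cache with its own indices."""

    def __init__(self, store: ResourceStore) -> None:
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE song (uri TEXT PRIMARY KEY, title TEXT, artist TEXT)"
        )
        self.db.execute("CREATE INDEX idx_song_artist ON song (artist)")
        # Rebuild from the store, then follow subsequent changes.
        for uri, props in store.snapshot().items():
            self.apply(uri, props)
        store.subscribe(self.apply)

    def apply(self, uri: str, props: Optional[Properties]) -> None:
        if props is None:
            self.db.execute("DELETE FROM song WHERE uri = ?", (uri,))
        else:
            self.db.execute(
                "INSERT OR REPLACE INTO song VALUES (?, ?, ?)",
                (uri, props.get("title"), props.get("artist")),
            )

    def titles_by(self, artist: str) -> List[str]:
        # The application's own SQL query stays as it is.
        rows = self.db.execute(
            "SELECT title FROM song WHERE artist = ? ORDER BY title",
            (artist,),
        )
        return [row[0] for row in rows]


store = ResourceStore()
store.create("urn:song:1", {"title": "First", "artist": "A"})
cache = SongCache(store)  # built from data already in the store
store.create("urn:song:2", {"title": "Second", "artist": "A"})
store.update("urn:song:1", {"artist": "B"})
```

Only the write path goes through the store; reads keep using SQL.
Throwing the cache away and constructing a new SongCache from the same
store yields the same rows again.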

This core storage API can obviously be implemented by existing RDF
stores, provided they are able to support synchronization. However, if
the existing RDF stores do not fit your purposes for one reason or
another, it will be fairly easy to implement a new store with the same
API, since the most complex part, SPARQL, is not required at this level.
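
To give a feeling for how small such a store could be, here is a sketch
with two interchangeable implementations of one interface. The
ResourceStore protocol and both classes are hypothetical names invented
for this example, and synchronization is left out to keep it short.

```python
import json
import sqlite3
from typing import Dict, Protocol

Properties = Dict[str, str]


class ResourceStore(Protocol):
    """Hypothetical core storage API: four operations, no queries."""

    def create(self, uri: str, props: Properties) -> None: ...
    def read(self, uri: str) -> Properties: ...
    def update(self, uri: str, props: Properties) -> None: ...
    def delete(self, uri: str) -> None: ...


class DictStore:
    """In-memory implementation."""

    def __init__(self) -> None:
        self._data: Dict[str, Properties] = {}

    def create(self, uri: str, props: Properties) -> None:
        if uri in self._data:
            raise KeyError(uri)
        self._data[uri] = dict(props)

    def read(self, uri: str) -> Properties:
        return dict(self._data[uri])

    def update(self, uri: str, props: Properties) -> None:
        self._data[uri].update(props)

    def delete(self, uri: str) -> None:
        self._data.pop(uri, None)


class SqliteStore:
    """Same API, with each resource stored as a JSON blob in SQLite."""

    def __init__(self) -> None:
        self._db = sqlite3.connect(":memory:")
        self._db.execute(
            "CREATE TABLE resource (uri TEXT PRIMARY KEY, props TEXT)"
        )

    def create(self, uri: str, props: Properties) -> None:
        # The primary key rejects duplicates (sqlite3.IntegrityError).
        self._db.execute(
            "INSERT INTO resource VALUES (?, ?)", (uri, json.dumps(props))
        )

    def read(self, uri: str) -> Properties:
        row = self._db.execute(
            "SELECT props FROM resource WHERE uri = ?", (uri,)
        ).fetchone()
        if row is None:
            raise KeyError(uri)
        return json.loads(row[0])

    def update(self, uri: str, props: Properties) -> None:
        merged = {**self.read(uri), **props}
        self._db.execute(
            "UPDATE resource SET props = ? WHERE uri = ?",
            (json.dumps(merged), uri),
        )

    def delete(self, uri: str) -> None:
        self._db.execute("DELETE FROM resource WHERE uri = ?", (uri,))


def rename_contact(store: ResourceStore, uri: str, name: str) -> Properties:
    # Application code depends only on the four CRUD operations.
    store.update(uri, {"name": name})
    return store.read(uri)


results = []
for store in (DictStore(), SqliteStore()):
    store.create("urn:contact:1", {"name": "Old", "email": "a@example.org"})
    results.append(rename_contact(store, "urn:contact:1", "New"))
```

The application function works unchanged against either backend, which
is the property that would let a new store replace an existing one.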

This allows applications to use whatever query language and indices they
want to use. SQL, SPARQL, Berkeley DB, or Lucene, it's all up to the
applications. Whatever they choose, they can still share their data with
other applications using the same or a different query language. And
getting more applications to share their data is what we should be
aiming for, in my opinion.

Regards,
Jürg