extended attribute standardization

Sun Nov 19 15:47:48 EET 2006

Jos van den Oever wrote:
> 2006/11/19, Michael Burschik <Michael.Burschik at gmx.de>:
>> > For example OpenDocument supports Dublin core, and applications
>> > writing OpenDocument could set Dublin core attributes as xattrs on the
>> > file when saving as a service for indexers etc, but since they are
>> > already persisted in the file, it would be somewhat redundant.
>> But in order to extract that information, you need to be able to read
>> the document format. Since there are too many document formats to
>> number, there would be considerable value in duplicating the information
>> in a readily readable format, such as extended attributes.
>
> Storing information that has already been stored in a file again in
> the extended attributes is problematic for two reasons: 1) it takes
> twice the space 
The storage requirements of metadata are so much smaller than those of 
the actual files that this duplication is hardly a problem.
> 2) the information can be out of sync.
That is certainly true.
> It would be much more sensible to have a standard API for getting the
> information out of the file. A good example of such an API is the
> simple program 'xmlindexer' that comes with Strigi. It takes a file or
> directory as argument and extracts all information in a universal XML
> format, independent of the type of file.
Doesn't this require some kind of parser plugin for each and every type 
of file you intend to index? So instead of duplicating metadata, you are 
duplicating parsers? Or are you able to reuse the parser of the 
application that actually wrote the file? If you want to query a large 
number of files, then speed will be of considerable importance. Parsing 
complex files will certainly take a lot longer than reading extended 
attributes.
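
To illustrate how little work the reading side would be, here is a
minimal sketch in Python. It assumes a Linux system (os.getxattr is
Linux-only) and a purely hypothetical naming scheme, "user.dublincore.*",
since no standard names have been agreed on. The getxattr parameter is
only there so the function can be tried without xattr support.

```python
import os

# Hypothetical attribute names; no naming standard is implied.
DC_PREFIX = "user.dublincore."
DC_FIELDS = ("title", "creator", "date")


def read_dc_xattrs(path, getxattr=None):
    """Return the Dublin Core fields found as xattrs on path."""
    if getxattr is None:
        getxattr = os.getxattr  # only available on Linux
    found = {}
    for field in DC_FIELDS:
        try:
            raw = getxattr(path, DC_PREFIX + field)
        except OSError:
            # Attribute not set, or the filesystem lacks xattr support.
            continue
        found[field] = raw.decode("utf-8", errors="replace")
    return found
```

No knowledge of the document format is needed; the indexer only has to
know the attribute names.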

In an ideal world, the application would save the metadata in some 
easily accessible format and location whenever the file is modified. In 
this context, I would hesitate to call ID3 or EXIF tags easily 
accessible, for example.
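
As a rough sketch of what "save the metadata whenever the file is
modified" could look like, the following Python writes the file and then
mirrors the metadata into xattrs, together with a stamp of the file's
modification time so that a reader can notice when the two have drifted
out of sync. The attribute names and both function names are invented for
illustration, and the sketch assumes Linux (os.setxattr and os.getxattr).

```python
import os

# Hypothetical attribute names, chosen only for illustration.
DC_PREFIX = "user.dublincore."
STAMP = "user.dublincore.mtime_ns"


def save_with_xattrs(path, data, metadata, setxattr=None):
    """Write data, then mirror metadata into xattrs with an mtime stamp."""
    if setxattr is None:
        setxattr = os.setxattr  # only available on Linux
    with open(path, "wb") as handle:
        handle.write(data)
    for field, value in metadata.items():
        setxattr(path, DC_PREFIX + field, value.encode("utf-8"))
    # Assumption: writing xattrs leaves the file's mtime untouched.
    stamp = str(os.stat(path).st_mtime_ns)
    setxattr(path, STAMP, stamp.encode("ascii"))


def xattrs_are_fresh(path, getxattr=None):
    """Return True if the stamp matches the file's current mtime."""
    if getxattr is None:
        getxattr = os.getxattr  # only available on Linux
    try:
        stamp = getxattr(path, STAMP)
    except OSError:
        return False  # no stamp, so the xattrs cannot be trusted
    return stamp.decode("ascii") == str(os.stat(path).st_mtime_ns)
```

A program that modifies the file without updating the attributes changes
the mtime, so xattrs_are_fresh() returns False and an indexer can fall
back to parsing the file itself.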
>
> A few weeks back, I submitted a proposal to make a test set for
> unifying and validating the output of applications that extract
> metadata from files. The proposal included code for performing the
> validation. It requires that the output of the validation programs has
> the same format. I chose XML because it is easy to parse and it is
> easy for existing programs to add XML as an output format. Currently,
> I use this XML output for unit testing the extraction of data from
> files and it is very valuable in protecting against regressions.
>
> Cheers,
> Jos
>
Regards

Michael