mimetype standardisation by testsets

Mon Oct 30 00:38:26 EET 2006

Hi All,

Quite a few programs extract metadata from files: Nautilus, Konqueror,
Beagle, Strigi, and Tracker, to name a few, extract metadata from all
available files. To make sure these programs behave in the same way,
it would be good to standardize on the metadata that is extracted
from each file type.

For this purpose I've written a small program that takes the output of
a metadata extraction tool, compares it with a reference output, and
generates a report from the comparison. This program is useful for
unit testing within one project (Strigi in my case), but it is also
useful for ensuring that different programs extract exactly the same
metadata from identical files.

This program is written in Java, and the first version, along with a
minimal test set, is attached to this email. The program can be
run as follows:
  cd code; java MetaDataValidator mymetadataprogram ../data
This will generate an index.html file with a comparison report.
(Note that you need to compile the program first with 'make'.)

The 'mymetadataprogram' argument is the name of a small wrapper you
need to write around your extraction tool. The wrapper should write
the metadata as XML to standard output. The XML format looks like
this:
<metadata>
 <file uri="email/mail" mtime='1161810601'>
  <value name='size'>2801</value>
  <value name='author'>Pierre</value>
  <value name='author'>Jane</value>
 </file>
 <file uri="email/mail2" mtime='1161810538'>
  <value name='subject'>Holiday reenforcements</value>
 </file>
</metadata>
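
To illustrate the format above, here is a minimal sketch of how a
wrapper might emit it. The class and method names (MetadataXmlWriter,
escape, fileElement) are made up for this example and are not part of
the attached code; a real wrapper would call the extraction tool
instead of using hard-coded values.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of MetaDataValidator.
public class MetadataXmlWriter {

    // Escape characters that may not appear verbatim in XML text or
    // in quoted attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    // Render one <file> element. A field with several values becomes
    // several <value> elements that share the same name.
    static String fileElement(String uri, long mtime,
                              Map<String, List<String>> fields) {
        StringBuilder sb = new StringBuilder();
        sb.append(" <file uri='").append(escape(uri))
          .append("' mtime='").append(mtime).append("'>\n");
        for (Map.Entry<String, List<String>> e : fields.entrySet()) {
            for (String v : e.getValue()) {
                sb.append("  <value name='").append(escape(e.getKey()))
                  .append("'>").append(escape(v)).append("</value>\n");
            }
        }
        sb.append(" </file>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hard-coded sample data; a real wrapper would get these
        // values from the extraction tool it wraps.
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("size", Arrays.asList("2801"));
        fields.put("author", Arrays.asList("Pierre", "Jane"));

        StringBuilder out = new StringBuilder("<metadata>\n");
        out.append(fileElement("email/mail", 1161810601L, fields));
        out.append("</metadata>\n");
        System.out.print(out);
    }
}
```

Escaping matters here because extracted values such as subjects or
author names can contain '&' or '<', which would otherwise make the
output unparsable.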

The comparison follows a few simple rules that basically all come from
one rule: there should be no conflicting information. It's not so bad
to be incomplete or overcomplete, but having different data under the
same field name is really wrong. So:
- if a program extracts metadata under a name that is present in the
reference and with the same value (or the same multiple values), that
is correct.
- if it extracts metadata under a name that is in the reference
but with a different value, that is the biggest possible error.
- if it extracts data under a name not present in the reference,
that's not such a big deal; the new field type should go into the
reference at some point.
- if it doesn't find a certain file, e.g. because it is nested in
another file, that is also not such a big deal.
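
The per-field rules above could be sketched roughly as follows. This
is only an illustration, not the code in the attached program: the
names ComparisonSketch, Verdict, and classify are made up, and it
assumes that the order of multiple values does not matter. Missing
files and missing fields are simply not classified here, since they
count as minor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the comparison rules; not the actual
// MetaDataValidator implementation.
public class ComparisonSketch {

    enum Verdict { CORRECT, CONFLICT, NEW_FIELD }

    // Classify one extracted field of one file.
    // reference: field name -> values listed in the reference output.
    // name, extracted: the field name and values the tool reported.
    static Verdict classify(Map<String, List<String>> reference,
                            String name, List<String> extracted) {
        List<String> expected = reference.get(name);
        if (expected == null) {
            // Name unknown to the reference: minor, a candidate for
            // inclusion in the reference later.
            return Verdict.NEW_FIELD;
        }
        // Assumption: value order is irrelevant, so compare sorted
        // copies (duplicates are still significant).
        List<String> a = new ArrayList<>(expected);
        List<String> b = new ArrayList<>(extracted);
        Collections.sort(a);
        Collections.sort(b);
        // Same name with different values is the most serious error.
        return a.equals(b) ? Verdict.CORRECT : Verdict.CONFLICT;
    }

    public static void main(String[] args) {
        Map<String, List<String>> reference = new HashMap<>();
        reference.put("size", Arrays.asList("2801"));
        reference.put("author", Arrays.asList("Pierre", "Jane"));

        System.out.println(classify(reference, "author",
                Arrays.asList("Jane", "Pierre")));
        System.out.println(classify(reference, "size",
                Arrays.asList("2802")));
        System.out.println(classify(reference, "subject",
                Arrays.asList("Hello")));
    }
}
```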

The report output is currently very simple. I've attached an example
output. The tough part now will be to build the wrappers around the
different extraction programs and to build up the test set. Once the
test set grows we can start making the reports nicer.

What do you think of this initiative?

Cheers,
Jos
-------------- next part --------------
A non-text attachment was scrubbed...
Name: testset.tar.bz2
Type: application/x-bzip2
Size: 21077 bytes
Desc: not available
Url : http://lists.freedesktop.org/archives/xdg/attachments/20061029/c7c34cbc/attachment.bin