Fwd: Re: [Clipart] metadata: aren't the keywords actually categories (and can keywords be added)?

Fri Apr 15 21:55:43 PDT 2005

Mike Traum wrote:
> Andrew,
> I understand the issues you raise with the big xml file proposal.
> But, I don't think symbolic links will work, and I do think that hard
> links is a very messy solution to the problem. It doesn't scale well
> - what happens if openclipart has 50,000 images? So many duplications
> in the package will just end up in bloat.

What's messy about hard links?  They're a pretty good approximation to 
shared copy-on-write, with no duplication at all.  In fact, having the same file 
in multiple directories is what they're for.  I have no idea what Windows 
thinks of them, though (haven't used Windows in a long time).
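
A minimal sketch of what I mean, in Python.  The function name and the 
directory names are made up for illustration; only os.link() and os.stat() 
are real calls.  It assumes a filesystem that supports hard links.

```python
import os
import tempfile


def link_into_categories(source, category_dirs):
    """Hard-link `source` into each directory and return the new paths."""
    linked = []
    for directory in category_dirs:
        os.makedirs(directory, exist_ok=True)
        target = os.path.join(directory, os.path.basename(source))
        os.link(source, target)  # new directory entry, same data on disk
        linked.append(target)
    return linked


def demo():
    """Create one file, link it into two hypothetical category directories,
    and return how many names now point at the same data."""
    with tempfile.TemporaryDirectory() as root:
        original = os.path.join(root, "cat.svg")
        with open(original, "w") as handle:
            handle.write("<svg/>")
        link_into_categories(
            original,
            [os.path.join(root, "animals"), os.path.join(root, "pets")],
        )
        return os.stat(original).st_nlink
```

The link count goes up, but the image data is stored only once.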

> How about a flat file structure with no path whatsoever? I think this
> would make the most sense.

That's nearly as useless for people using a file browser as one big XML 
file.  Perhaps each package could single out one keyword as the most 
important (if the package doesn't specify, just pick one at random) and it 
could wind up in the corresponding category directory.  Or when the 
categories are made (essentially from search queries) they could be given a 
(perhaps implicit) priority.

That last suggestion in detail:
When making a localized package, one specifies a category tree; each 
category is specified by a set of keywords (or more generally, by a boolean 
metadata expression, regular expression, etcetera).  The file gets dumped 
in the directory corresponding to the first matching category (files that 
fall through go in the root "miscellaneous" category).  A big XML catalog 
goes in the root directory as well.  An interactive tool, then, simply 
presents images as being in all matching categories.  A file browser sees 
the files in an appropriate category (but only one).  The system could even 
include (optional) hardlinks so a file appears in all the category 
directories where it belongs.
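
Roughly, as a Python sketch.  Everything here is an assumption for 
illustration: the Category class, the function names, and the rule that a 
category matches when the image shares at least one keyword with it (a real 
tool might use a boolean expression instead).  The root directory is 
written as "." to stand for the "miscellaneous" fall-through.

```python
from dataclasses import dataclass

ROOT = "."  # stands for the root "miscellaneous" category


@dataclass(frozen=True)
class Category:
    path: str            # directory inside the package, e.g. "animals"
    keywords: frozenset  # assumed rule: any shared keyword is a match


def matching_categories(image_keywords, categories):
    """All category paths the image belongs to, in category-tree order.
    This is what an interactive tool would show."""
    wanted = set(image_keywords)
    return [c.path for c in categories if c.keywords & wanted]


def primary_category(image_keywords, categories):
    """The single directory the file is actually stored in: the first
    matching category, or the root if nothing matches."""
    matches = matching_categories(image_keywords, categories)
    return matches[0] if matches else ROOT


def hardlink_targets(image_keywords, categories):
    """Extra directories where optional hard links could be placed."""
    return matching_categories(image_keywords, categories)[1:]
```

The order of the category list is the implicit priority: whichever 
category is listed first wins the file.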

Keep in mind that OCAL and other SVG clipart collections could reach the 
multi-gigabyte mark, with hundreds of thousands of files - imagine a street 
maps collection, or mass-conversion of multiple-CD professional clipart 
collections.

A suitably designed collection format could accommodate offline storage as 
well - the index is online, but when you pick an image it tells you "That 
image is on CD #347, go get it".
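
As a rough Python sketch, assuming each index entry records which volume 
holds the file.  The IndexEntry fields and the locate() function are 
hypothetical, not part of any existing format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IndexEntry:
    name: str
    volume: Optional[int]  # CD number; None means stored with the index
    path: str              # path of the file on that volume


def locate(entry, mounted_volumes):
    """Return (available, text): the path if the file can be read now,
    otherwise a message telling the user which CD to fetch."""
    if entry.volume is None or entry.volume in mounted_volumes:
        return True, entry.path
    return False, f"That image is on CD #{entry.volume}, go get it"
```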

> Regarding the application I'm proposing, sure, I'll be able to
> support pretty much anything. But, this all seems to be up in the air
> right now, and I'd like to see some data definitions of proposed xml
> files and a roadmap on the package structure before I start a project
> based on all of that.

There's really no hurry to settle on an all-encompassing database design. 
We'll throw the first one away anyway.  We should just be careful to record 
enough information in the images that we won't have to go through and 
manually change all of them later.  The whole point of using XML is that if 
we change something, it's easy to write a hack that reads the one format 
and writes the other.
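
For example, a hack of that kind might look like the Python below.  Both 
formats are invented for illustration (nobody has proposed these element 
names): the "old" one keeps a comma-separated <keywords> element per 
<image>, the "new" one uses one <keyword> element per word.

```python
import xml.etree.ElementTree as ET


def convert(old_xml):
    """Read the hypothetical old catalog format, return the new one."""
    root = ET.fromstring(old_xml)
    # Copy the matches into a list first, since the tree is edited below.
    for image in list(root.iter("image")):
        old = image.find("keywords")
        if old is None:
            continue
        image.remove(old)
        for word in (old.text or "").split(","):
            word = word.strip()
            if word:
                ET.SubElement(image, "keyword").text = word
    return ET.tostring(root, encoding="unicode")
```

Anything the old format recorded that the hack does not touch (attributes, 
other elements) is carried over unchanged.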

Andrew