[Clipart] metadata: aren't the keywords actually categories (and can keywords be added)?

Thu Apr 14 09:54:19 PDT 2005

Mike Traum wrote:
> Hi,
> Based on the way the packages are laid out (the paths of the files),
> it looks like <dc:subject/> is being used to store the category, and
> the open clip art project doesn't really have a concept of keywords
> yet.

The keywords are used to inform categories, and a given item may have
multiple keywords (and so may be in multiple categories) -- but also,
in the long term, localized packages may have categories that draw
in items with multiple keywords.

> For example, if you had the following categories:
> - Animals->Dogs
> - People->Men

> and you had an image with a dog and a man in it, how would this be
> represented?

The image would probably have the keywords "animal", "mammal", "dog",
"people", "man", and possibly others, as relevant.  The Animals
category would draw in all items with the "animal" keyword, unless they
are in one of its subcategories, such as Dogs.  The Dogs category would
draw in all images with the "dog" keyword (and so this image would
appear there).  The People category would draw in all images with the
"people" keyword, unless they are in one of its subcategories; the Men
category, which we currently don't have, would presumably draw in
images with the "man" keyword, and so this hypothetical image would
appear there.  So it would be in the Dogs category, and also in the
Men category.

> And, if that man was balding, would you have a whole category for
> balding men? 

If we had fifty images of balding men, we could consider that.  However,
keywords that don't correspond to a currently extant category should not
create a problem for our organizational system, and indeed, we have
numerous images in the collection with various keywords that do not
correspond to extant categories.  Also bear in mind that the category
hierarchies will be different for different locales, once we have the
localization stuff in place.

> I think <dc:type/> should be used to store the categories in a more
> formal way. This could be done, for example, as such:
> <dc:type>
>   <rdf:Bag>
>     <rdf:li>Animals.Dogs</rdf:li>
>     <rdf:li>People.Men</rdf:li>
>   </rdf:Bag>
> </dc:type>

Then what happens when the categories change (due to more images being 
added to the collection and categories needing to be subdivided, for 
instance), or when we want to localize the collection?

No, the image itself would have this in its metadata:
<rdf:Bag>
    <rdf:li>animal</rdf:li>
    <rdf:li>mammal</rdf:li>
    <rdf:li>dog</rdf:li>
    <rdf:li>pet</rdf:li>
    <rdf:li>people</rdf:li>
    <rdf:li>man</rdf:li>
    [and possibly some others]
</rdf:Bag>

Then we have a hierarchy-en.xml or somesuch that does
something along these lines:
<category name="People">
    <keyword title="people" />
    <category name="Men"><keyword title="man" /></category>
    <category name="Women"><keyword title="woman" /></category>
</category>
<category name="Animals">
    <keyword title="animal" />
    <category name="Cats and Dogs">
       <keyword title="cat" />
       <keyword title="dog" />
    </category>
</category>

(We don't have this system entirely in place yet; but it is what we
are working toward.  What we have right now is a quick hack that
simulates this for English only.)
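
A minimal sketch of how the lookup described above might work, in Python.
This is an illustration only, not the project's actual code: the wrapping
<hierarchy> root element and the function name categories_for are
assumptions, and the category rule is the one described earlier (an item
appears in a category if it has one of that category's keywords, unless it
already appears in one of its subcategories).

```python
import xml.etree.ElementTree as ET

# Hypothetical hierarchy file contents.  The <hierarchy> root element is
# an assumption; the fragment above has no single root.
HIERARCHY_XML = """
<hierarchy>
  <category name="People">
    <keyword title="people" />
    <category name="Men"><keyword title="man" /></category>
    <category name="Women"><keyword title="woman" /></category>
  </category>
  <category name="Animals">
    <keyword title="animal" />
    <category name="Cats and Dogs">
      <keyword title="cat" />
      <keyword title="dog" />
    </category>
  </category>
</hierarchy>
"""


def categories_for(node, image_keywords, path=()):
    """Return the category paths under `node` that the image belongs to.

    An image is listed in a category only if it carries one of that
    category's keywords and is not already listed in a subcategory.
    """
    found = []
    for cat in node.findall("category"):
        cat_path = path + (cat.get("name"),)
        in_subcats = categories_for(cat, image_keywords, cat_path)
        if in_subcats:
            # Already placed deeper in the tree, so skip the parent.
            found.extend(in_subcats)
        elif any(kw.get("title") in image_keywords
                 for kw in cat.findall("keyword")):
            found.append(cat_path)
    return found


root = ET.fromstring(HIERARCHY_XML)
keywords = {"animal", "mammal", "dog", "pet", "people", "man"}
for cat_path in categories_for(root, keywords):
    print(" -> ".join(cat_path))
```

With the keyword list from the example image, this prints "People -> Men"
and "Animals -> Cats and Dogs".  Keywords such as "mammal" and "pet" match
no category and are simply ignored.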

> And then <dc:subject/> could be used for keywords that are outside of
> the scope of categories.

Which keywords correspond to categories might differ, depending
on localization issues.  Also, which categories are subcategories
of which other categories is virtually guaranteed to differ.

> You may still want a defined list of keywords, but it could be
> far more expansive than the current list.

The list of keywords we are working from now is *mostly* based on
keywords that are actually used by some images in the collection
already.  A few were added in anticipation of images being
submitted with those keywords, but if numerous images are added
with a given keyword (unless it exactly duplicates the semantic of
a keyword we already have -- in which case we would unify them with
the authority control tool), we would add it to the list at some
point.  The upload facility provides some blanks that you can fill
in; any keyword at all can be specified, although we ask that
keywords be relevant to the content of the image in question.  So
the list is not rigidly defined.  We keep the list because it is
convenient to have around, not because those are the only keywords
we will accept.
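
As a rough illustration of the unification step, here is a Python sketch.
It does not describe the real authority control tool: the CANONICAL table,
its entries, and the lowercasing and trimming are all assumptions made for
the example.  Keywords that are not in the table pass through unchanged,
since any keyword can be submitted.

```python
# Hypothetical table mapping variant keywords to the canonical keyword.
CANONICAL = {
    "dogs": "dog",
    "canine": "dog",
    "men": "man",
}


def unify_keywords(keywords):
    """Normalize keywords and drop duplicates, preserving first-seen order."""
    seen = set()
    result = []
    for kw in keywords:
        kw = kw.strip().lower()        # assumption: keywords are case-insensitive
        kw = CANONICAL.get(kw, kw)     # unknown keywords are kept as-is
        if kw and kw not in seen:
            seen.add(kw)
            result.append(kw)
    return result


print(unify_keywords(["Dog", "canine", "pet", "men", "balding"]))
# prints ['dog', 'pet', 'man', 'balding']
```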

Note that the keywords are intended mainly to be used in this
capacity, for organizing the collection.  There is a more
general-purpose description field that can be used for storing
text intended for human consumption, if desired.  Our authority
control process will probably unify the keywords irrespective
of language, using mainly English words as keywords, because that
will make collection management easier -- but we will be able to
use those keywords to generate categories in other languages,
by defining the necessary hierarchy XML for each language.
Indeed, we could localize differently for different countries
with the same language if their cultures are different enough
that this is deemed important.  I imagine that one unified
"English" package will be good enough for pretty much the
entire English-speaking world, but I can imagine that this
may not be the case for all languages.
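
To make the localization idea concrete, here is a simplified Python
sketch.  It flattens each locale's hierarchy into a table from category
name to keywords, which is a simplification of the nested hierarchy XML.
The French category names and the LOCALE_CATEGORIES structure are
hypothetical, chosen only to show one set of English keywords producing
different categories per locale.

```python
# Hypothetical per-locale tables: category name -> keywords it draws in.
# The keywords stay in English; only the categories are localized.
LOCALE_CATEGORIES = {
    "en": {"Cats and Dogs": {"cat", "dog"}, "Men": {"man"}},
    "fr": {"Chiens": {"dog"}, "Chats": {"cat"}, "Hommes": {"man"}},
}


def localized_categories(image_keywords, locale):
    """Return the sorted category names for an image in the given locale."""
    table = LOCALE_CATEGORIES[locale]
    return sorted(name for name, kws in table.items()
                  if kws & image_keywords)


image_keywords = {"dog", "man"}
print(localized_categories(image_keywords, "en"))
print(localized_categories(image_keywords, "fr"))
```

The same image lands in "Cats and Dogs" and "Men" for English, but in
"Chiens" and "Hommes" for the hypothetical French package, without any
change to the image's own metadata.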