[Clipart] Is anyone working on categorizing the existing images?

Bryce Harrington bryce at bryceharrington.com
Sat Jun 5 12:17:26 PDT 2004


On Sat, 5 Jun 2004, Jonadab the Unsightly One wrote:

Hi Jonadab!

> Is anyone working on developing working categories and subcategories
> for the images that exist now?  I'm thinking if we collect a whole ton
> of images, dividing them all into categories later will be a major
> undertaking; if we categorize them as we go, it may be easier.  This
> is something I'd potentially be willing to work on, if there were a
> framework in place for it.

Yes, that's definitely true.  If something can be put in place now, it'll
save a ton of hassle later on.  There has been much discussion about
categorization and how to do it, but I think what we really need right
now is to just start getting the existing stuff categorized.  Nobody has
taken on that task yet, but it definitely feels like now is the time to
do it.

By 'framework', what do you think we need?  There are three items we
identified as necessary in prior discussions.  The first is a metadata
format, and IIRC we identified XMP (which builds on Dublin Core, RDF,
etc.).  The second is a way to ensure the appropriate XML snippets get
generated and added to the item during upload.  I think someone started
working on a PHP script to do this a few weeks ago, but I am not sure of
its current status.
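
To make the "XML snippet" idea concrete, here is a rough sketch in
Python (not the PHP script mentioned above, whose details I don't know).
It builds a small RDF block that stores keywords in a Dublin Core
dc:subject bag.  The function name and the sample values are made up for
illustration, and title and creator are written as plain text for
brevity; a real XMP packet has more structure than this.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Use the conventional prefixes when serializing.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dc", DC_NS)


def build_keyword_metadata(title, creator, keywords):
    """Return a simplified RDF/XML string with title, creator, keywords.

    Hypothetical helper: this is an illustration, not a complete XMP
    packet.
    """
    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(rdf, f"{{{RDF_NS}}}Description")
    ET.SubElement(desc, f"{{{DC_NS}}}title").text = title
    ET.SubElement(desc, f"{{{DC_NS}}}creator").text = creator
    # Keywords go in dc:subject as an unordered rdf:Bag of rdf:li items.
    subject = ET.SubElement(desc, f"{{{DC_NS}}}subject")
    bag = ET.SubElement(subject, f"{{{RDF_NS}}}Bag")
    for keyword in keywords:
        ET.SubElement(bag, f"{{{RDF_NS}}}li").text = keyword
    return ET.tostring(rdf, encoding="unicode")


if __name__ == "__main__":
    # Sample values are placeholders only.
    print(build_keyword_metadata("Apple", "Example Artist", ["Fruit", "Food"]))
```

An upload script could embed a block like this in the image file, so the
keywords travel with the item.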

The third is the procedure for determining and identifying categories.
The approach I've advocated would try to mirror Wikipedia, which has
been extraordinarily successful at collecting and categorizing a huge
amount of encyclopedia content.  The essence for us to take is that
rather than inserting everything into a pre-established hierarchy, they
use an essentially "flat" file storage approach, and then map "index"
pages on top of it.  The advantage is that it fits the data, which in
reality is relational, not hierarchical; forcing a hierarchy on it up
front would invite a lot of contention and divisiveness over one
approach vs. another.  It's weird that so much order arises from
something anarchical, but it seems to work okay.

So, in thinking about how their findings would apply to the Open Clip
Art Library, our categories would be more like "keywords" that can be
created and used as needed.  The cost is that we would need people to
review keywords that are chosen and adjust the content as needed so
things "match up" (so we don't end up with categories of "Fruit",
"Fruits", "Fruits & Vegetables", "Vegetables and Fruits", etc.).  However,
I think this approach would give us the flexibility and benefit of being
able to handle huge amounts of content without requiring any sort of
"Central Category Decision Committee".
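
As a sketch of how the review step might be assisted, the Python below
groups keywords that look like variants of each other so a person can
decide which to merge.  The normalization rules (lowercasing, treating
"&" as "and", dropping a trailing "s", ignoring word order) are my own
assumptions and are deliberately crude; the function names are
hypothetical.

```python
import re
from collections import defaultdict


def normalize(keyword):
    """Reduce a keyword to a crude comparison key (a sorted tuple)."""
    # Split compound keywords on "&" or the word "and".
    parts = re.split(r"\s*(?:&|\band\b)\s*", keyword.lower().strip())
    # Naive singularization: drop one trailing "s".  This is wrong for
    # many English words and is only meant to flag candidates.
    singular = [
        p[:-1] if p.endswith("s") and len(p) > 3 else p
        for p in parts
        if p
    ]
    return tuple(sorted(singular))


def find_variants(keywords):
    """Return groups of distinct keywords that share a normalized key."""
    groups = defaultdict(set)
    for keyword in keywords:
        groups[normalize(keyword)].add(keyword)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


if __name__ == "__main__":
    sample = ["Fruit", "Fruits", "Fruits & Vegetables", "Vegetables and Fruits"]
    for key, names in find_variants(sample).items():
        print(key, names)
```

A tool like this only proposes candidates; someone still has to decide
whether two keywords really mean the same thing.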

Then, we would need something that is the analog of the "index" pages,
which I surmise would be implemented as "category trees" that anyone can
assemble.  These lists would simply specify some form of organizational
structure that hooks in content based on the keywords, and perhaps using
other metadata properties such as author name, etc. as additional
keywords or filters.  Perhaps it could be as simple as a text file with
each row a key/value pair where the key is a category path and the value
is a query statement (such as 'keyword is Fruit').  These indexes could
then be put on the web and shared with others, with shared editing
rights as in a wiki.  We would also want a mechanism to #include or href
from one index to another, to permit creation of topical indexes as well
as higher level general indexes.
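
Here is a minimal Python sketch of the text-file index idea, under some
assumptions of mine: each row is written as "category/path = field is
value", the only query form is "field is value", and images are plain
dicts with "name", "keywords", and "author" entries.  All names and
sample data are hypothetical, and the #include mechanism is left out.

```python
def parse_index(text):
    """Parse rows of 'category/path = field is value' into tuples."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank rows
        path, sep, query = line.partition("=")
        field, is_word, value = query.strip().partition(" is ")
        if not sep or not is_word:
            raise ValueError(f"cannot parse index row: {raw!r}")
        entries.append((path.strip(), field.strip().lower(), value.strip()))
    return entries


def matches(image, field, value):
    """Check one image record against a 'field is value' query."""
    if field == "keyword":
        return value in image.get("keywords", [])
    # Any other field (e.g. author) is compared for exact equality.
    return image.get(field) == value


def build_tree(index_text, images):
    """Map each category path to the names of the images it selects."""
    tree = {}
    for path, field, value in parse_index(index_text):
        hits = [img["name"] for img in images if matches(img, field, value)]
        tree.setdefault(path, []).extend(hits)
    return tree


if __name__ == "__main__":
    index_text = """\
Food/Fruit = keyword is Fruit
Food/Vegetables = keyword is Vegetable
By author/Example Artist = author is Example Artist
"""
    images = [
        {"name": "apple.svg", "keywords": ["Fruit", "Food"], "author": "Example Artist"},
        {"name": "carrot.svg", "keywords": ["Vegetable"], "author": "Another Artist"},
    ]
    for path, names in build_tree(index_text, images).items():
        print(path, names)
```

Because the index is just text, anyone could write their own and the
same flat pool of images would show up under different trees.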

I know there are other systems out there like this.  It is not unlikely
that there is something we can reuse or borrow from in some fashion.
I mentioned Wikipedia already.  Dmoz is another content indexing system
to consider.  Amazon.com's Listmania feature shows how individual users
can set up their own lists based on particular topics, with relations to
other users' lists.  And there's plenty of other relationship-management
software out there.  So the basic idea here is pretty well established;
our trick would be to find a way to capture the essence of the
capability in an easy manner that doesn't get too complex or hard to
manage.

Bryce




