[Clipart] Is anyone working on categorizing the existing images?

Tue Jun 8 08:06:46 PDT 2004

Bryce Harrington <bryce at bryceharrington.com> writes:

> By 'framework' what do you think we need?  

Some mechanism (possibly a web-based thingy) whereby we can look at a
tree of existing categories, browse through the uncategorized images,
and designate categories for them.

> There's three items we had identified as necessary in prior
> discussions - one is a metadata format, and IIRC we identified XMP
> (which builds on Dublin Core, RDF, etc.)  

To me, it doesn't matter how this information is stored, as long as
it's possible to tell what categories any given image is in, construct
a tree of the categories that contain images, and get a list of the
images and subcategories in any given category.

> The other is a way to ensure the appropriate XML snippets get
> generated and added to the item during upload.  

Are you talking about having the person who submits the image give it
a tentative category, or just marking it as uncategorized?  (I can go
either way on that...)

> The third is the procedure for determining and identifying categories.
> The approach I've advocated would try to mirror Wikipedia, which has
> been extraordinarily successful at collecting and categorizing a huge
> amount of encyclopedia content.  The essence for us to take is that
> rather than inserting everything into a pre-established hierarchy, they
> use an essentially "flat" file storage approach, and then map "index"
> pages on top of it.  The advantage here is that in reality the data is
> relational, not hierarchical, and forcing a hierarchy on it up front
> would require a lot of contention and divisiveness arguing for one
> approach vs. another.  It's weird that so much order arises from
> something anarchical, but it seems to work okay.

I think we have to realize that categories are going to develop new
subcategories as images are submitted.  It also seems highly likely
that some images and even subcategories will belong in multiple
categories.  It is not difficult to imagine ten or twenty images of
cooked turkeys being put in a category together, and having that
category (Turkeys-Cooked) listed under both Food/Meat and
Holidays/Thanksgiving.  Or whatever.  I do think the index approach
lends itself well to this; Yahoo has tons of crosslinking.

> So, in thinking about how their findings would apply to the Open
> Clip Art Library, our categories would be more like "keywords" that
> can be created and used as needed.  The cost is that we would need
> people to review keywords that are chosen and adjust the content as
> needed so things "match up" (so we don't end up with categories of
> "Fruit" "Fruits" "Fruits & Vegetables" "Vegetables and Fruits",
> etc.)  

In libraries this process is called Authority Control, and yeah, it's
definitely going to be necessary at some point, though probably not
right away.  If we can get a hierarchical tree of the existing
categories, then that will make it easier to decide what to adjust.
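
As a rough illustration of what a first pass at authority control
could look like, here is a minimal Python sketch that groups keyword
variants which probably mean the same thing.  The normalize() and
candidate_groups() helpers are hypothetical, and the normalization
rule (lowercase, drop "and"/"&", strip a trailing "s", ignore word
order) is a crude assumption that will misfire on some words.  It
only suggests candidates for a person to review.

```python
import re
from collections import defaultdict

def normalize(keyword):
    """Reduce a keyword to a crude comparison key (heuristic only)."""
    # Keep alphanumeric words; this also drops "&" and hyphens.
    words = re.findall(r"[a-z0-9]+", keyword.lower())
    words = [w for w in words if w != "and"]
    # Naive singularization: strip one trailing "s" from longer words.
    words = [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words]
    return " ".join(sorted(set(words)))

def candidate_groups(keywords):
    """Return only the groups that contain more than one variant."""
    groups = defaultdict(list)
    for kw in keywords:
        groups[normalize(kw)].append(kw)
    return {key: kws for key, kws in groups.items() if len(kws) > 1}

keywords = ["Fruit", "Fruits", "Fruits & Vegetables",
            "Vegetables and Fruits", "Turkeys-Cooked"]
groups = candidate_groups(keywords)
for key, variants in groups.items():
    print(key, "<-", variants)
```

Whether "Fruit" and "Fruits & Vegetables" should then be merged,
nested, or left alone is still a human decision.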

> However, I think this approach would give us the flexibility and
> benefit of being able to handle huge amounts of content without
> requiring any sort of "Central Category Decision Committee".

Agreed.  We can have a Decentralized Category Decision Committee ;-)

> Then, we would need something that is the analog of the "index" pages,
> which I surmise would be implemented as "category trees" that anyone can
> assemble.  These lists would simply specify some form of organizational
> structure that hooks in content based on the keywords, and perhaps using
> other metadata properties such as author name, etc. as additional
> keywords or filters.  

If each category is a keyword (say, for the example above,
turkeys-cooked is a keyword that can be attached to images of cooked
turkeys), then the other thing we need in order to construct a
hierarchy is the ability to take an existing category keyword and
attach supercategory keywords to it -- that is, we might take the
turkeys-cooked category and attach both the thanksgiving keyword and
the meat (or maybe maindish) keyword to it.  Then we'd attach the
holidays keyword to the thanksgiving category and the food keyword to
the meat (or maindish) category.  Am I making any sense?
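
To make that concrete, here is a minimal Python sketch of the idea.
The parents mapping and the paths_to_top() helper are hypothetical
names used only for illustration; the keywords are the ones from the
example above.  Each category keyword simply lists the supercategory
keywords attached to it, and the hierarchy falls out by walking
upward.

```python
# Each category keyword maps to the supercategory keywords attached to it.
parents = {
    "turkeys-cooked": ["thanksgiving", "meat"],
    "thanksgiving": ["holidays"],
    "meat": ["food"],
    "holidays": [],
    "food": [],
}

def paths_to_top(keyword, parents):
    """Return every path from a top-level category down to keyword.

    Assumes the keyword graph has no cycles; a real tool would need
    to check for them.
    """
    supers = parents.get(keyword, [])
    if not supers:
        return [[keyword]]
    paths = []
    for sup in supers:
        for path in paths_to_top(sup, parents):
            paths.append(path + [keyword])
    return paths

listing = ["/".join(p) for p in paths_to_top("turkeys-cooked", parents)]
print(listing)
# ['holidays/thanksgiving/turkeys-cooked', 'food/meat/turkeys-cooked']
```

The same category shows up under both branches without being stored
twice, which is the crosslinking behaviour described earlier.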

There are other possible approaches.  IMO it doesn't matter which
approach we take under the hood, as long as it's manageable and allows
the desired information to be constructed and extracted.

> I know there are other systems out there like this.  It is not
> unlikely that there is something we can reuse or borrow from in some
> fashion.  I mentioned Wikipedia already.  Dmoz is another content
> indexing system to consider.  

OTOH, this is not a very complicated wheel.  We're talking about a
database with two tables.  The one table has records for all the
images, with a field that uniquely identifies the image in the
collection (a URI will do), whatever other metadata you want (author
or source, image file format (e.g., SVG), bw/greyscale/indexed/color,
whatever), and a categories field where one or more category keywords
can be put.  The other table you need is for the categories themselves
and contains the unique identifier (keyword), metadata (description,
synonyms, ...), and a categories field listing categories it belongs
to.  If you want to get fancy and go both ways you could also have a
subcategories field listing categories that belong to it, but then any
code that modifies the category keywords has to change both places.
(Also, that's the sort of thing that can be retrofitted later.  We
don't need to link both ways just to get started.)
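
For illustration, here is a minimal sketch of those two tables using
Python's standard sqlite3 module.  The column names, the example
URI, the space-separated keyword convention for the categories field,
and the members() helper are all assumptions, not a settled design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE images (
    uri        TEXT PRIMARY KEY,         -- uniquely identifies the image
    author     TEXT,
    format     TEXT,                     -- e.g. SVG
    categories TEXT NOT NULL DEFAULT ''  -- space-separated keywords
);
CREATE TABLE categories (
    keyword     TEXT PRIMARY KEY,        -- the category's unique identifier
    description TEXT,
    categories  TEXT NOT NULL DEFAULT '' -- supercategory keywords
);
""")

conn.execute("INSERT INTO images VALUES (?, ?, ?, ?)",
             ("http://example.org/turkey1.svg", "anon", "SVG",
              "turkeys-cooked"))
conn.executemany("INSERT INTO categories VALUES (?, ?, ?)", [
    ("turkeys-cooked", "Cooked turkeys", "thanksgiving meat"),
    ("thanksgiving", "", "holidays"),
    ("meat", "", "food"),
])

def members(conn, keyword):
    """Return (image URIs, subcategory keywords) directly in keyword."""
    images = [uri for uri, cats in conn.execute(
                  "SELECT uri, categories FROM images ORDER BY uri")
              if keyword in cats.split()]
    subcats = [kw for kw, cats in conn.execute(
                   "SELECT keyword, categories FROM categories ORDER BY keyword")
               if keyword in cats.split()]
    return images, subcats

print(members(conn, "turkeys-cooked"))
print(members(conn, "meat"))
```

Scanning every row is fine at this size.  A separate link table
would scale better, but it goes beyond the two tables described
here.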

With a simple db library like Class::DBI, this could probably be
tossed together well enough to start using it in a couple of hours.

The existing upload facility would need to be modified to create a
record in the db for every item uploaded, and we'd need a list of the
existing already-uploaded ones in order to create records for those.
The records could be created initially with no category keywords, and
the categories table could start empty, and the same script that adds
a keyword to an image's record could also create the category record
if it does not exist already.
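
A minimal sketch of that flow, again in Python with the standard
sqlite3 module.  The function names (open_db, register_upload,
add_keyword), the trimmed-down schema, and the space-separated
keyword field are hypothetical, and this is not how the existing
upload facility works.

```python
import sqlite3

def open_db():
    """Create an in-memory database with the two (simplified) tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE images (
        uri        TEXT PRIMARY KEY,
        categories TEXT NOT NULL DEFAULT ''  -- space-separated keywords
    );
    CREATE TABLE categories (
        keyword    TEXT PRIMARY KEY,
        categories TEXT NOT NULL DEFAULT ''
    );
    """)
    return conn

def register_upload(conn, uri):
    """Record an uploaded (or already-uploaded) image with no keywords."""
    conn.execute("INSERT OR IGNORE INTO images (uri) VALUES (?)", (uri,))

def add_keyword(conn, uri, keyword):
    """Attach keyword to an image, creating the category if needed."""
    row = conn.execute("SELECT categories FROM images WHERE uri = ?",
                       (uri,)).fetchone()
    if row is None:
        raise KeyError(uri)
    # Create the category record only if it does not exist already.
    conn.execute("INSERT OR IGNORE INTO categories (keyword) VALUES (?)",
                 (keyword,))
    keywords = row[0].split()
    if keyword not in keywords:
        keywords.append(keyword)
        conn.execute("UPDATE images SET categories = ? WHERE uri = ?",
                     (" ".join(keywords), uri))

conn = open_db()
register_upload(conn, "http://example.org/turkey1.svg")
add_keyword(conn, "http://example.org/turkey1.svg", "turkeys-cooked")
```

Running register_upload() over a list of the already-uploaded items
would backfill their records the same way.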

Given that there are only a couple of quite simple tables in the
database, it could even be flat files, or maybe DBD::SQLite.

-- 
$;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}}
split//,"ten.thgirb\@badanoj$/ --";$\=$ ;-> ();print$/