[Clipart] Is anyone working on categorizing the existing images?

Bryce Harrington bryce at bryceharrington.com
Tue Jun 8 22:16:38 PDT 2004


On Tue, 8 Jun 2004, Jonadab the Unsightly One wrote:

> Bryce Harrington <bryce at bryceharrington.com> writes:
> 
> > By 'framework' what do you think we need?  
> 
> Some mechanism (possibly a web-based thingy) whereby we can look at a
> tree of existing categories, browse through the uncategorized images,
> and designate categories for them.
> 
> > There's three items we had identified as necessary in prior
> > discussions - one is a metadata format, and IIRC we identified XMP
> > (which builds on Dublin Core, RDF, etc.)  
> 
> To me, it doesn't matter how this information is stored, as long as
> it's possible to tell what categories any given image is in, construct
> a tree of the categories that contain images, and get a list of the
> images and subcategories in any given category.

Got it:

We need a mechanism (possibly web-based) that allows:
   * Browsing the tree of existing categories
   * Browsing uncategorized images
   * Designating categories for images
   * Generating XMP for an item
   * Viewing the categories for a given image
   * Listing the images and subcategories in a given category

> > The other is a way to ensure the appropriate XML snippets get
> > generated and added to the item during upload.  
> 
> Are you talking about having the person who submits the image give it
> a tentative category, or just marking it as uncategorized?  (I can go
> either way on that...)

Yes, I was thinking of allowing the submitter to designate a 'tentative
category'.  Or categories.

Perhaps the ideal would be to give them a list of checkboxes of
available categories to choose from, with a "fill in the blank" at the
bottom.  Or maybe that'd turn into too many checkboxes...  Maybe provide
some sort of navigational system for assigning increasingly finer
subcategories.  Hmm.  Ideas?

> I think we have to realize that categories are going to develop new
> subcategories as images are submitted.  It also seems highly likely
> that some images and even subcategories will belong in multiple
> categories.  It is not difficult to imagine ten or twenty images of
> cooked turkeys being put in a category together, and having that
> category (Turkeys-Cooked) listed under both Food/Meat and
> Holidays/Thanksgiving.  Or whatever.  I do think the index approach
> lends itself well to this; Yahoo has tons of crosslinking.

Yup, agreed.
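
To make the crosslinking idea concrete, here is a minimal sketch in Python (chosen only for illustration; nothing in this thread has settled on it).  It assumes each category simply lists its supercategories and that there are no cycles.  The PARENTS table and the paths_to() helper are made up for this example.

```python
# Hypothetical in-memory model: each category keyword maps to the
# supercategories it is listed under.  A category may have several parents.
PARENTS = {
    "Turkeys-Cooked": ["Meat", "Thanksgiving"],
    "Meat": ["Food"],
    "Thanksgiving": ["Holidays"],
    "Food": [],
    "Holidays": [],
}


def paths_to(category, parents=PARENTS):
    """Return every root-to-category path, joined with '/'."""
    supers = parents.get(category, [])
    if not supers:
        return [category]  # top-level category: the path is just itself
    found = []
    for sup in supers:
        # Extend each path that reaches the parent with this category.
        for prefix in paths_to(sup, parents):
            found.append(prefix + "/" + category)
    return sorted(found)


print(paths_to("Turkeys-Cooked"))
# ['Food/Meat/Turkeys-Cooked', 'Holidays/Thanksgiving/Turkeys-Cooked']
```

The same category shows up under both branches without being stored twice, which is the property the index approach relies on.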

> > So, in thinking about how their findings would apply to the Open
> > Clip Art Library, our categories would be more like "keywords" that
> > can be created and used as needed.  The cost is that we would need
> > people to review keywords that are chosen and adjust the content as
> > needed so things "match up" (so we don't end up with categories of
> > "Fruit" "Fruits" "Fruits & Vegetables" "Vegetables and Fruits",
> > etc.)  
> 
> In libraries this process is called Authority Control, and yeah, it's
> definitely going to be necessary at some point, though probably not
> right away.  If we can get a hierarchical tree of the existing
> categories, then that will make it easier to decide what to adjust.

Sounds good.
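
As a rough idea of how a reviewer could be pointed at keywords that probably need merging, here is a small Python sketch.  The normalization rule (lowercase, ignore "and"/"&", drop a trailing "s", sort the words) is purely an assumption for illustration and is far too crude to apply automatically; it only groups candidates for a human to look at.  The function names are hypothetical.

```python
from collections import defaultdict


def normalize(keyword):
    """Crude normal form: lowercase, drop 'and'/'&', strip a trailing 's', sort words."""
    words = keyword.lower().replace("&", " and ").split()
    stems = sorted(
        w[:-1] if w.endswith("s") and len(w) > 3 else w
        for w in words
        if w != "and"
    )
    return " ".join(stems)


def review_groups(keywords):
    """Group keywords sharing a normal form; keep only groups needing review."""
    groups = defaultdict(list)
    for kw in keywords:
        groups[normalize(kw)].append(kw)
    return {form: kws for form, kws in groups.items() if len(kws) > 1}


candidates = review_groups(
    ["Fruit", "Fruits", "Fruits & Vegetables", "Vegetables and Fruits", "Food"]
)
for form, kws in candidates.items():
    print(form, "->", kws)
```

Whether "Fruits" and "Fruits & Vegetables" should then be merged or kept apart is still a judgment call for whoever does the review.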

> If each category is a keyword (say, for the example above,
> turkeys-cooked is a keyword that can be attached to images of cooked
> turkeys), then the other thing we need in order to construct a
> hierarchy is the ability to take an existing category keyword and
> attach supercategory keywords to it -- that is, we might take the
> turkeys-cooked category and attach both the thanksgiving keyword and
> the meat (or maybe maindish) keyword to it.  Then we'd attach the
> holidays keyword to the thanksgiving category and the food keyword to
> the meat (or maindish) category.  Am I making any sense?

Oh, supercategories -- interesting idea.

Use Case
   1.  Several cooked turkey images are uploaded
   2.  The turkey images are assigned various keywords:
       2 turkeys have keywords = "Thanksgiving"
       1 turkey has keywords = "Holiday"
       3 turkeys have keywords = "Food"
       2 turkeys have keywords = "foods"
       2 turkeys have keywords = "Food" and "Holiday"
   3.  User identifies "Holiday" as a supercategory of "Thanksgiving"
       System adjusts:
       3 turkeys have keywords = "Holiday::Thanksgiving"
       3 turkeys have keywords = "Food"
       2 turkeys have keywords = "foods"
       2 turkeys have keywords = "Food" and "Holiday::Thanksgiving"
   4.  User specifies "foods=>Food"
       3 turkeys have keywords = "Holiday::Thanksgiving"
       5 turkeys have keywords = "Food"
       2 turkeys have keywords = "Food" and "Holiday::Thanksgiving"
   5.  User adds "Food" and "Holiday::Thanksgiving" keywords for all
       turkeys.  So:
       10 turkeys have keywords = "Food" and "Holiday::Thanksgiving"
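
The use case above can be traced with a short Python sketch.  It models each image as a set of keywords and assumes, as step 3 does, that a bare "Holiday" tag is folded into "Holiday::Thanksgiving" along with "Thanksgiving".  All function names are invented for this example.

```python
from collections import Counter


def tally(images):
    """Count images by their (sorted) keyword combination."""
    return Counter(tuple(sorted(kws)) for kws in images)


def set_supercategory(images, sup, sub):
    """Step 3: fold 'sup' and 'sub' tags into the qualified keyword 'sup::sub'."""
    qualified = sup + "::" + sub
    return [{qualified if kw in (sup, sub) else kw for kw in kws} for kws in images]


def merge_keyword(images, old, new):
    """Step 4: replace keyword 'old' with 'new' everywhere."""
    return [{new if kw == old else kw for kw in kws} for kws in images]


def add_keywords(images, extra):
    """Step 5: add every keyword in 'extra' to every image."""
    return [kws | set(extra) for kws in images]


# Step 2: the ten uploaded turkeys and their initial keywords.
images = (
    [{"Thanksgiving"}] * 2 + [{"Holiday"}] + [{"Food"}] * 3
    + [{"foods"}] * 2 + [{"Food", "Holiday"}] * 2
)

images = set_supercategory(images, "Holiday", "Thanksgiving")
images = merge_keyword(images, "foods", "Food")
after_step4 = tally(images)
images = add_keywords(images, ["Food", "Holiday::Thanksgiving"])
after_step5 = tally(images)
print(after_step4)
print(after_step5)
```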

> OTOH, this is not a very complicated wheel.  We're talking about a
> database with two tables.  The one table has records for all the
> images, with a field that uniquely identifies the image in the
> collection (a URI will do), whatever other metadata you want (author
> or source, image file format (e.g., SVG), bw/greyscale/indexed/color,
> whatever), and a categories field where one or more category keywords
> can be put.  The other table you need is for the categories themselves
> and contains the unique identifier (keyword), metadata (description,
> synonyms, ...), and a categories field listing categories it belongs
> to.  If you want to get fancy and go both ways you could also have a
> subcategories field listing categories that belong to it, but then any
> code that modifies the category keywords has to change both places.
> (Also, that's the sort of thing that can be retrofitted later.  We
> don't need to link both ways just to get started.)

So sounds like something like this:

-- One row per image; MySQL requires an AUTO_INCREMENT column to be a key.
CREATE TABLE image (
    id              INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    uri             VARCHAR(255),
    author          VARCHAR(255),
    source          VARCHAR(255),
    format_id       INT
);

-- Lookup table of file formats, referenced by image.format_id.
CREATE TABLE format (
    id              INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name            VARCHAR(255)
);

INSERT INTO format (id, name) VALUES
( 1, 'svg' ),
( 2, 'jpg' ),
( 3, 'wmf' );

-- One row per category keyword.
CREATE TABLE category (
    id              INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name            VARCHAR(255),
    description     TEXT
);

CREATE TABLE category_to_image (
    category_id         INT,
    image_id            INT
);

CREATE TABLE category_inheritance (
    category_id         INT,
    supercategory_id    INT
);

CREATE TABLE category_synonym (
    category_id         INT,
    synonym_category_id INT
);
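
To check that tables like these can answer the questions raised earlier (what is in a category, and what sits below it), here is a minimal sketch using Python's sqlite3 module.  SQLite stands in for MySQL only so the example runs without a server, the tables are trimmed to the columns the queries need, and the link table is named category_inheritance in this sketch.  The sample rows and function names are invented.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE image (id INTEGER PRIMARY KEY, uri TEXT);
CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE category_to_image (category_id INT, image_id INT);
CREATE TABLE category_inheritance (category_id INT, supercategory_id INT);

INSERT INTO category (id, name) VALUES
    (1, 'Food'), (2, 'Meat'), (3, 'Holidays'),
    (4, 'Thanksgiving'), (5, 'Turkeys-Cooked');
INSERT INTO category_inheritance VALUES (2, 1), (4, 3), (5, 2), (5, 4);
INSERT INTO image (id, uri) VALUES (1, 'turkey1.svg'), (2, 'steak.svg');
INSERT INTO category_to_image VALUES (5, 1), (2, 2);
""")


def subcategories(name):
    """Names of the categories listed directly under 'name'."""
    rows = db.execute("""
        SELECT sub.name FROM category AS sub
        JOIN category_inheritance AS ci ON ci.category_id = sub.id
        JOIN category AS sup ON sup.id = ci.supercategory_id
        WHERE sup.name = ? ORDER BY sub.name""", (name,))
    return [r[0] for r in rows]


def images_under(name):
    """URIs of images in the category or in any category below it."""
    rows = db.execute("""
        WITH RECURSIVE below(id) AS (
            SELECT id FROM category WHERE name = ?
            UNION
            SELECT ci.category_id FROM category_inheritance AS ci
            JOIN below ON ci.supercategory_id = below.id
        )
        SELECT DISTINCT image.uri FROM image
        JOIN category_to_image AS c2i ON c2i.image_id = image.id
        JOIN below ON below.id = c2i.category_id
        ORDER BY image.uri""", (name,))
    return [r[0] for r in rows]


print(subcategories("Food"))
print(images_under("Food"))
print(images_under("Holidays"))
```

The recursive query is a convenience of the stand-in database; on a database without recursive queries the same walk down the tree would have to happen in application code.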

> With a simple db library like Class::DBI, this could probably be
> tossed together well enough to start using it in a couple of hours.

I've used DBI quite a bit but not Class::DBI.  If you got the ball
rolling, though, I'm game to give it a go.

> The existing upload facility would need to be modified to create a
> record in the db for every item uploaded, and we'd need a list of the
> existing already-uploaded ones in order to create records for those.
> The records could be created initially with no category keywords, and
> the categories table could start empty, and the same script that adds
> a keyword to an image's record could also create the category record
> if it does not exist already.

Sounds good.
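
A minimal sketch of that "create the category record if it does not exist" step, again in Python with sqlite3 as a stand-in for the real database.  The tag_image() helper, the UNIQUE constraints, and the composite key on the link table are assumptions added for this example; INSERT OR IGNORE is SQLite syntax (MySQL spells it INSERT IGNORE).

```python
import sqlite3


def tag_image(db, uri, keyword):
    """Attach 'keyword' to the image at 'uri', creating records as needed."""
    # Both inserts are no-ops when the row already exists (UNIQUE columns).
    db.execute("INSERT OR IGNORE INTO image (uri) VALUES (?)", (uri,))
    db.execute("INSERT OR IGNORE INTO category (name) VALUES (?)", (keyword,))
    image_id = db.execute(
        "SELECT id FROM image WHERE uri = ?", (uri,)).fetchone()[0]
    category_id = db.execute(
        "SELECT id FROM category WHERE name = ?", (keyword,)).fetchone()[0]
    # The composite primary key keeps a repeated tag from adding a second link.
    db.execute(
        "INSERT OR IGNORE INTO category_to_image (category_id, image_id) "
        "VALUES (?, ?)", (category_id, image_id))
    db.commit()


db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE image (id INTEGER PRIMARY KEY, uri TEXT UNIQUE);
CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE category_to_image (
    category_id INT, image_id INT, PRIMARY KEY (category_id, image_id));
""")

tag_image(db, "turkey1.svg", "turkeys-cooked")
tag_image(db, "turkey2.svg", "turkeys-cooked")
tag_image(db, "turkey1.svg", "turkeys-cooked")  # repeating a tag changes nothing
```

With this shape the categories table can start empty and fill in as images get tagged.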

> Given that there are only a couple of quite simple tables in the
> database, it could even be flat files, or maybe DBD::SQLite.

Nah, I've talked with the freedesktop admin about db's and they said a
mysql db would be no prob to set up (in fact, it's on my todo list to
give him the info to set up one for a bug tracker for us.)

Bryce



