[Clipart] Is anyone working on categorizing the existing images?

Bruno Coudoin bruno.coudoin at free.fr
Tue Jun 8 08:59:29 PDT 2004


I don't want to add unnecessary complexity to your project, but I did not 
see how you address I18N. It makes sense for users to enter images with 
categorisation in English, but at some point it would be nice if end 
users could search the database in their own language.

As for the choice between strong and weak categorisation, I would prefer 
weak. The way I see it, categories are just keywords. This lets 
the user create a 'package' of images based on keywords, crossing 
strong category boundaries. If I search for 'farm', I'd like to see 'farm 
animals', 'farm tools' and 'farm houses'. Weak categorisation is also much 
easier to administer: you need no central authority.
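
To make the keyword idea concrete, here is a minimal sketch in Python. 
The image names and keywords are made up for illustration; the only 
point is that a search on one keyword cuts across whatever other 
keywords an image carries.

```python
# Hypothetical data: each image carries a flat set of keywords, no hierarchy.
IMAGES = {
    "cow.svg": {"farm", "animals"},
    "tractor.svg": {"farm", "tools"},
    "barn.svg": {"farm", "houses"},
    "hammer.svg": {"tools"},
}


def search(keyword):
    """Return the sorted names of all images tagged with the keyword."""
    return sorted(name for name, tags in IMAGES.items() if keyword in tags)


print(search("farm"))   # ['barn.svg', 'cow.svg', 'tractor.svg']
print(search("tools"))  # ['hammer.svg', 'tractor.svg']
```

Note that tractor.svg shows up under both 'farm' and 'tools' without 
anyone having decided which category it "really" belongs to.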

I have said this before, but I believe the work done in assetml 
(http://www.ofset.org/assetml) addresses most of the issues 
here. Please take a close look at it.

One weakness of the database, as you propose it here, is that you cannot 
easily install it and distribute it with a GNU/Linux distribution.

On the other hand, an XML file format is easy to parse with any tool.
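
As a rough illustration, the sketch below reads keywords from a small 
XML file using only the Python standard library. The element and 
attribute names (images, image, keyword, file, lang) are invented for 
this example and are not the actual AssetML schema. Putting a lang 
attribute on each keyword is one possible way to let users search in 
their own language.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout, invented for this example; not the AssetML schema.
SAMPLE = """
<images>
  <image file="cow.svg">
    <keyword lang="en">farm</keyword>
    <keyword lang="fr">ferme</keyword>
  </image>
  <image file="hammer.svg">
    <keyword lang="en">tools</keyword>
    <keyword lang="fr">outils</keyword>
  </image>
</images>
"""


def find_images(xml_text, keyword, lang):
    """Return the files whose keyword list has `keyword` in language `lang`."""
    root = ET.fromstring(xml_text)
    hits = []
    for image in root.findall("image"):
        for kw in image.findall("keyword"):
            if kw.get("lang") == lang and kw.text == keyword:
                hits.append(image.get("file"))
                break  # one match per image is enough
    return hits


print(find_images(SAMPLE, "ferme", "fr"))  # ['cow.svg']
```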

Bruno.

Jonadab the Unsightly One wrote:

>Bryce Harrington <bryce at bryceharrington.com> writes:
>
>>By 'framework' what do you think we need?  
>
>Some mechanism (possibly a web-based thingy) whereby we can look at a
>tree of existing categories, browse through the uncategorized images,
>and designate categories for them.
>
>>There's three items we had identified as necessary in prior
>>discussions - one is a metadata format, and IIRC we identified XMP
>>(which builds on Dublin Core, RDF, etc.)  
>
>To me, it doesn't matter how this information is stored, as long as
>it's possible to tell what categories any given image is in, construct
>a tree of the categories that contain images, and get a list of the
>images and subcategories in any given category.
>
>>The other is a way to ensure the appropriate XML snippets get
>>generated and added to the item during upload.  
>
>Are you talking about having the person who submits the image give it
>a tentative category, or just marking it as uncategorized?  (I can go
>either way on that...)
>
>>The third is the procedure for determining and identifying categories.
>>The approach I've advocated would try to mirror Wikipedia, which has
>>been extraordinarily successful at collecting and categorizing a huge
>>amount of encyclopedia content.  The essence for us to take is that
>>rather than inserting everything into a pre-established hierarchy, they
>>use an essentially "flat" file storage approach, and then map "index"
>>pages on top of it.  The advantage here is that in reality the data is
>>relational, not hierarchical, and forcing a hierarchy on it up front
>>would require a lot of contention and divisiveness arguing for one
>>approach vs. another.  It's weird that so much order arises from
>>something anarchical, but it seems to work okay.
>
>I think we have to realize that categories are going to develop new
>subcategories as images are submitted.  It also seems highly likely
>that some images and even subcategories will belong in multiple
>categories.  It is not difficult to imagine ten or twenty images of
>cooked turkeys being put in a category together, and having that
>category (Turkeys-Cooked) listed under both Food/Meat and
>Holidays/Thanksgiving.  Or whatever.  I do think the index approach
>lends itself well to this; Yahoo has tons of crosslinking.
>
>>So, in thinking about how their findings would apply to the Open
>>Clip Art Library, our categories would be more like "keywords" that
>>can be created and used as needed.  The cost is that we would need
>>people to review keywords that are chosen and adjust the content as
>>needed so things "match up" (so we don't end up with categories of
>>"Fruit" "Fruits" "Fruits & Vegetables" "Vegetables and Fruits",
>>etc.)  
>
>In libraries this process is called Authority Control, and yeah, it's
>definitely going to be necessary at some point, though probably not
>right away.  If we can get a hierarchical tree of the existing
>categories, then that will make it easier to decide what to adjust.
>
>>However, I think this approach would give us the flexibility and
>>benefit of being able to handle huge amounts of content without
>>requiring any sort of "Central Category Decision Committee".
>
>Agreed.  We can have a Decentralized Category Decision Committee ;-)
>
>>Then, we would need something that is the analog of the "index" pages,
>>which I surmise would be implemented as "category trees" that anyone can
>>assemble.  These lists would simply specify some form of organizational
>>structure that hooks in content based on the keywords, and perhaps using
>>other metadata properties such as author name, etc. as additional
>>keywords or filters.  
>
>If each category is a keyword (say, for the example above,
>turkeys-cooked is a keyword that can be attached to images of cooked
>turkeys), then the other thing we need in order to construct a
>hierarchy is the ability to take an existing category keyword and
>attach supercategory keywords to it -- that is, we might take the
>turkeys-cooked category and attach both the thanksgiving keyword and
>the meat (or maybe maindish) keyword to it.  Then we'd attach the
>holidays keyword to the thanksgiving category and the food keyword to
>the meat (or maindish) category.  Am I making any sense?
>
>There are other possible approaches.  IMO it doesn't matter which
>approach we take under the hood, as long as it's manageable and allows
>the desired information to be constructed and extracted.
>
>>I know there are other systems out there like this.  It is not
>>unlikely that there is something we can reuse or borrow from in some
>>fashion.  I mentioned Wikipedia already.  Dmoz is another content
>>indexing system to consider.  
>
>OTOH, this is not a very complicated wheel.  We're talking about a
>database with two tables.  The one table has records for all the
>images, with a field that uniquely identifies the image in the
>collection (a URI will do), whatever other metadata you want (author
>or source, image file format (e.g., SVG), bw/greyscale/indexed/color,
>whatever), and a categories field where one or more category keywords
>can be put.  The other table you need is for the categories themselves
>and contains the unique identifier (keyword), metadata (description,
>synonyms, ...), and a categories field listing categories it belongs
>to.  If you want to get fancy and go both ways you could also have a
>subcategories field listing categories that belong to it, but then any
>code that modifies the category keywords has to change both places.
>(Also, that's the sort of thing that can be retrofitted later.  We
>don't need to link both ways just to get started.)
>
>With a simple db library like Class::DBI, this could probably be
>tossed together well enough to start using it in a couple of hours.
>
>The existing upload facility would need to be modified to create a
>record in the db for every item uploaded, and we'd need a list of the
>existing already-uploaded ones in order to create records for those.
>The records could be created initially with no category keywords, and
>the categories table could start empty, and the same script that adds
>a keyword to an image's record could also create the category record
>if it does not exist already.
>
>Given that there are only a couple of quite simple tables in the
>database, it could even be flat files, or maybe DBD::SQLite.
>
>  
>
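
For reference, the two-table design described in the quoted message 
could be sketched roughly as follows. This uses Python's built-in 
sqlite3 module instead of Class::DBI, and the column names, helper 
functions, and the choice to store the categories field as 
space-separated keywords are all assumptions made for the example.

```python
import sqlite3


def open_db():
    """Create an in-memory database with the two tables."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE images (uri TEXT PRIMARY KEY, author TEXT,"
               " categories TEXT NOT NULL DEFAULT '')")
    db.execute("CREATE TABLE categories (keyword TEXT PRIMARY KEY,"
               " description TEXT, categories TEXT NOT NULL DEFAULT '')")
    return db


def _append(db, table, key_col, key, keyword):
    """Add keyword to the space-separated categories field of one record."""
    row = db.execute(f"SELECT categories FROM {table} WHERE {key_col} = ?",
                     (key,)).fetchone()
    words = row[0].split()
    if keyword not in words:
        words.append(keyword)
        db.execute(f"UPDATE {table} SET categories = ? WHERE {key_col} = ?",
                   (" ".join(words), key))


def ensure_category(db, keyword):
    """Create the category record if it does not exist already."""
    db.execute("INSERT OR IGNORE INTO categories (keyword) VALUES (?)",
               (keyword,))


def tag_image(db, uri, keyword):
    """Attach a category keyword to an image, creating records as needed."""
    db.execute("INSERT OR IGNORE INTO images (uri) VALUES (?)", (uri,))
    ensure_category(db, keyword)
    _append(db, "images", "uri", uri, keyword)


def add_supercategory(db, keyword, parent):
    """Attach a supercategory keyword to an existing category."""
    ensure_category(db, keyword)
    ensure_category(db, parent)
    _append(db, "categories", "keyword", keyword, parent)


def ancestors(db, keyword):
    """Return every category reachable upward from keyword."""
    seen, todo = set(), [keyword]
    while todo:
        row = db.execute("SELECT categories FROM categories WHERE keyword = ?",
                         (todo.pop(),)).fetchone()
        for parent in (row[0].split() if row else []):
            if parent not in seen:
                seen.add(parent)
                todo.append(parent)
    return seen


db = open_db()
tag_image(db, "turkey1.svg", "turkeys-cooked")
add_supercategory(db, "turkeys-cooked", "thanksgiving")
add_supercategory(db, "turkeys-cooked", "meat")
add_supercategory(db, "thanksgiving", "holidays")
add_supercategory(db, "meat", "food")
print(sorted(ancestors(db, "turkeys-cooked")))
# ['food', 'holidays', 'meat', 'thanksgiving']
```

Only the upward links are stored here, as the quoted message suggests 
for a first version; a subcategories field could be retrofitted later.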





