[Clipart] Metadata use and bug-fixes.

Francis Bond fcbond at gmail.com
Wed Oct 31 21:45:37 PDT 2007


G'day,

First let me start with a hearty thank you to everyone involved in
the openclipart project --- I am not at all good at graphics, and
having a source of usable illustrations is great.

My main interest is in word meaning and I am experimenting with adding
sense information to the clipart metadata.

For example, for the picture seal_sek_.svg, it would get a tag seal#9,
using the wordnet senses listed below:

# S: (n) sealing wax#1, seal#1 (fastener consisting of a resinous
composition that is plastic when warm; used for sealing documents and
parcels and letters)
# S: (n) seal#2, stamp#9 (a device incised to make an impression; used
to secure a closing or to authenticate documents)
# S: (n) seal#3, sealskin#1 (the pelt or fur (especially the underfur)
of a seal) "a coat of seal"
# S: (n) Navy SEAL#1, SEAL#4 (a member of a Naval Special Warfare unit
who is trained for unconventional warfare) "SEAL is an acronym for Sea
Air and Land"
# S: (n) seal#5 (a stamp affixed to a document (as to attest to its
authenticity or to seal it)) "the warrant bore the sheriff's seal"
# S: (n) cachet#1, seal#6, seal of approval#1 (an indication of
approved or superior status)
# S: (n) seal#7 (a finishing coat applied to exclude moisture)
# S: (n) seal#8 (fastener that provides a tight and perfect closure)
# S: (n) seal#9 (any of numerous marine mammals that come on shore to
breed; chiefly of cold regions)

(Actually the full tag would be more like wn-30-en-seal#n#9, showing
the wordnet version, language and part-of-speech).
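
As a rough illustration of the full tag form, here is a minimal Python
sketch that builds and parses tags shaped like wn-30-en-seal#n#9.  The
exact layout is only what the example above suggests; the names
SenseTag, format_tag and parse_tag, and the set of part-of-speech
letters, are assumptions made for this sketch.

```python
import re
from typing import NamedTuple, Optional


class SenseTag(NamedTuple):
    version: str  # wordnet version without the dot, e.g. "30"
    lang: str     # language code, e.g. "en"
    lemma: str    # word or phrase, e.g. "seal"
    pos: str      # part of speech letter (assumed: n, v, a, r)
    sense: int    # sense number within that lemma and part of speech


# Assumed layout: wn-<version>-<lang>-<lemma>#<pos>#<sense>
_TAG_RE = re.compile(r"^wn-(\d+)-([a-z]{2,3})-(.+)#([nvar])#(\d+)$")


def format_tag(tag: SenseTag) -> str:
    """Render a SenseTag in the assumed tag layout."""
    return f"wn-{tag.version}-{tag.lang}-{tag.lemma}#{tag.pos}#{tag.sense}"


def parse_tag(text: str) -> Optional[SenseTag]:
    """Return a SenseTag, or None for an ordinary tag such as 'animal'."""
    match = _TAG_RE.match(text)
    if match is None:
        return None
    version, lang, lemma, pos, sense = match.groups()
    return SenseTag(version, lang, lemma, pos, int(sense))
```

Ordinary tags simply fail to parse, so sense tags and plain tags could
sit side by side in the same tag list.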

To do this, I am basically mining the metadata and comparing it to
wordnet --- seal_sek_.svg has the title 'seal' and the tag 'animal'.
In wordnet:
seal#9  is-a pinniped mammal
        is-a aquatic mammal
        is-a placental
        is-a mammal
        is-a vertebrate
        is-a chordate
        is-a animal

So seal#9 is the best guess.
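
The matching step can be sketched roughly as follows, in Python.  This
is only an illustration of the idea, not the actual script: the
HYPERNYMS table is a hand-written stand-in for a real wordnet lookup
(the seal#9 chain is the one listed above, the other two entries are
abbreviated from the glosses), and guess_sense is a hypothetical name.

```python
# Hand-written stand-in for a wordnet lookup (illustrative only).
HYPERNYMS = {
    "seal#3": ["fur"],
    "seal#8": ["fastener"],
    "seal#9": ["pinniped mammal", "aquatic mammal", "placental",
               "mammal", "vertebrate", "chordate", "animal"],
}


def guess_sense(title, tags, hypernyms=HYPERNYMS):
    """Pick the sense of `title` whose hypernym chain shares the most tags.

    Returns None when no candidate sense matches any tag.
    """
    lemma = title.strip().lower()
    wanted = {t.strip().lower() for t in tags if t.strip()}
    scores = {}
    for sense, chain in hypernyms.items():
        # Only senses of the title word itself are candidates.
        if sense.split("#", 1)[0] != lemma:
            continue
        scores[sense] = len(wanted.intersection(chain))
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_sense("seal", ["animal"]))
```

With the title 'seal' and the tag 'animal', only the seal#9 chain
contains a matching term, so that sense is chosen.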

There are two main advantages of knowing the sense:
(a) we can associate the image with its hypernyms
 --- someone looking for "aquatic mammal" could find "seal"

(b) we can associate the image with wordnets in other languages
--- the seal#n#9 synset is linked to "phoque" in the French wordnet,
"海豹" in the Japanese wordnet, and so on.

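Both advantages amount to indexing an image under more search terms
than its own tags.  A minimal Python sketch, assuming small hand-made
tables in place of real wordnet data (the table names and build_index
are hypothetical; the entries are the ones mentioned in this mail):

```python
from collections import defaultdict

# Illustrative tables; a real version would query the wordnets.
IMAGE_SENSES = {"seal_sek_.svg": ["seal#n#9"]}
SYNSET_HYPERNYMS = {
    "seal#n#9": ["pinniped mammal", "aquatic mammal", "placental",
                 "mammal", "vertebrate", "chordate", "animal"],
}
SYNSET_LEMMAS = {
    "seal#n#9": {"en": ["seal"], "fr": ["phoque"], "ja": ["海豹"]},
}


def build_index(image_senses, hypernyms, lemmas):
    """Map each search term to the set of images it should retrieve."""
    index = defaultdict(set)
    for image, synsets in image_senses.items():
        for synset in synsets:
            # (a) hypernyms of the sense
            terms = list(hypernyms.get(synset, []))
            # (b) words for the same synset in every language
            for words in lemmas.get(synset, {}).values():
                terms.extend(words)
            for term in terms:
                index[term.lower()].add(image)
    return index


INDEX = build_index(IMAGE_SENSES, SYNSET_HYPERNYMS, SYNSET_LEMMAS)
print(sorted(INDEX["aquatic mammal"]))
print(sorted(INDEX["phoque"]))
```
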

I have a couple of questions concerning the data:

(1) there are several easily correctable bugs in the metadata
 + I think these could be corrected without a lot of review
 - empty tags
 - author/creator in tag

what is the best way to deal with these?
(1a) submit bug reports to bugzilla
(1b) fix the files and upload them somewhere
(1c) give you a perl script that finds and fixes the bugs
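
To give an idea of what such a check might look like, here is a small
sketch (written in Python rather than perl, purely for illustration).
It assumes the tags are stored as rdf:li items under dc:subject and
the author as a dc:title inside dc:creator; that layout is an
assumption about the files, and check_tags is a hypothetical name.

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"
RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"


def check_tags(svg_text):
    """Return (clean_tags, problems) for the metadata in one SVG file.

    Assumes tags live in dc:subject//rdf:li and the author's name in
    dc:creator//dc:title; other layouts are silently ignored.
    """
    root = ET.fromstring(svg_text)

    creators = set()
    for creator in root.iter(DC + "creator"):
        for name in creator.iter(DC + "title"):
            if name.text and name.text.strip():
                creators.add(name.text.strip().lower())

    clean_tags, problems = [], []
    for subject in root.iter(DC + "subject"):
        for item in subject.iter(RDF + "li"):
            tag = (item.text or "").strip()
            if not tag:
                problems.append(("empty tag", ""))
            elif tag.lower() in creators:
                problems.append(("creator in tag", tag))
            else:
                clean_tags.append(tag)
    return clean_tags, problems
```

This version only reports problems and returns a cleaned tag list; it
does not rewrite the file, so the output could feed either a bug
report or an automatic fix.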

(2) there are some ways of automatically enhancing the metadata
 + I can fix these using wordnets, but they should be checked
 - title should be a tag (e.g. for "seal" above)
 - title is non-English (mainly Italian) and should be a tag

Again, I would like to know the best way to report/fix these
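
For the title-as-tag case, the suggestion step might look something
like the Python sketch below.  The known_words set is a hypothetical
stand-in for a lookup in the English wordnet, and both outcomes are
only suggestions that a person should still check.

```python
def suggest_title_tag(title, tags, known_words):
    """Suggest the title as a new tag if it is not already one.

    Returns None when nothing needs adding, otherwise a pair
    (tag, status) where status is "known-word" if the title is in
    known_words and "unknown-word" (possibly non-English) if not.
    """
    candidate = title.strip().lower()
    existing = {t.strip().lower() for t in tags}
    if not candidate or candidate in existing:
        return None
    status = "known-word" if candidate in known_words else "unknown-word"
    return (candidate, status)


known = {"seal", "animal"}  # illustrative stand-in for a wordnet lookup
print(suggest_title_tag("seal", ["animal"], known))
print(suggest_title_tag("foca", ["animal"], known))
```

Titles flagged "unknown-word" would then be looked up in the other
wordnets (Italian, for instance) before a tag is proposed.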

(3) I am looking at the data in release 0.18 (actually the Ubuntu
package openclipart-svg), where can I get the most recent
additions/corrections?

I would like to link as many images as possible, so would like to see
the newest collection.  Also, I don't want to waste people's time by
suggesting corrections to things that have already been corrected.
Looking through the mailing list suggests that
http://download.openclipart.org/downloads/0.19/ is probably the set of
new images, is that right?

(4) I would be happy to help write a perl script to run over new
uploads (or existing files), checking the metadata for known issues
and maybe suggesting corrections.  Would that be useful?  Is there
any way to fit it into the current upload process as an automatic
check?

(5) Finally, I would like to propose officially allowing/recommending
wordnet sense tags in the tag metadata --- Is this the right place to
do it, or should I start a wiki page describing the proposal or what?

Sorry for the long mail out of the blue!

Looking forward to 0.19,

-- 
Francis Bond <http://www2.nict.go.jp/x/x161/en/member/bond/>
NICT Computational Linguistics Group


