[Clipart] question about releases

Wolfgang Spraul wolfgang at fabricatorz.com
Sun Apr 7 20:17:10 PDT 2013


Francis,

On Sun, Mar 24, 2013 at 09:11:32PM +0800, Francis Bond wrote:
> Actually, I am using the OCAL images plus tags as data for an
> assignment for my students.  If I could get the full meta-data soon (I
> actually gave them the assignment last week, but they have four weeks
> in all) they could match with it which would make the task much more
> rewarding.

I just posted another ad hoc release in which all 41k+ graphics have their
metadata updated to reflect the tags we have in the database.

http://openclipart.org/adhoc_release_all_svgs_2013-04-07.tar.bz2
1,337,143,578 bytes, md5sum c19df9a4...

*) converted about 1500-2000 graphics from ISO-8859-1 to UTF-8. So
going forward I think we should say that all files in the library
are UTF-8.

*) resolved DTD entities in about 1500-2000 files. Those are cases
where the namespace is given as, for example, "&ns_svg;", referencing
an ENTITY. When a namespace was written that way, I could not open the
file in Inkscape. I ran xmllint --noent to resolve the references.

*) manually fixed XML issues in about 50-100 files, deleted some
others that were invalid or partial uploads.

*) the old (now overwritten) metadata was preserved in files with
.upload-metadata extension, for example
http://openclipart.org/people/rejon/rejon_Supergirl.svg.upload-metadata
I also created a tarball with all old metadata in it, just in case.
If I hear nothing much back about the metadata, I will delete the
.upload-metadata files in a week or so.
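
For anyone who wants to repeat the ISO-8859-1 to UTF-8 step on their own
copy, here is a minimal sketch in Python. It is not the script used for
this release; the function name to_utf8 is made up for illustration, and
it assumes every file is either valid UTF-8 or ISO-8859-1. A Latin-1 file
whose bytes happen to form valid UTF-8 would be left as-is.

```python
import re

# Matches the encoding pseudo-attribute of a leading XML declaration.
_DECL_ENCODING = re.compile(
    rb'^(<\?xml[^>]*?encoding=)(["\'])[A-Za-z0-9._-]+\2', re.IGNORECASE
)


def to_utf8(data: bytes) -> bytes:
    """Return data as UTF-8 bytes, assuming UTF-8 or ISO-8859-1 input."""
    try:
        text = data.decode("utf-8")       # already valid UTF-8
    except UnicodeDecodeError:
        text = data.decode("iso-8859-1")  # every byte maps to a code point
    out = text.encode("utf-8")
    # Keep the XML declaration consistent with the new encoding.
    return _DECL_ENCODING.sub(rb"\1\2UTF-8\2", out, count=1)
```

Running it a second time on its own output changes nothing, so it should
be safe to apply to the whole tree.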

Going forward the syncing between database tags and .svg files is not
yet automated, but we can run the update script every month or so.
I think it's definitely worth our time to work a bit on improving
the tags now, if more librarians feel motivated - please do so.

http://openclipart.org/tags/clipart_issue
http://openclipart.org/listnotags
http://openclipart.org/listnodescription

- case in filenames
I found 74 (=37*2) files where the filename differs only in case.
See
http://openclipart.org/filenames_with_case_diff.txt

These files are already triggering bugs in our MySQL processing,
and they would most likely cause trouble on some Windows systems
as well. Maybe going forward we should adopt a policy to not
allow multiple uploads by the same user where the only difference
in filename is case? I will go through these 74 files to see whether
they are duplicates, then either pick a winner or rename the second
one to _2 or so.
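
A rough sketch of how such collisions can be found, in case someone wants
to check a local copy. This is not the query I used; the function name
and the sample paths in the usage note are made up. It assumes the input
is a list of paths that include the user directory, so two files only
collide when they belong to the same user.

```python
from collections import defaultdict


def case_collisions(filenames):
    """Group filenames that are identical once case is ignored."""
    groups = defaultdict(list)
    for name in filenames:
        groups[name.casefold()].append(name)
    # Keep only keys that are shared by more than one distinct spelling.
    return {
        key: sorted(set(names))
        for key, names in groups.items()
        if len(set(names)) > 1
    }
```

For example, case_collisions(["people/a/Cat.svg", "people/a/cat.svg"])
returns one group containing both spellings, while a list with no such
pairs returns an empty dict.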

- case in tags
There are about 33,000 different tags in use. What do people think
about lower-casing all A-Z characters in the tags? That way "Car"
and "car" would become the same tag.
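
To make the proposal concrete, here is a minimal sketch of the
lower-casing. It only touches A-Z and leaves every other character
alone, which is my reading of the proposal above. The names
normalize_tag and merge_tags are invented for this example, and the
input is assumed to be a mapping from tag to use count.

```python
import string
from collections import Counter

# Translation table that maps only ASCII A-Z to a-z.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def normalize_tag(tag: str) -> str:
    """Lower-case the ASCII letters of a tag, leaving other characters."""
    return tag.translate(_ASCII_LOWER)


def merge_tags(counts):
    """Merge use counts of tags that differ only in ASCII case."""
    merged = Counter()
    for tag, n in counts.items():
        merged[normalize_tag(tag)] += n
    return merged
```

Using str.lower() instead would also change non-ASCII capitals, which may
or may not be what we want for tags in other languages.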

- xml
Do we have some XML experts here who have preferences with regard to
DTDs, inkscape/sodipodi/adobe namespaces, etc.?
Some files use a DTD, some don't. There are tools such
as xmlstarlet, xmllint, scour, or even Inkscape we could use
to clean up/standardize the XML.
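
As one more option next to those tools, Python's standard library parser
also expands entities declared in a file's internal DTD subset, such as
the "&ns_svg;" case mentioned earlier. The sketch below is only an
illustration, not a proposal for the actual cleanup: the sample document
and the function name are made up, and ElementTree drops the DOCTYPE and
any comments when it re-serializes, which may be more than we want.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Hypothetical sample: the namespace is given through a DTD entity.
SAMPLE = """<!DOCTYPE svg [
  <!ENTITY ns_svg "http://www.w3.org/2000/svg">
]>
<svg xmlns="&ns_svg;"><title>demo</title></svg>"""


def resolve_entities(xml_text: str) -> str:
    """Parse xml_text and serialize it again with entities expanded."""
    # Serialize the SVG namespace as the default one (global setting).
    ET.register_namespace("", SVG_NS)
    root = ET.fromstring(xml_text)  # expat expands internal entities
    return ET.tostring(root, encoding="unicode")
```

Calling resolve_entities(SAMPLE) gives a document in which the xmlns
attribute carries the literal namespace URI instead of the entity
reference.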

So much for today, enjoy springtime.

