[Clipart] clipart to cchost

tom whiteley tom.whiteley at gmail.com
Wed Feb 7 02:13:03 PST 2007


You've probably thought of this, but you could run md5sum on the
files to identify exact matches, then store the md5sums in a
database table for fast lookup.
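
A minimal sketch of that idea in Python, assuming a SQLite table is
acceptable as the lookup store. The table name `file_hashes` and the
function names are made up for illustration; none of this is part of
ccHost.

```python
import hashlib
import sqlite3


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def open_hash_db(db_path=":memory:"):
    """Open (or create) the hypothetical lookup table, indexed by MD5."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS file_hashes ("
        "md5 TEXT NOT NULL, path TEXT NOT NULL UNIQUE)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_md5 ON file_hashes (md5)")
    return conn


def register_file(conn, path):
    """Store the file's MD5 and return paths already stored with the same MD5."""
    md5 = md5_of_file(path)
    rows = conn.execute(
        "SELECT path FROM file_hashes WHERE md5 = ? AND path != ?", (md5, path)
    ).fetchall()
    conn.execute(
        "INSERT OR IGNORE INTO file_hashes (md5, path) VALUES (?, ?)", (md5, path)
    )
    conn.commit()
    return [row[0] for row in rows]
```

An upload hook could call `register_file` and flag the upload for
review whenever the returned list is not empty.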

This would probably be more effective for binary files than for
SVG, where the file contents can differ without the image changing
visually (though file size comparisons have the same problem).
Of course, you could convert to PNG first...
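
To illustrate the SVG problem, here is a small Python sketch. Note
that it does not do the PNG conversion; it swaps in XML
canonicalization from the standard library (Python 3.8+) as a cheaper
normalization step, which is my assumption and not something tested
against the real collection. It only evens out whitespace and
attribute order, so files that draw the same image with different
markup would still get different hashes. The function names and the
two sample documents are made up.

```python
import hashlib
from xml.etree.ElementTree import canonicalize


def raw_md5(svg_text):
    """MD5 of the SVG source exactly as written."""
    return hashlib.md5(svg_text.encode("utf-8")).hexdigest()


def canonical_md5(svg_text):
    """MD5 of the C14N 2.0 form, with whitespace-only text stripped."""
    canonical = canonicalize(svg_text, strip_text=True)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


# Same drawing, different indentation and attribute order.
svg_a = '<svg xmlns="http://www.w3.org/2000/svg"><rect width="10" height="5"/></svg>'
svg_b = '''<svg xmlns="http://www.w3.org/2000/svg">
  <rect height="5" width="10"/>
</svg>'''

print(raw_md5(svg_a) == raw_md5(svg_b))              # raw hashes differ
print(canonical_md5(svg_a) == canonical_md5(svg_b))  # canonical hashes match
```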

This should reduce the number of human reviews required.
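
As a rough sketch of how that could cut down the review work, the
Python below walks a local copy of the collection, uses file size as
a cheap first filter, hashes only the files that share a size, and
returns just the groups of byte-identical files. The function name
and directory layout are assumptions for illustration.

```python
import hashlib
import os
from collections import defaultdict


def find_exact_duplicates(root):
    """Return sorted groups of paths under root with identical contents."""
    # Pass 1: group by size, since files of different sizes cannot match.
    by_size = defaultdict(list)
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    # Pass 2: hash only the files whose size is shared with another file.
    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        for path in paths:
            with open(path, "rb") as handle:
                by_hash[hashlib.md5(handle.read()).hexdigest()].append(path)

    return [sorted(group) for group in by_hash.values() if len(group) > 1]
```

Only the returned groups would need a human to look at them, and
even those are byte-identical, so the check is quick.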

Mind you, I've no idea whether this is easy to implement in the
current framework, so it's just a suggestion.

Tom.

On 2/6/07, Jon Phillips <jon at rejon.org> wrote:
> On Thu, 2007-01-04 at 14:58 +0200, Nicu Buculei (OCAL) wrote:
> > momo wrote:
> > > I'm not quite sure that this is the simplest way. In CCHOST there are
> > > still no thumbnails and no library browser (lots of thumbnails on one
> > > page), so chasing duplicates will be a very long and difficult task
> > > because it will require opening every single SVG file and remembering
> > > whether it resembles something already listed in the library.
> > > I have already been in a situation like this in the everyday cleaning
> > > process I'm involved in, when an uploader (Machovka) reuploaded several
> > > duplicates of cliparts he uploaded a week before. I was lucky to
> > > remember that I already saw these cliparts before, and managed to find
> > > and delete the duplicates.
> > > Unfortunately, CCHOST doesn't have tools to easily locate duplicates
> > > (no thumbnails and no thumbnail browser).
> > > This is actually one of the reasons why I wanted to clean the 0.18
> > > collection (and entries that came after this release) locally (on my
> > > computer) before importing it to CCHOST. Other reasons are:
> > > - no need to reupload cleaned/improved files (uploading files to cchost
> > > is painfully slow and sometimes uploads even fail...)
> > > - faster cleaning (no need to browse the uploaded 0.18 collection
> > > online, just browse it locally)
> > > - less traffic on the server
> >
> > I think it is a killer to try to visually identify and clean duplicates
> > for thousands of images. It should be done in a scripted way: compare the
> > file sizes and maybe names (names are not reliable) and do a visual
> > comparison only for the files with matching file sizes and/or names.
>
> Yes, I agree. I'm working on this task now. Now that the migration to
> ccHost has happened, I'm working heavily on the import. Also, we really
> want to have our old content online to move forward. This is very
> important, especially to search engines, which reminds me that I will
> need to make redirects from old content to new content. Hmm, fun with
> Apache on that note ;)
>
> Jon
>
> --
> Jon Phillips
>
> San Francisco, CA
> USA PH 510.499.0894
> jon at rejon.org
> http://www.rejon.org
>
> MSN, AIM, Yahoo Chat: kidproto
> Jabber Chat: rejon at gristle.org
> IRC: rejon at irc.freenode.net
>


