[Clipart] clipart to cchost

Jon Phillips jon at rejon.org
Wed Feb 7 10:20:26 PST 2007


On Wed, 2007-02-07 at 10:13 +0000, tom whiteley wrote:
> You've probably thought of this but you could use an md5sum on the
> files to try and identify exact matches. Then store the md5sum's in a
> database table for fast lookup.

Yes, this is a great idea for algorithmic search.
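
As a rough sketch of the "md5sums in a database table" idea, here it is in
Python with SQLite rather than ccHost's own code. The table name
`file_md5` and the functions `open_index` and `register_upload` are
made up for illustration and are not part of ccHost:

```python
import hashlib
import sqlite3


def open_index(db_path=":memory:"):
    """Open (or create) a table mapping MD5 digests to upload names."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS file_md5 ("
        " md5 TEXT NOT NULL,"
        " name TEXT NOT NULL)"
    )
    # An index on md5 keeps the lookup fast as the table grows.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_md5 ON file_md5 (md5)")
    return conn


def register_upload(conn, name, data):
    """Record an upload; return names of earlier uploads with identical bytes."""
    digest = hashlib.md5(data).hexdigest()
    rows = conn.execute(
        "SELECT name FROM file_md5 WHERE md5 = ? ORDER BY rowid", (digest,)
    ).fetchall()
    conn.execute("INSERT INTO file_md5 (md5, name) VALUES (?, ?)", (digest, name))
    conn.commit()
    return [row[0] for row in rows]
```

A non-empty return value from `register_upload` means the same bytes were
already uploaded under the listed names.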

So, we need a duplicate searcher plugin for ccHost. I think this is the
algorithm:

1.) search for same names
2.) search files by md5
3.) all duplicates are flagged for human review or acted upon
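
The three steps above could be sketched roughly like this, in Python and
assuming the uploads are reachable as plain files under one directory.
The names `md5_of_file` and `find_duplicates` and the directory layout
are assumptions for illustration, not anything ccHost provides:

```python
import hashlib
import os
from collections import defaultdict


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Group files under root by lower-cased file name and by MD5 digest.

    Returns two dicts (name_dups, md5_dups) holding only the groups with
    more than one file; these are the candidates to flag or act upon.
    """
    by_name = defaultdict(list)
    by_md5 = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            by_name[filename.lower()].append(path)  # step 1: same names
            by_md5[md5_of_file(path)].append(path)  # step 2: same md5
    # Step 3: keep only the groups that have at least two members.
    name_dups = {k: sorted(v) for k, v in by_name.items() if len(v) > 1}
    md5_dups = {k: sorted(v) for k, v in by_md5.items() if len(v) > 1}
    return name_dups, md5_dups
```

Name matches are only hints, while md5 matches are byte-identical files,
so the two result sets probably deserve different handling.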

If there is a duplicate, I wonder if we should link the new one to the
old one, or just keep the old one. I think keeping the old one is the
best method, as we want newer ones to be modifications. If we can make a
decision, then we can automate both the search and the action rather
than rely on human review. This should leave little room for ambiguity.
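
If "keep the old one" is the rule, the automated action for one group of
duplicates might look like the sketch below. It assumes each upload
comes with a comparable upload time (for example a Unix timestamp); the
function name `split_keep_oldest` is hypothetical:

```python
def split_keep_oldest(duplicates):
    """Split one duplicate group into (kept, rejected).

    duplicates is a list of (upload_time, name) pairs. The earliest
    upload is kept; ties on upload_time are broken by name.
    """
    if not duplicates:
        raise ValueError("duplicate group must not be empty")
    ordered = sorted(duplicates)
    kept = ordered[0][1]
    rejected = [name for _upload_time, name in ordered[1:]]
    return kept, rejected
```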

> This would probably be more effective for binary files rather than
> svg, which can be different without actually altering the image
> visually, (but you'd have the same problem for file size comparisons).
> Of course you could convert to png first....

I think this is a smart approach.
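
Rendering to png needs an external renderer, so as a cheaper first pass
here is a different technique, not the png conversion itself: hash the
canonical XML form of the SVG, so that differences in whitespace or
attribute order no longer change the md5. This is only a sketch; it
needs Python 3.8 or later, the name `svg_fingerprint` is made up, and it
will not catch files that look identical but are drawn with different
markup:

```python
import hashlib
import xml.etree.ElementTree as ET


def svg_fingerprint(svg_text):
    """Return an MD5 digest of the canonical (C14N 2.0) form of an SVG string."""
    # strip_text=True drops leading/trailing whitespace in text content,
    # so pure indentation differences do not affect the result.
    canonical = ET.canonicalize(svg_text, strip_text=True)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()
```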

> This should reduce the number of human reviews required.

Agree!

> Mind you, I've no idea if this is easy to implement in the current
> framework or not, so just a suggestion.

It is easily doable. The way to do it, IMO, is to make an admin feature
that runs this type of detection across the whole site.

Great idea! Now, we just have to do it...btw, did you get ccHost
installed?

JON

> Tom.
> 
> On 2/6/07, Jon Phillips <jon at rejon.org> wrote:
> > On Thu, 2007-01-04 at 14:58 +0200, Nicu Buculei (OCAL) wrote:
> > > momo wrote:
> > > > I'm not quite sure that this is the simpliest way. In CCHOST there are
> > > > still no thumbnails and no library browser (lots of thumbnails on one
> > > > page) so chasing duplicates will be a very long and difficult task
> > > > because it will require to open each single SVG file, and remember if
> > > > this file resembles to something already listed in the library.
> > > > I have already been in a situation like this in the everyday cleaning
> > > > process I'm involved in, when an uploader (Machovka) reuploaded several
> > > > duplicates of cliparts he uploaded a week before. I was lucky to
> > > > remember that I already saw these cliparts before, and managed to find
> > > > and delete the duplicates.
> > > > Unfortunately CCHOST don't have instruments to easily locate duplicates
> > > > (no thumbnails and no thumbnail browser).
> > > > This is actually one of the reasons why I wanted to clean the 0.18
> > > > collection (and entries that came after this release) locally (on my
> > > > computer) before importing it to CCHOST. Other reasons are:
> > > > - no need to reupload cleaned/improved files (uploading files to cchost
> > > > is painfully slow and sometimes uploads even fail...)
> > > > - faster cleaning (no need to browse the uploaded 0.18 collection
> > > > online, just browse it locally)
> > > > - less traffic on the server
> > >
> > > I think is a killer to visually try to identify and clean duplicates for
> > > thousands of images, it should be done in a scripted way: compare the
> > > file sizes and maybe names (names are not reliable) and do a visual
> > > comparison only for the files with matching file sizes and/or names.
> >
> > Yes, I agree. I'm working on this task now...now that migration to
> > ccHost has happened, working on import heavily...Also, we really want to
> > have our old content online to move forward. This is very very
> > important...especially, to search engines...which remind me I will need
> > to make redirects from old content to new content...hmmm, fun with
> > apache on that note ;)
> >
> > Jon
> >
> > --
> > Jon Phillips
> >
> > San Francisco, CA
> > USA PH 510.499.0894
> > jon at rejon.org
> > http://www.rejon.org
> >
> > MSN, AIM, Yahoo Chat: kidproto
> > Jabber Chat: rejon at gristle.org
> > IRC: rejon at irc.freenode.net
> >