[Clipart] [Bug 19449] New: Ridiculous number of duplicates

Francis Bond fcbond at gmail.com
Wed Jan 7 16:44:01 PST 2009


G'day,

> Created an attachment (id=21767)
>  --> (http://bugs.freedesktop.org/attachment.cgi?id=21767)
> Listing of duplicate files
>
> I installed the Windows version of the library and noticed a lot of duplicates
> when adding them to the OpenOffice.org Gallery.  I then downloaded
> openclipart-0.18-full.tar.bz2 and did a duplicate check (on Ubuntu) and found
> thousands.  The commands I used are below and were run from the clipart
> directory:
>
> find . -type f -exec md5sum '{}' \; >md5_listing.txt
> sort md5_listing.txt | uniq -d -w32 | cut -c 1-32 >md5_duplicates.txt
> grep -f md5_duplicates.txt md5_listing.txt | sort >duplicates_listing.txt

A lot of the files in your list are in fact symbolic links.  For example:

019b0629e40ea6235500c601e1fbbb01  ./decorations/sakura_01.svg
019b0629e40ea6235500c601e1fbbb01  ./plants/flowers/sakura_01.svg
019b0629e40ea6235500c601e1fbbb01  ./plants/sakura_01.svg

ls -l plants/sakura* plants/flowers/sakura_* decorations/sakura*

lrwxrwxrwx 1 root root    23 2008-04-22 12:36 decorations/sakura_01.png -> ../plants/sakura_01.png
lrwxrwxrwx 1 root root    16 2008-04-22 12:36 plants/flowers/sakura_01.png -> ../sakura_01.png
lrwxrwxrwx 1 root root    26 2008-04-22 12:36 plants/flowers/sakura_dave_pena_01.png -> ../sakura_dave_pena_01.png
-rw-r--r-- 1 root root 45074 2008-01-07 17:59 plants/sakura_01.png
-rw-r--r-- 1 root root 64197 2008-01-07 17:59 plants/sakura_dave_pena_01.png

If you ignore these, then I believe that there are no duplicates.  Did
you perhaps convert all of your symbolic links to actual files?

With your script I get no duplicates at all, but 2,234 if I include
symbolic links (i.e. omit the '-type f' option to find).
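
In case it helps to repeat the check, here is a minimal sketch that
reports duplicated regular files and counts symbolic links separately.
It assumes GNU find, md5sum and uniq (as on Ubuntu); the function names
list_duplicates and count_symlinks are made up for illustration and are
not part of any existing tool.

```shell
# Print every regular file whose MD5 sum is shared with at least one
# other regular file. 'find -type f' does not match symbolic links,
# so links to the same target are not reported as duplicates.
list_duplicates() {
    find "$1" -type f -exec md5sum '{}' + | sort | uniq -D -w32
}

# Count the symbolic links under a directory.
count_symlinks() {
    find "$1" -type l | wc -l
}
```

Run from the clipart directory, "list_duplicates ." should print nothing
if the only repeats are symbolic links, and "count_symlinks ." shows how
many links the archive contains.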

-- 
Francis Bond <http://www2.nict.go.jp/x/x161/en/member/bond/>
NICT Language Infrastructure Group
