[Clipart] [Bug 19449] New: Ridiculous number of duplicates
Francis Bond
fcbond at gmail.com
Wed Jan 7 16:44:01 PST 2009
G'day,
> Created an attachment (id=21767)
> --> (http://bugs.freedesktop.org/attachment.cgi?id=21767)
> Listing of duplicate files
>
> I installed the Windows version of the library and noticed a lot of duplicates
> when adding them to the OpenOffice.org Gallery. I then downloaded
> openclipart-0.18-full.tar.bz2 and did a duplicate check (on Ubuntu) and found
> thousands. The commands I used are below and were run from the clipart
> directory:
>
> find . -type f -exec md5sum '{}' \; >md5_listing.txt
> sort md5_listing.txt | uniq -d -w32 | cut -c 1-32 >md5_duplicates.txt
> grep -f md5_duplicates.txt md5_listing.txt | sort >duplicates_listing.txt
A lot of the files in your list are in fact symbolic links. For example:
019b0629e40ea6235500c601e1fbbb01 ./decorations/sakura_01.svg
019b0629e40ea6235500c601e1fbbb01 ./plants/flowers/sakura_01.svg
019b0629e40ea6235500c601e1fbbb01 ./plants/sakura_01.svg
ls -l plants/sakura* plants/flowers/sakura_* decorations/sakura*
lrwxrwxrwx 1 root root 23 2008-04-22 12:36 decorations/sakura_01.png -> ../plants/sakura_01.png
lrwxrwxrwx 1 root root 16 2008-04-22 12:36 plants/flowers/sakura_01.png -> ../sakura_01.png
lrwxrwxrwx 1 root root 26 2008-04-22 12:36 plants/flowers/sakura_dave_pena_01.png -> ../sakura_dave_pena_01.png
-rw-r--r-- 1 root root 45074 2008-01-07 17:59 plants/sakura_01.png
-rw-r--r-- 1 root root 64197 2008-01-07 17:59 plants/sakura_dave_pena_01.png
If you ignore these, then I believe that there are no duplicates. Did
you perhaps convert all of your symbolic links to actual files?
With your script as written I get no duplicates at all. If I include
symbolic links (i.e. omit the '-type f' option to find), I get 2,234.
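
The two runs can be compared side by side with something like the
following. This is only a rough sketch, assuming GNU find, md5sum and
uniq (the -w32 option is a GNU extension). The function name
count_duplicate_groups and its "with-links" argument are made up for
illustration and are not part of the original script. It counts groups
of identical checksums rather than individual files, so its output may
not match the 2,234 figure above.

```shell
# count_duplicate_groups DIR [with-links]
# Prints the number of distinct MD5 checksums that occur more than once
# under DIR. (Hypothetical helper, not from the original bug report.)
count_duplicate_groups() {
    dir=$1
    mode=${2:-regular}
    if [ "$mode" = "with-links" ]; then
        # Hash regular files and symbolic links; md5sum follows each link
        # to its target. Errors from dangling links or links that point
        # at directories are discarded.
        find "$dir" \( -type f -o -type l \) -exec md5sum {} + 2>/dev/null
    else
        # Regular files only: symbolic links are skipped by -type f.
        find "$dir" -type f -exec md5sum {} +
    fi | sort | uniq -d -w32 | wc -l
}

# Example, run from the clipart directory:
#   count_duplicate_groups .              # regular files only
#   count_duplicate_groups . with-links   # also count symbolic links
```

If the first call prints 0 and the second does not, the "duplicates"
are links to a single real file rather than separate copies.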
--
Francis Bond <http://www2.nict.go.jp/x/x161/en/member/bond/>
NICT Language Infrastructure Group