[Clipart] Fix for file extension issue

Jonadab the Unsightly One jonadab at bright.net
Fri Jul 2 05:50:29 PDT 2004


Bryce Harrington <bryce at bryceharrington.com> writes:

> I was calling file parse like this:
>
>     my ($fn,$dir,$ext)  = fileparse($filename, '\..*?');
>
> which basically takes everything up to the first dot and calls that the
> filename, and everything after the first dot as the extension.  Which
> works fine for things like foo.svg, or (arguably) foo.tar.gz, but fails
> horribly on things like foo-1.2.3.svg.

This is a classically hard problem, because conventions aren't
consistent.  It's similar to the problem with sorting version numbers:
Which comes first, 3.12 or 3.2?  Of course, that depends whether 3.12
is three point one two or three point twelve.  (Then there are alpha
and beta tagged onto the ends of version numbers, plus preview
releases, numbered release candidates, ...)  As you point out,
multiple filename extensions have the same kind of ambiguity, though
not to the same degree.

> A better regexp appears to be:
>
>     my ($fn,$dir,$ext)  = fileparse($filename, qr{\.[^\.]*?});
>
> so I'm switching to that (unless anyone has an idea for something
> better?)

If you require the extension to have no periods in it, the most
significant loss is the original extension for anything gzipped;
though there are other cases as well, that's the one that will matter
most.  The things most frequently gzipped are tarballs, so maybe
.tar.gz should be special-cased?  Of course then some clown will
gzip some other kind of file or use some other compressor with its own
secondary extension (e.g., bzip2).  If we wanted to be fancy we could
attempt to list common secondary extensions (gz, bz2, Z, ...) and keep
the preceding extension if the primary one is any of those.  That
might be more trouble than we really need to go to though.
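
For what it's worth, a minimal sketch of the secondary-extension idea
might look like the following.  The names split_ext and @secondary are
made up for illustration, and the suffix list is just the three
mentioned above, not a complete one.  It hands fileparse a single
combined pattern, so the two cases cannot both strip something from
the same name.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Hypothetical list of compressor suffixes; extend as needed.
my @secondary = qw(gz bz2 Z);

# One pattern with two alternatives:
#   1. ".primary.secondary" where secondary is a known compressor suffix
#   2. a single ".ext" containing no further dots
my $alt       = join '|', map { quotemeta } @secondary;
my $suffix_re = qr{\.[^.]+\.(?:$alt)|\.[^.]*};

# Returns (name, dir, ext), like fileparse itself.
sub split_ext {
    my ($filename) = @_;
    return fileparse($filename, $suffix_re);
}

# split_ext('foo-1.2.3.svg')    gives name 'foo-1.2.3', ext '.svg'
# split_ext('foo.tar.gz')       gives name 'foo',       ext '.tar.gz'
# split_ext('foo-1.2.3.tar.gz') gives name 'foo-1.2.3', ext '.tar.gz'
```

This is still only a heuristic: something like foo-1.2.gz (a versioned
file gzipped directly) would come back with the extension .2.gz.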

I was already thinking about this issue, for the upload script, and
the conclusion I came to is that I can mostly avoid it, by not parsing
filenames.  I'm assigning a new filename based on the title metadatum
(as specified at upload time, or from the embedded metadata if one is
not specified at upload time) and ultimately will be deciding the
extension based on the user's choice from the filetype dropdown list.

The upload script right now is doing extension parsing for anything
other than SVG, but eventually it will keep the extension only for
files designated as the "other" type, using designated extensions for
each of the other types, as it currently does for SVG (makes them all
into title_nn.svg).  In order to do away with the extension parsing
for tarballs, I have to make it autodetect whether they are gzipped;
it's already on my TODO list.
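
The autodetection itself should not need much.  A sketch, assuming it
is enough to check for the two-byte gzip magic number (0x1f 0x8b) at
the start of the file; looks_gzipped is a hypothetical name, not
something in the upload script today:

```perl
use strict;
use warnings;

# Returns true if the file begins with the gzip magic bytes 0x1f 0x8b.
# This only inspects the header; it does not verify that the rest of
# the file is a valid gzip stream.
sub looks_gzipped {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $got = read($fh, my $magic, 2);   # number of bytes actually read
    close $fh;
    return defined $got && $got == 2 && $magic eq "\x1f\x8b";
}
```

The script could then pick .tar.gz or .tar based on the result rather
than on whatever the uploaded filename happened to say.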

This leaves the case where the user inadvertently picks a filetype
inconsistent with the actual file uploaded, in which case we could get
a PNG image saved as foo_01.svg or somesuch, but no system is entirely
foolproof, because there's always a better fool out there somewhere.

-- 
$;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}}
split//,"ten.thgirb\@badanoj$/ --";$\=$ ;-> ();print$/




