[Clipart] [Bug 3867] Bad character encoding for submitted files

bugzilla-daemon at freedesktop.org bugzilla-daemon at freedesktop.org
Tue Aug 23 04:44:00 PDT 2005


Please do not reply to this email: if you want to comment on the bug, go to
the URL shown below and enter your comments there.

https://bugs.freedesktop.org/show_bug.cgi?id=3867

------- Additional Comments From jonadab at bright.net  2005-08-23 04:44 -------
> I was under the impression this bug was somehow addressed

No, we back-burnered it until we got the hash bug fixed.

> But an example of an apparently different type of corruption
> of non-ASCII characters is unsorted/starwalker_wilc_.svg, which contains
>    étoile

*That* should no longer be happening, since Bryce's round of changes
to Metadata.pm around the OCAL 0.14 release.  The good thing about it is,
at least it's clear what the character is supposed to be.

Does SVGscan find these?

> OK, I found the files that Nicu was referring to. They didn't make it
> into release 0.16, presumably because XML::Twig couldn't parse them
> either. Here's one:
> 
>  http://www.openclipart.org/incoming-pre-0.16/corythosaurus_mois_s_rin_01.svg
> 
> The problem with these files is that the characters have been left in 
> Latin-1 instead of being converted to UTF-8.

Yes, that is the encoding bug that we back-burnered because it was not
impacting as many files as the HASH bug.  But the HASH bug is now fixed,
so it's time to look at this one again.

I suspect most or all of the files with this bug have ended up in
the failed-files archive for the 0.16 release.  There will also be some in
the 0.17 failed-files archive, no doubt, and in the one for every release
until we fix the problem.  However, the 0.16 failed-files archive should
include all the ones from 0.13 through 0.16, and I think the ones from
before 0.13 have all been repaired and reprocessed by this point, so there
should be no need to look at anything older than the 0.16 archive.

The failed files from 0.16 can be found here:
http://openclipart.org/downloads/0.16/openclipart-0.16-failed.zip
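
To narrow down which files in that archive have this particular problem,
a quick check is whether a file's bytes decode as UTF-8 at all.  What
follows is only a minimal sketch, in Python purely for illustration; the
function names and the idea of running it over an unpacked copy of the
zip are assumptions, not an existing tool on the site:

```python
import os


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def find_non_utf8_files(directory: str) -> list:
    """List .svg files under `directory` whose bytes are not valid UTF-8.

    `directory` is assumed to be an unpacked copy of the failed-files
    archive (hypothetical location, chosen by whoever runs this).
    """
    suspects = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if name.lower().endswith(".svg"):
                path = os.path.join(root, name)
                with open(path, "rb") as fh:
                    if not is_valid_utf8(fh.read()):
                        suspects.append(path)
    return sorted(suspects)
```

Note that this would also flag a file that legitimately declares a
Latin-1 encoding in its XML declaration, and it says nothing about files
that failed for other reasons, so it is a first filter at best.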

I believe I know, too, what causes this bug:  the web browser sends the
form contents in an encoding that is neither US-ASCII nor UTF-8 (Latin-1
being a good example), but when XML::Twig parses the submitted file, it
makes the file's encoding UTF-8.  The metadata from the form then goes in
as-is, so we end up with Latin-1 bytes in a file that is otherwise UTF-8.
What we need to do, probably in getforminput, is convert the added
metadata to UTF-8 as well, before it is inserted into the metadata object.
Someone told me that I should look at the Encode module, the POD for which
can be viewed here:

http://search.cpan.org/~dankogai/Encode-2.11/Encode.pm
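
The conversion step itself is small.  Here is a minimal sketch of the
idea, in Python for illustration only (the real change would be made in
Perl with Encode); the name to_utf8 and the choice of Latin-1 as the
fallback are assumptions, since we don't actually know what encoding a
given browser used:

```python
def to_utf8(octets: bytes, assumed_encoding: str = "latin-1") -> bytes:
    """Return `octets` re-encoded as UTF-8.

    If the input already decodes as UTF-8 it is passed through unchanged
    (apart from the decode/encode round trip).  Otherwise it is assumed,
    and this is only an assumption, to be in `assumed_encoding`.
    """
    try:
        text = octets.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so fall back to the assumed legacy encoding.
        text = octets.decode(assumed_encoding)
    return text.encode("utf-8")
```

Trying UTF-8 first means input that is already correct is left alone.
The weak spot is the guess: if the form data is in some other 8-bit
encoding, the result will be valid UTF-8 but the wrong characters.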

However, this documentation is hairy and scary and contains stuff like this:
> CAVEAT: When you run $octets = encode("utf8", $string), then $octets may
> not be equal to $string. Though they both contain the same data, the utf8
> flag for $octets is always off. When you encode anything, utf8 flag of 
> the result is always off, even when it contains completely valid utf8
> string. See "The UTF-8 flag" below.

So it is obvious to me that anything I do with this module will need to be
tested thoroughly before we deploy it on the site.  However, I do not have
the ability to test it here, because I don't have a Unicode keyboard; I
don't have a way, as far as I am aware, to type non-ASCII characters.
And if I did, I wouldn't know what I was doing.
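
For what it's worth, test input does not strictly have to be typed.
Non-ASCII characters can be written as escape sequences in an ASCII-only
source file and then encoded both ways to imitate what different browsers
might send.  A minimal sketch in Python, for illustration only; the names
SAMPLE and sample_submissions are made up for the example:

```python
# U+00E9 (e with acute accent) written as an escape, so this source
# file itself contains nothing but ASCII.
SAMPLE = "\u00e9toile"


def sample_submissions(text: str) -> dict:
    """Return the same text as the bytes two different browsers might send."""
    return {
        "utf-8": text.encode("utf-8"),      # two bytes for the accented e
        "latin-1": text.encode("latin-1"),  # one byte for the accented e
    }
```

Feeding both byte strings to a test copy of the upload script would show
whether the Latin-1 one comes out the other end as proper UTF-8.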

So it has occurred to me now that what we really need to do is put up
a separate, "testing" version of the upload script, on the site, set
up to write its "uploads" to a separate directory from the main one,
so we don't get them mixed up, and to use a different file for the
upload input log.  This I can do.  Then we can play with encoding stuff
using the testing script until we figure out how to make it do what we
want, and at that point all we have to do is migrate the changes over
to the main upload script and Bob will be our uncle.

I'll call the testing script upload_test.cgi, and I'll try to get it
up this week, and we can go from there.          
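
The separation between the live and testing scripts could be as simple as
picking a different pair of paths depending on which script is running.
A rough sketch in Python, for illustration only; apart from
upload_test.cgi, every directory and file name here is hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UploadConfig:
    upload_dir: str  # where submitted files are written
    input_log: str   # where the upload input log is appended


# Hypothetical paths; the real ones are whatever the site already uses.
LIVE = UploadConfig(upload_dir="incoming", input_log="upload_input.log")
TESTING = UploadConfig(upload_dir="incoming-test",
                       input_log="upload_input_test.log")


def config_for(script_name: str) -> UploadConfig:
    """Pick the testing paths only when running as upload_test.cgi."""
    if script_name.endswith("upload_test.cgi"):
        return TESTING
    return LIVE
```

Keeping the two sets of paths side by side like this makes it harder for
a test upload to land in the main directory by accident.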
     
     