[Clipart] character coding

Stephen Silver ocalocal at btinternet.com
Mon Feb 7 02:57:44 PST 2005


Jonadab the Unsightly One wrote:

> More to the point, how can the script detect what encoding the
> information it's receiving is encoded in, short of asking the user?

Where is the upload script receiving the RDF data from?  My impression
of the way things are supposed to work was that the script first
writes the file (unchanged) to disk, then attempts to read it using
SVG::Metadata, which in turn uses XML::Twig, which ought to detect the
character encoding (probably using the procedure outlined in Appendix F
of the XML spec) and return everything in UTF-8.  So your script
shouldn't need to worry about the encoding, as it should only
ever see UTF-8, and everything should work.
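
As a rough illustration of the kind of detection Appendix F describes,
here is a minimal sketch in Python (not Perl, and not the code XML::Twig
actually runs).  The function name sniff_xml_encoding is hypothetical,
and the sketch deliberately ignores UTF-32, EBCDIC and BOM-less UTF-16,
which the full procedure also covers.

```python
import re

def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML document from its first bytes (simplified)."""
    # A byte order mark, if present, takes priority.
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if data.startswith(b"\xff\xfe") or data.startswith(b"\xfe\xff"):
        return "utf-16"
    # Otherwise look for an encoding declaration in the XML declaration.
    match = re.match(
        rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']',
        data,
    )
    if match:
        return match.group(1).decode("ascii").lower()
    # No BOM and no declaration: the XML spec says to assume UTF-8.
    return "utf-8"

def to_unicode(data: bytes) -> str:
    """Decode raw XML bytes using the sniffed encoding."""
    return data.decode(sniff_xml_encoding(data))
```

Once the bytes have been decoded this way, everything downstream can
work with a single internal representation, which is the behaviour
expected of the parser above.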

What is actually happening is that UTF-8 characters in the uploaded file
are being converted to ISO 8859-1 (or maybe ISO 8859-15).  Does anyone
have any idea where this is happening?
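
One way to narrow down where the conversion happens is to check the raw
bytes at each stage (the uploaded file on disk, the parser output, the
data as finally stored).  The sketch below is in Python purely for
illustration; looks_like_utf8 is a hypothetical helper, and the check is
only informative for data containing non-ASCII characters, since pure
ASCII is valid in both encodings.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# The same character, U+00E9, in the two encodings under suspicion.
utf8_bytes = "\u00e9".encode("utf-8")         # two bytes: 0xC3 0xA9
latin1_bytes = "\u00e9".encode("iso-8859-1")  # one byte: 0xE9

# A lone 0xE9 byte is not valid UTF-8, so data that has been
# converted to ISO 8859-1 fails the check.
still_utf8 = looks_like_utf8(b"caf" + utf8_bytes)
converted = looks_like_utf8(b"caf" + latin1_bytes)
```

The first stage at which the check stops passing is where the
conversion is taking place.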

> Alternatively:  does RDF allow for non-ASCII characters in the
> metadata to be encoded as entities?  Could we just use something along
> the lines of HTML::Entities to encode it (so that e.g. the problematic
> character in the file in question would become &eacute; or somesuch)?

I'm not sure that you can use &eacute; in XML, but you can use a numeric
character reference (in this case &#233; or &#xE9;).
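
For what it's worth, XML predefines only five named entities (lt, gt,
amp, apos and quot); any other named entity has to be declared in a DTD.
The following Python sketch, using the standard library parser rather
than XML::Twig, shows the difference.  The helper name parses and the
sample title element are made up for the example.

```python
import xml.etree.ElementTree as ET

def parses(xml_text: str) -> bool:
    """Return True if the text is accepted as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

named = "<title>caf&eacute;</title>"      # HTML-style named entity, no DTD
decimal = "<title>caf&#233;</title>"      # decimal character reference
hexadecimal = "<title>caf&#xE9;</title>"  # hexadecimal character reference

named_ok = parses(named)                  # rejected: undefined entity
decimal_text = ET.fromstring(decimal).text
hex_text = ET.fromstring(hexadecimal).text
```

Both numeric forms come back as the same character, U+00E9, whatever
encoding the file itself is stored in.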

> Wouldn't that render the character encoding basically irrelevant?

It would, but it presupposes that whatever is going to write these
numeric character references knows what encoding it is currently
storing the characters in.  If the data is always in UTF-8 at this
point, then there is no problem.
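
To make the dependency concrete, here is a small Python sketch (again
only an illustration, with a hypothetical helper to_char_refs) of what
happens when the code writing the references assumes the wrong
encoding for UTF-8 data.

```python
def to_char_refs(data: bytes, encoding: str) -> str:
    """Decode bytes with the given encoding, then replace every
    non-ASCII character with a decimal character reference."""
    text = data.decode(encoding)
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

utf8_data = "caf\u00e9".encode("utf-8")

# Correct assumption: one reference for the one character.
right = to_char_refs(utf8_data, "utf-8")

# Wrong assumption: each UTF-8 byte is treated as its own character,
# so the single character turns into two bogus references.
wrong = to_char_refs(utf8_data, "iso-8859-1")
```

With the correct assumption the result is caf&#233;; with the wrong one
it is caf&#195;&#169;, which is well-formed XML but no longer the right
text.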

-- 
Stephen Silver



