[Clipart] character coding

Jonadab the Unsightly One jonadab at bright.net
Sun Feb 6 21:01:33 PST 2005


Nicu Buculei <nicu at apsro.com> writes:

> i remember we talked several months ago about this problem and IIRC is
> generated by the upload tool.

The upload tool's approach to character encoding is completely naive:
it just assumes that whatever form it _receives_ the data in is
suitable also for _storing_ it.  (At least, that's how it's supposed
to work.)  It doesn't care whether the data is ISO-8859-15 or what.
(For the filename, it uses only 7-bit printable ASCII characters, but
it doesn't remove non-ASCII characters from the actual metadata or
change them in any way; it just represents them as underscores in the
filename.)  It does need the data to be in some encoding that encodes
certain characters the same as in ASCII -- mainly the characters that
have special significance to XML, such as < and > and / and so forth,
plus the letters in certain tags (s, v, g, r, d, f, and so forth).
But those are all 7-bit characters.  It won't be able to handle EBCDIC
data, for example, but any sane, ASCII-compatible encoding should work
just fine.  In theory.
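
A minimal sketch of the filename rule described above, in Python. The
function name and the exact set of characters kept are assumptions made
for illustration, not the upload tool's actual code; only the filename
is altered, and the metadata itself is left untouched.

```python
import string

# Hypothetical "safe" set: a conservative subset of 7-bit printable ASCII.
# The real tool may keep more characters than this.
SAFE = set(string.ascii_letters + string.digits + "._-")

def ascii_filename(title: str) -> str:
    """Replace every character outside SAFE with an underscore."""
    return "".join(ch if ch in SAFE else "_" for ch in title)

print(ascii_filename("café au lait.svg"))  # prints caf__au_lait.svg
```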

I do not off the top of my head know what character set ISO-8859-15
is, other than that I think all the ISO-8859-anything charsets are
fully ASCII-compatible in the bottom seven bits.  And it was my
understanding that UTF-8 has this property also.  So in *theory* it
should Just Work (in the sense of not making any undesired changes).
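
The ASCII-compatibility claim is easy to check with a short Python
sketch. The helper names and the sample characters are chosen here for
illustration only.

```python
# Characters that matter to XML parsing, plus a few tag letters.
MARKUP = "<>/&\"'= svgrdf"

def ascii_bytes_preserved(codec: str) -> bool:
    """True if the codec encodes the markup characters exactly as ASCII does."""
    return MARKUP.encode(codec) == MARKUP.encode("ascii")

def high_bit_only(ch: str, codec: str) -> bool:
    """True if every byte of the encoded character has the eighth bit set."""
    return all(b >= 0x80 for b in ch.encode(codec))

for codec in ("iso-8859-15", "utf-8"):
    print(codec, ascii_bytes_preserved(codec), high_bit_only("é", codec))
```

Both codecs should report True twice: the markup bytes are unchanged,
and a non-ASCII character never produces a byte that could be mistaken
for markup.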

The one problem I can think of off the top of my head that could occur
with this is if the data that the upload tool receives is not
consistent in its encoding -- e.g., if the SVG it receives is in one
encoding, and the metadata the user fills in on the form is sent by
the browser in a different encoding.  Is it possible that that is what
happened here?
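
To illustrate the mixed-encoding scenario, here is a Python sketch with
made-up sample data (the element contents are hypothetical). One piece
is encoded as ISO-8859-15 and the other as UTF-8, and the two are
stored together unchanged.

```python
# Hypothetical data: SVG body in one encoding, form metadata in another.
svg_part = "<title>café</title>".encode("iso-8859-15")
form_part = "<dc:creator>José</dc:creator>".encode("utf-8")
stored = svg_part + form_part  # what a pass-through tool would write

def decodes_as(data: bytes, codec: str) -> bool:
    """True if the bytes are valid in the given codec."""
    try:
        data.decode(codec)
        return True
    except UnicodeDecodeError:
        return False

print(decodes_as(stored, "utf-8"))   # False: the lone 0xE9 byte is not valid UTF-8
print(stored.decode("iso-8859-15"))  # decodes, but "José" comes out as "JosÃ©"
```

Neither reading of the file gets both names right, which is the kind of
damage a mismatch like this would produce.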

> is saved as ISO-8859-15 

I was unaware that the filesystem maintained character-set metadata.
What does it mean for a file to be "saved as" ISO-8859-15?  How can
you tell what character set a file uses, apart from looking at the
charset information in the XML declaration?

More to the point, how can the script detect what encoding the
information it's receiving is encoded in, short of asking the user?
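
One possible heuristic, sketched in Python. This is a guess, not a
reliable detector: the function name, the regular expression, and the
fallback value are all assumptions, and any byte string decodes without
error as ISO-8859-anything, so the fallback can never be confirmed from
the data alone.

```python
import re

# Matches an encoding pseudo-attribute in an XML declaration at the very
# start of the data (a leading byte-order mark is not handled here).
DECL = re.compile(rb'^<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')

def guess_encoding(data: bytes, fallback: str = "iso-8859-15") -> str:
    """Guess the encoding: XML declaration first, then a strict UTF-8 test."""
    m = DECL.match(data)
    if m:
        return m.group(1).decode("ascii").lower()
    try:
        data.decode("utf-8")  # strict by default; raises on invalid sequences
        return "utf-8"
    except UnicodeDecodeError:
        return fallback       # assumption: pick a site-wide default

print(guess_encoding(b'<?xml version="1.0" encoding="ISO-8859-15"?><svg/>'))
print(guess_encoding("<svg>é</svg>".encode("utf-8")))
```

The form fields carry no XML declaration, so for those only the UTF-8
test and the fallback apply.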

Maybe we need a Unicode guru.  I'm not one.

Alternatively:  does RDF allow for non-ASCII characters in the
metadata to be encoded as entities?  Could we just use something along
the lines of HTML::Entities to encode it (so that e.g. the problematic
character in the file in question would become &eacute; or somesuch)?
Wouldn't that render the character encoding basically irrelevant?
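
A rough Python equivalent of that idea, using numeric character
references, which any XML parser accepts without a DTD. The function
name is hypothetical, and the sketch assumes the text has already been
decoded to Unicode, so the incoming encoding still has to be known at
that point.

```python
from xml.sax.saxutils import escape

def to_ascii_xml_text(text: str) -> str:
    """Escape &, <, > and turn every non-ASCII character into &#NNN;."""
    escaped = escape(text)  # handles the XML-special characters
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(to_ascii_xml_text("Café & <Crème>"))
# prints Caf&#233; &amp; &lt;Cr&#232;me&gt;
```

The result is pure 7-bit ASCII, so it reads the same under any
ASCII-compatible encoding.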

-- 
$;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}}
split//,"ten.thgirb\@badanoj$/ --";$\=$ ;-> ();print$/



