[Fwd: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces]

Thu Apr 23 12:57:08 PDT 2009

Jan Matejek wrote:
> Toshio Kuratomi wrote:
>> Hey all,  if you have python developers in your distribution please let
>> them know this PEP has been proposed for python-3.1.  We want this
>> situation to get better on Unix rather than worse and I'm not sure I'm
>> understanding all the ramifications of this proposal.
>>
>> http://mail.python.org/pipermail/python-dev/2009-April/088919.html
> 
> So, what exactly should be the "ramifications" of this? And more
> importantly, how does that relate to us distributors?
> 
> This PEP proposes modification (actually, extension of existing
> functionality) of python's internal encodings. Nothing is changed for
> the outside. In fact, if done properly, most python applications (and
> all non-python ones) won't even need to know about it.
> It has been discussed on the list to a rather great extent. So far, the
> developers aren't sure if it is a good idea. And even if it did go into
> python 3.1, the change is fully backwards compatible.
> 
> am i missing something?
> 
I'll start with "why should distributions care?"  We should care because
python3 is fundamentally broken on *nix systems.  If we give feedback to
upstream about how this impacts our common userbase (people coding in
python on Linux) then we can get this fixed.  If we don't, then python3
will just keep adding code that may not fix our problem.

Next:  "What is the problem?"  Currently, python-3.x is broken on all
*nix systems WRT unicode in two places: 1) environment variables and
2) command-line arguments.  There's no way to access env vars or command
line args that are not in the default system encoding.  It is also less
than optimal for things that touch the filesystem: the normal,
string-oriented API silently skips filenames which are not in the
default system encoding.  The byte API always sees every filename, but
the programmer has to know to use it (and then has to translate into
strings as appropriate).

Here's an example:
The system encoding is UTF-8 and a directory contains two files, both
named 'ñ' (ntilde), one with the name encoded in latin-1 and one in utf-8:

LANG=en_US.utf-8 ls -b
\361  ñ

LANG=en_US.utf-8 python3
>>> import os
>>> print(os.listdir('.'))
['ñ']
>>> print(os.listdir(b'.'))
[b'\xc3\xb1', b'\xf1']

This means that there will be a lot of code that works when I write it
on my pure utf-8 system but suddenly breaks once it is distributed to
someone in China using Big5 for some filenames and utf-8 for others.
Worse, the bug reports will be pretty mysterious: "Foo-app fails to
show all my files".

In turn, this means that we, as distribution packagers, will be spending
a lot of time finding python code that looks like it should work,
patching it to use bytes, then further patching it to change the bytes
into strings when displaying to the user but retaining the bytes when
talking to the filesystem.
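
To make that patching pattern concrete, here is a minimal sketch of what
"bytes for the filesystem, strings for the user" tends to look like.  The
helper names (display_name, list_for_display) are made up for
illustration, and using utf-8 with errors="replace" for display is an
assumption, not something any particular app does:

```python
import os


def display_name(raw_name, encoding="utf-8"):
    """Return a printable string for a bytes filename (hypothetical helper).

    Bytes that cannot be decoded become U+FFFD, so the result is only
    good for showing to the user, never for talking to the filesystem.
    """
    return raw_name.decode(encoding, errors="replace")


def list_for_display(directory=b"."):
    """Yield (raw_bytes, display_text) pairs for every entry in directory.

    Passing a bytes path makes os.listdir() return bytes names, so
    filenames in the "wrong" encoding are not dropped.
    """
    for raw_name in os.listdir(directory):
        yield raw_name, display_name(raw_name)


if __name__ == "__main__":
    for raw_name, text in list_for_display(b"."):
        # Show `text` to the user; keep `raw_name` for open(), os.stat(), etc.
        print(text, repr(raw_name))
```

The point of keeping the pair together is that the display string is
lossy: once a byte has been replaced there is no way back to the real
filename, so every call that touches the filesystem has to use the bytes.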

Here's a case with env vars:

export PATH=$PATH:/home/badger/ñ:/home/badger/$'\361'
LANG=en_US.utf-8 python3
>>> import os
>>> print(os.environ['PATH'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.0/os.py", line 389, in __getitem__
    return self.data[self.keymap(key)]
KeyError: 'PATH'
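
For comparison, here is a rough sketch of the round-trip idea the
proposal is after, applied to the same PATH value.  It assumes an
interpreter that provides a "surrogateescape" error handler, which
python-3.0 does not have.  That handler maps each undecodable byte to a
lone surrogate code point rather than to the Private Use Area, so take
this as an illustration of the idea and not as the exact mechanism under
discussion.  The function names are made up:

```python
def decode_os_bytes(raw, encoding="utf-8"):
    """Decode bytes from the OS, smuggling undecodable bytes through
    as lone surrogates instead of raising or dropping them."""
    return raw.decode(encoding, errors="surrogateescape")


def encode_os_text(text, encoding="utf-8"):
    """Reverse decode_os_bytes(): escaped surrogates become the
    original bytes again."""
    return text.encode(encoding, errors="surrogateescape")


# The PATH tail from the example: one utf-8 entry, one latin-1 entry.
raw_path = b"/home/badger/\xc3\xb1:/home/badger/\xf1"

text_path = decode_os_bytes(raw_path)
entries = text_path.split(":")

for entry in entries:
    # ascii() avoids an encoding error when printing the lone surrogate.
    print(ascii(entry))

# Nothing was lost: encoding again gives back the original bytes.
assert encode_os_text(text_path) == raw_path
```

Note that the escaped string is still not safe to print or to write out
as strict utf-8, which is exactly the kind of easy mistake worth asking
upstream about.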

Ramifications to look for: How will this affect programmers?  Will they
be able to make easy mistakes that fail in corner cases that we care
about?  Will mixed encodings make things fail?  Does using the Private
Use Area conflict with the ways that we, as distributions, use the
Private Use Areas?  Will we still have to spend a lot of time submitting
patches to upstream projects for non-obvious bug reports?

-Toshio
