Richard Jones' Log: py3k - I'm excited

Sun, 07 Dec 2008

Python 3.0 that is.

As someone who has to deal with unicode and charset issues I'm really looking forward to the day I can ditch the current two string types and all the confusion and bugs that arise from them. Unicode / encoding will be so much easier to explain now!

I also learnt something new in Anthony's "what's new in Python" talk at OSDC: comprehensions are everywhere now!

>>> [chr(c) for c in (83, 80, 65, 77)]
['S', 'P', 'A', 'M']
>>> {chr(c) for c in (83, 80, 65, 77)}
{'A', 'P', 'S', 'M'}
>>> {c:chr(c) for c in (83, 80, 65, 77)}
{80: 'P', 65: 'A', 83: 'S', 77: 'M'}
>>> ''.join(chr(c) for c in (83, 80, 65, 77))
'SPAM'

That's awesome (for those new to things, that's a list comprehension, a set comprehension, a dictionary comprehension and a generator expression).

Also, kudos to Ubuntu. "apt-get install python3" worked, and more importantly didn't replace the "python" command (rather requiring "python3" to invoke it). Unfortunately the same isn't true under Windows where the installer replaces the existing ".py" file association to point to Python 3 instead of 2.6, thus breaking all installed Python scripts. Boo.

Comment by Fredrik on Sun, 07 Dec 2008

Some day, I'm going to figure out why everyone's so convinced that going from two string types (byte strings and Unicode strings) to two string types (byte buffers and Unicode strings) will somehow magically make all Unicode issues go away...

Comment by sage on Sun, 07 Dec 2008

@Fredrik

The current (2.x) string implementation has a bug. Not a code bug, but a conceptual one: the "main string type" (<type 'str'>) is treated sometimes as a string (a list of characters) and sometimes as a buffer (a list of bytes).

The problem with the current implementation is that a lot of code uses <type 'str'> as a list of characters when it should be using <type 'unicode'>. That leads to "automatic conversions", which will most likely use the "ASCII" encoding and so fail on encountering a byte in the 0x80-0xff range.
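To make the contrast concrete, here is a minimal sketch of how Python 3 treats the same mix-up. It assumes Python 3 and a byte string encoded as windows-1252; the variable names are purely illustrative. Rather than decoding implicitly with ASCII, Python 3 refuses to combine the two types at all, so the mistake shows up even when the data happens to be plain ASCII.

```python
# Python 3 sketch: bytes and text never mix implicitly.
raw = b'caf\xe9'          # bytes, assumed here to be windows-1252 encoded

try:
    raw + ''              # no silent ASCII decode is attempted
    mixed_ok = True
except TypeError:
    mixed_ok = False      # Python 3 raises TypeError instead

# The conversion has to be spelled out, encoding included.
text = raw.decode('windows-1252')
print(mixed_ok, text)
```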

In fact, we only need a list of "bytes" (as <type 'str'>) when dealing with I/O (low-level sockets, files), and usually never when using strings as lists of characters ('egg' is the letter 'e', then 'g', then 'g', whatever encoding I use for 'e' or 'g').
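As a rough sketch of that "bytes only at the edges" idea, written for Python 3: the snippet below decodes once on input, works on characters in the middle, and encodes once on output. The `shout` helper, the in-memory stream standing in for a file or socket, and the choice of UTF-8 are all assumptions made for illustration.

```python
import io

def shout(text):
    # Works purely on characters; knows nothing about encodings.
    return text.upper()

# Simulated I/O: an in-memory byte stream stands in for a file or socket.
# 'caf\u00e9' is "cafe" with an acute accent on the e.
incoming = io.BytesIO('egg caf\u00e9'.encode('utf-8'))

text = incoming.read().decode('utf-8')    # bytes -> str at the input edge
result = shout(text)                      # character-level work in between
outgoing = result.encode('utf-8')         # str -> bytes at the output edge
print(result)
```

Only the first and last steps need to know which encoding is in use.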

All right, let's just use u'...' everywhere in our code instead of '...' (except when I explicitly want byte buffers).

This will still leave some bugs, because a lot of 3rd party code (including the standard library) will use (and hand you) <type 'str'> when you expect <type 'unicode'>.

So let's just convert that 3rd party code. But that will break other code that assumes the 3rd party implementation works with <type 'str'> and not with <type 'unicode'>.

Need an example?

OK, let's examine os.listdir.

Environment: Python 2.5 on Windows.

Let's use os.listdir on a directory containing a file called "café" (note the "é" in the name: Unicode U+00E9, which under Windows with the windows-1252 encoding is also the byte 0xe9).

>>> import os
>>> print os.listdir('.')
['caf\xe9']

OK... it's <type 'str'>, not <type 'unicode'>.

Let's just try:

>>> import os
>>> for filename in os.listdir('.'):
...     print filename + u''
...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 3: ordinal not in range(128)

OK, so let's convert filename to unicode explicitly. Good! Unfortunately, this is not good, because your code will then work under Windows but fail under Linux: under Linux, os.listdir will return you a list of <type 'unicode'> ..!!??!!??!!!!

Fortunately, there is an ugly boolean in the os namespace that will tell you whether listdir will return "str" or "unicode". Good... Hey, wait! I don't code in Python just to face the kind of API mess I face when coding in C/Win32 crap!!!! I code in Python because I want a nice, easy-to-use API.

You can blame the OS implementations. Yes, you can. But the real problem is that 'str' in Python 2.x is NOT a string, it's just a byte buffer, and as such should never be used to represent a list of characters regardless of its encoding. Doing that adds EXTRA BUGS. Getting rid of them is just what Python 3.0 is all about.
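For comparison, here is a small sketch of the same os.listdir experiment under Python 3, where the type you get back follows the type of the path you pass in: a str path gives str names, a bytes path gives bytes names. The temporary directory and the file name are assumptions made so the snippet is self-contained.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one empty file with a non-ASCII name ('caf\u00e9').
    open(os.path.join(tmp, 'caf\u00e9'), 'w').close()

    as_text = os.listdir(tmp)                 # str path in, str names out
    as_bytes = os.listdir(os.fsencode(tmp))   # bytes path in, bytes names out

print(type(as_text[0]).__name__, type(as_bytes[0]).__name__)
```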

So Python 3.x won't 'make all Unicode issues magically go away', just 'make EXTRA Unicode issues magically go away'.

When I face complex Unicode issues in Python, they are almost always due to those EXTRA Unicode issues.

Comment by sage on Sun, 07 Dec 2008

@Richard Jones
"Unfortunately the same isn't true under Windows where the installer replaces the existing ".py" file association to point to Python 3 instead of 2.6, thus breaking all installed Python scripts. Boo."

Just rerun the Python 2.6 installer after the Python 3.0 installer and it will put the whole thing right.

Better still, run the Python 2.6.1 installer; that way you'll have the most up-to-date Python 2.6 release.

Comment by Tzury Bar Yochay on Sun, 07 Dec 2008

>>> [chr(c) for c in (83, 80, 65, 77)]
['S', 'P', 'A', 'M']

and

>>> ''.join(chr(c) for c in (83, 80, 65, 77))
'SPAM'

are not new at all, Am I missing something?

Comment by John M. Camara on Sun, 07 Dec 2008

It's too bad that Python 3 didn't change the file extension to .py3. This would have solved the Windows issue as well as making it obvious which code has already been ported to Python 3.

Comment by patrick on Sun, 07 Dec 2008

">>> [chr(c) for c in (83, 80, 65, 77)]
['S', 'P', 'A', 'M']

and

>>> ''.join(chr(c) for c in (83, 80, 65, 77))
'SPAM'

are not new at all, Am I missing something?"

Yes, these are old, but the point is that now you don't have to think "How should I convert my data to a form that I can use some kind of comprehension on." Now you can just think "a comprehension would make sense here" and use it on your data, regardless of the datatype.
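A small side-by-side sketch may help here. It assumes Python 3, reuses the SPAM data from the post, and compares the older "build a list or generator, then convert" spelling with the set and dictionary comprehensions; the variable names are illustrative.

```python
codes = (83, 80, 65, 77)

# Older spelling: build a list or generator, then convert it.
old_set = set([chr(c) for c in codes])
old_dict = dict((c, chr(c)) for c in codes)

# Python 3.0 spelling: the comprehension builds the container directly.
new_set = {chr(c) for c in codes}
new_dict = {c: chr(c) for c in codes}

print(old_set == new_set, old_dict == new_dict)
```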

Comment by gumuz on Mon, 15 Dec 2008

Hi, just wanted to say that the Python3 installer for windows lets you uncheck the 'make this the default python installation' option, which will leave everything intact, including filetype associations.

cheers