Problem Statement
Python is not uncommonly used for systems programs - which, for purposes of this HOWTO, we will define as programs that need to be able to handle binary data as strings without caring what its encoding is, except that ASCII characters in the data must be recognizable for purposes like string-matching when parsing textual protocols and output logs.
We wrote this HOWTO from experience on two motivating examples: SRC and reposurgeon. SRC is a lightweight version control system for single-developer projects which needs to be indifferent to the encoding of the file content and metadata it manages. Reposurgeon is an editor for version-control histories which, similarly, needs to be indifferent to how repository metadata and content is encoded.
The Python 2 to 3 language transition is rough on code like this. The main problem is the change from Python 2 strings containing uninterpreted bytes to Python 3 strings containing Unicode code points. What was simple in Python 2 becomes trickier in 3; if you’re not careful when forward-porting to 3, your code may throw sporadic and difficult-to-track encoding errors on binary data that was innocuous in Python 2 - or, worse, it may silently corrupt that data. Fortunately, this is a solvable problem, and we’ll show you how.
A subsidiary goal of the strategy we’ll demonstrate is to support forward-porting your programs so that they are ‘polyglot’ - run under both Python 3.3 and later and under legacy Python 2.7 installations (though not necessarily 2.6 or earlier).
You should not try targeting Python between 3.0 and 3.2. The 3.0 developers had not yet restored enough syntactic compatibility with 2.x in those early versions for this to be practical; among other problems, you can’t use the u"foo" prefix on your Unicode string literals. Fortunately, older distributions carrying only Python <= 3.2 defaulted "python" to Python 2, so writing code compatible with Python 2 solves the problem.
Why this is difficult
The trickiness comes from the fact that internally, Python 3 strings are sequences of Unicode code points. Data, coming in from your environment, is a byte stream. As long as all of it is true ASCII (bytes 0x00..0x7f) Python knows to turn this into a Unicode code point corresponding to ASCII - no harm, no foul. On output, this is reversed without fuss.
Thus, Python programs operating on pure ASCII text typically don’t need to be modified much from 2 to 3. For these, the few tricky bits will be changes in library names. The 2to3 translator does a pretty good job of mapping these over. Later in this HOWTO we’ll show you a technique for writing your imports that works under either 2 or 3.
Another issue that can affect even programs handling only ASCII is that the meaning of opening a file in binary mode changes in Python 3. In Python 2, binary mode (“rb”, “wb”) is identical to non-binary mode (“r”, “w”) under Unix, but changes how line endings are processed under Windows. In Python 3, something more serious happens; reads from that file object return byte-buffer objects rather than strings.
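You can see this mode difference directly. Here is a minimal sketch using a throwaway temporary file; the file contents are arbitrary:

```python
import os
import tempfile

# Create a scratch file holding some known bytes.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(b"hello\n")

# Text mode decodes the bytes into a str.
with open(path, "r") as f:
    text = f.read()

# Binary mode hands back a raw byte-buffer.
with open(path, "rb") as f:
    raw = f.read()

os.remove(path)
```

Under Python 2 both reads would yield the same str; under Python 3, `text` is a `str` and `raw` is a `bytes` object, and the two cannot be concatenated.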
(A note on terminology: to avoid ambiguities in the use of the term bytes, in this document we’ll use byte-buffer to name the Python 3 object that is a sequence of integers in the range 0..255. We will use byte string for the Python 2 character sequence object.)
You’ll actually want binary opens in order to prevent hidden encode/decode attempts from messing with your binary data. But it is likely to trigger mysterious errors when a naive lift of your Python 2 code tries to combine byte-buffer objects with strings, something Python 3 won’t let it do.
Everything changes when the binary data in your I/O meets Python 3 strings. Because Python 3 wants to turn input into a Unicode object, it needs a defined encoding for the input bytes. Its default input encoding is ‘ascii’ - which will throw a UnicodeDecodeError if it sees an input byte with the high bit set. Program fall down go boom.
There are analogous problems on output. Python can’t write non-ASCII Unicode characters without encoding them to some concrete byte-stream representation. Python’s default output encoding is also ‘ascii’, and it will throw a UnicodeEncodeError if asked to write a character outside the ASCII range.
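Both failure modes can be seen directly, without involving file streams, by decoding and encoding explicitly (the byte and code-point values here are just illustrative):

```python
# A byte with the high bit set cannot be decoded as ASCII...
data = b"caf\xe9"              # 'café' in Latin-1
try:
    data.decode("ascii")
    decode_ok = True
except UnicodeDecodeError:
    decode_ok = False

# ...and a non-ASCII code point cannot be encoded to ASCII.
try:
    "caf\xe9".encode("ascii")
    encode_ok = True
except UnicodeEncodeError:
    encode_ok = False
```

The same exceptions surface, often far from their cause, when implicit decoding or encoding happens inside file reads and writes.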
When in doubt about what encoding is set for a file object, you can ask it. File objects have a read-only ‘encoding’ member; it defaults to None, which means the Python default encoding, ‘ascii’, is in effect. (This is not specified in the Python documentation, but is in the official Unicode HOWTO.)
Note, however, that the ASCII default applies only to disk files. The default sys.stdin/sys.stdout/sys.stderr streams, attached to terminals, have a different default encoding which is probably (but not necessarily) ‘utf-8’. Unless you fix this, binary I/O via shell pipes can have unexpected (and wrong!) results.
Other file-like objects, such as pipes returned by the subprocess module, give you byte streams.
Finally, Python has a separate file system encoding property that applies not to file streams but file names - which are internally Unicode in Python 3 but have to be encoded/decoded to byte representations in your operating system’s file system. This too is usually utf-8, and you almost certainly should not mess with it.
What doesn’t work
If you think like a systems programmer, your first reaction to grokking this problem is more or less “OK, I’ll re-do my Python code to use byte-buffer objects everywhere and avoid the whole encoding/decoding mess.”
This might be theoretically possible. In practice it’s too difficult for programs above trivial size.
First, every single string literal anywhere in the program would have to have a b in front of it. That’s not a huge issue in principle, but it’s a PITA. Second, you would have to avoid using sys.stdin, sys.stdout, or sys.stderr as they are; you would have to either use the underlying binary buffers directly, or construct your own binary streams (by, for example, calling os.dup on each of the standard file descriptors and using os.read and os.write directly). The risk with either of these approaches is that some other code in the Python standard library might try to use the built-in standard streams.
And there are still other issues. The Python 3 byte-buffer object is not a drop-in replacement for the Python 2 str object. The biggest issue is that it doesn’t support string formatting with either the % operator or the format method. We don’t know what drove that design decision, but it certainly isn’t helpful. (Python 3.5 later restored %-formatting for byte-buffers via PEP 461, but the format method remains absent.) Also, many parts of the Python 3 standard library expect Unicode strings and either refuse to work with byte-buffers or don’t do what you would expect with them.
Even if you could make your code work this way, you’d be going against the grain of how Python 3 is designed to work, writing visually ugly code cluttered with conversions that would cause subtle maintainability problems.
What does work
The reason that things are much easier with all ASCII data is that practically every Unicode encoding in existence maps bytes 0x00..0x7f to the corresponding code points, so byte strings and Unicode strings that contain the same all-ASCII data are basically equivalent, even semantically. What usually trips people up with non-ASCII data is that the semantic meaning of bytes in the range 0x80..0xff changes from one encoding to another.
But, thinking like a systems programmer again, for many purposes the semantic meaning of bytes 0x80..0xff doesn’t matter. All that matters is that those bytes are preserved unchanged by whatever operations are done. Typical operations like tokenizing strings, looking for markers indicating particular types of data, etc. only need to care about the meaning of bytes in the range 0x00..0x7f; bytes in the range 0x80..0xff are just along for the ride.
So the trick for beating Python 3 strings into submission is to put in encoding and decoding calls where you need to, choosing a single-byte encoding that doesn’t mutate 0x80..0xff. There are many of these; most of the Latin-{1..6} sequence (aka ISO-8859-1..10) has this property. What you do not want to do is pick utf-8 or any of the multibyte Asian encodings. Latin-1 will do fine; in fact it has an advantage over the others in memory consumption, which we’ll describe below.
This is not an entirely frictionless approach. The standard I/O streams still have to be replaced with ones that specify the encoding as Latin-1 (both coming in and going out); but now they’re not binary streams, they’re TextIOWrapper instances, just like the usual standard I/O streams, so they don’t raise the same issues as binary streams would. And since all internal data is Unicode, all the standard library APIs that expect Unicode are happy.
Also, storing data as Unicode internally does incur at least some penalty in memory usage per object. On a 64-bit Python build, our testing indicates a constant penalty of 40 bytes per object, plus a possible additional overhead depending on the length and content of the string.
(The advantage of Latin-1 as an encoding, over all the others that don’t mutate bytes in the range 0x80..0xff, is that it is the one with the least memory overhead compared to byte strings: just the constant 40-byte overhead, without any additional penalty. This is because Latin-1 is the only encoding that maps bytes 0x80..0xff to Unicode code points 128..255; this allows Python’s internal storage of Unicode strings to stay in its most memory efficient mode, where it only uses one byte of storage for each code point. All other encodings map at least some bytes to code points 256 or higher, which means Python has to use at least 2 bytes per code point internally for at least a portion of the string.)
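This storage effect can be observed with sys.getsizeof. The exact byte counts vary across Python builds, but the relative ordering should hold on any CPython 3.3+ that uses the PEP 393 flexible string representation:

```python
import sys

# Strings whose code points all fit in 0x00..0xff are stored one byte
# per character; any code point above 0xff forces at least two bytes
# per character for the whole string.
narrow = "\xff" * 1000      # highest Latin-1 code point
wide = "\u0100" * 1000      # first code point beyond Latin-1

# sys.getsizeof(wide) will come out roughly twice sys.getsizeof(narrow),
# plus the constant per-object overhead in each case.
```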
Nevertheless, the additional storage overhead implies that for extremely large datasets it may be preferable to run under Python 2 rather than 3. For this reason, and so you don’t get trivial bug reports from trailing-edge environments, it is desirable to port your code in such a way that it continues to run under Python 2.
Fortunately, compared to the effort required to put in the encoding/decoding calls, maintaining Python 2.7 compatibility is actually fairly easy. We’ll give some concrete guidance on it a bit later on.
Steps in the 2-to-3 transition
Before we do that, we should dispose of the obvious question: given all these problems, why port to Python 3 at all? The answer is simple: On January 1st, 2020, Python 2 will cease being maintained.
When Python 2 will start disappearing from common deployment environments after that date is unpredictable. There will be two steps to this process. In the first, "python" called from the command line or in a hashbang line will invoke Python 3 rather than Python 2, but "python2" will still invoke Python 2. In the second, python2 will disappear entirely.
The only guidance from the Python crew is PEP 394 — The "python" Command on Unix-Like Systems, the abstract of which reads as follows:
- python2 will refer to some version of Python 2.x.
- python3 will refer to some version of Python 3.x.
- for the time being, all distributions should ensure that python, if installed, refers to the same target as python2, unless the user deliberately overrides this or a virtual environment is active.
- however, end users should be aware that python refers to python3 on at least Arch Linux (that change is what prompted the creation of this PEP) so python should be used in the shebang line only for scripts that are source compatible with both Python 2 and 3.
- in preparation for an eventual change in the default version of Python, Python 2 only scripts should either be updated to be source compatible with Python 3 or else to use python2 in the shebang line.
There you have it. Use our techniques to achieve the polyglot state recommended in the last paragraph.
The second safest thing you can do is move to pure Python 3 now and put "python3" in all your hashbang lines. You’ll have a problem if someone (even inadvertently) feeds your script to Python 2 via a bare "python" command, though, so staying polyglot is still best.
Nobody knows when "for the time being" will be over.
Checklist of porting issues
This is a checklist of 2-to-3 porting transformations you can do while your code is still under 2.x.
Use functional print and float-valued division
Use functional Python-3-style print, and float-valued division.
from __future__ import print_function, division
2to3 will try to move your code to functional print. Beware that if you apply 2to3 patches against your sources more than once, it’s going to try to add more parens than you want.
2to3 will not try to fix up your division operations. Anywhere an API counts on a result being integral you need to change to //. An important case of this is array indices; however, these are easy to spot because Python will throw a TypeError when you try to index with a float.
This import is a good idea even if your program does not presently contain print operations or divisions. In the future, someone hacking on it might introduce them and cause subtle breakage under whichever major version of Python isn’t immediately being tested.
Fix up raise calls
In Python 3, the argument of an exception raise must be an object instance of a class derived from Exception. Python 2 is more relaxed. If your Python 2 code is raising something like a tuple (a not uncommon thing in older code) you’ll need to create a trivial wrapper class to hold that data and throw it instead. That is, something like
raise (e1, e2, e3)
needs to become
raise MyException(e1, e2, e3)
where MyException is something like this:
class MyException(Exception):
    def __init__(self, e1, e2, e3):
        self.e1 = e1
        self.e2 = e2
        self.e3 = e3
Of course, catcher code that previously looked somewhat like this
except (e1, e2, e3):
    print("Exiting: error state is %s %s %s" % (e1, e2, e3))
    sys.exit(1)
must now look like this:
except MyException as err:
    print("Exiting: error state is %s %s %s" % (err.e1, err.e2, err.e3))
    sys.exit(1)
Your file opens will want the b (binary) flag
This is a no-op under 2.x in Unix; under Windows it has implications for newline handling. As previously noted, the effect under Unix is that binary I/O returns and requires byte-buffers rather than Unicode.
Most importantly, this will prevent encode/decode errors or data mangling when you do I/O on binary data.
Also note a gotcha: pickles are binary data. Change your pickle file opens to "rb" and "wb", too.
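For instance, a minimal sketch of a round trip through a binary-mode pickle file (using a throwaway temporary file; the record contents are arbitrary):

```python
import os
import pickle
import tempfile

record = {"name": "alice", "count": 3}

fd, path = tempfile.mkstemp()
os.close(fd)

# Pickles are binary data: open with "wb", not "w".
with open(path, "wb") as f:
    pickle.dump(record, f)

# ...and read them back with "rb", not "r".
with open(path, "rb") as f:
    restored = pickle.load(f)

os.remove(path)
```

Under Python 3, opening the pickle file in text mode instead would fail with encoding or type errors.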
Encode regular expression patterns to ASCII
In regular expressions, the semantics of some character classes (\w, \W, \b, \B, \d, \D, \s and \S) change when the string to be searched is Unicode. In Python 3 you can defeat this by passing in the re.ASCII flag, but that flag does not exist in Python 2.
One change you can’t disable that way is the interpretation of \u and \U escapes, which are always interpreted in Unicode patterns and never in byte-buffer patterns.
To address all these issues, encode() the Unicode RE literal to a byte-buffer before matching or compiling, and also be sure that the data being evaluated is a byte-buffer (encoding it if necessary). Doing this keeps the semantics of your program the same across the transition, and is good practice even though systems programs are unlikely to do the sort of multi-lingual processing for which the change is an issue.
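A small illustration of the semantic difference, using an illustrative Latin-1-encodable word:

```python
import re

data = "caf\xe9"        # 'café'; \xe9 is a word character in Unicode

# Against a Unicode string, \w matches the accented character too.
unicode_match = re.search(r"\w+", data).group()

# Encoded to byte-buffers, \w reverts to ASCII-only semantics, so
# the match stops before the non-ASCII byte.
byte_match = re.search(r"\w+".encode("latin-1"),
                       data.encode("latin-1")).group()
```

Encoding both the pattern and the data keeps the byte-oriented semantics your Python 2 code was written against.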
Fix up string and bytes constructors
The str() and bytes() constructors do different things in Python 2 and Python 3. In Python 2 the bytes type is a synonym for str; in Python 3 it is not.
Under Python 2, bytes(n) behaves like str(n), returning a string representation of the integer; under Python 3 it returns an n-byte zero-filled byte-buffer. This is less likely to trip you up than the next problem…
Under Python 2, passing a byte-buffer object to str() just gives you the object back. Under Python 3 you get a string representation of the byte-buffer: thus str(b'23') yields "b'23'", not '23'.
The insidious thing about the Python 3 stringization difference is that it also applies to the implicit stringization applied by % on strings and the format method. Suspect this is happening if your regression tests fail with extraneous instances of b appearing under Python 3.
If you were coding in Python 2 it is likely you weren’t using byte-buffer objects at all (and if you were they are identical to strings), but this problem can still pop up in Python 3 because input from binary file opens and subprocess pipes comes in as byte-buffers.
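These differences are easy to demonstrate under Python 3:

```python
# str() of a byte-buffer yields its repr, b-prefix and all.
stringized = str(b"23")

# The same thing happens with implicit stringization via %s.
formatted = "%s" % b"23"

# And bytes(n) makes an n-byte zero-filled buffer, not "n".
zeros = bytes(3)
```

Under Python 2, all three of these would have produced the plain byte strings '23', '23', and '3' respectively.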
Fix up byte-buffer indexing
Because a Python 2 object returned by bytes() is just a string, indexing it yields a one-character string. A Python 3 byte-buffer, on the other hand, is a sequence of integers, and indexing it yields an integer.
This can cause various obscure errors if, for example, you index a string innocently read from a binary file or subprocess and try to combine it with another string via + or %. Under Python 2 this will work; under Python 3 you might get an error from trying to concatenate an integer with a string, or a % operation might produce an integer literal where you expected a character.
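For example, under Python 3:

```python
buf = b"abc"

# Indexing a byte-buffer yields an integer (the byte value of 'a').
first = buf[0]

# Slicing, by contrast, yields a byte-buffer in both dialects, so a
# one-element slice is a safe polyglot way to "index" a single byte.
first_slice = buf[0:1]
```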
Fix up integer division instances
In Python 2, x / y is truncating division that yields an integer. In Python 3 it’s float-valued. In both dialects // is truncating division. Change your / divisions to //.
You can also use this:
from __future__ import division
In this case, / will always be float-valued, even on Python 2.
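To illustrate the semantics that apply under Python 3 (or under Python 2 with the division __future__ import):

```python
# / is float-valued "true division"; // is floor division, which
# rounds toward negative infinity rather than simply truncating.
float_q = 7 / 2
int_q = 7 // 2
neg_q = -7 // 2
```

Note that // floors rather than truncates, so -7 // 2 is -4, not -3; this matters if your Python 2 code does integer division on negative operands.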
Fix up exception handling
In Python 2 it’s still possible to do:
raise MyException, message
try:
    # ....
except MyException, e:
    # ...
    # Do something with e.message
The code can be rewritten as:
raise MyException(message)
try:
    # ....
except MyException as e:
    # ....
    # Do something with e.args
and will work the same in Python 2 and Python 3.
Fix up reduce instances
The reduce builtin is gone in Python 3. Import functools and use functools.reduce; this works in late versions of Python 2, as well.
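The usual polyglot spelling looks like this (a minimal sketch):

```python
try:
    from functools import reduce    # Python 3, and Python 2.6+
except ImportError:
    pass                            # very old Python 2: reduce is a builtin

# Sum a list by folding + over it, starting from 0.
total = reduce(lambda a, b: a + b, [1, 2, 3, 4], 0)
```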
Fix up StringIO instances
In Python 3, the StringIO module is gone - it’s imported from io instead. To run in both versions you need to do this:
try:
    from io import StringIO
except ImportError:
    from StringIO import StringIO
Eliminate use of the string_escape codec
Python 2’s built-in "string_escape" codec does not exist in Python 3. Instead of
r = s.encode('string_escape')
write
r = repr(s)[1:-1]
The second version works under both Python 2 and Python 3.
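For example (note that this simple slice assumes repr() chose ordinary single quotes, i.e. the string contains no awkward mix of quote characters):

```python
s = "a\tb\nc"

# repr() produces an escaped form wrapped in quotes; stripping the
# first and last character leaves just the escaped text.
escaped = repr(s)[1:-1]
```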
Port from Gtk2 to Gtk3
The pygtk bindings used by Python 2 were deprecated in 2011, and at least some Linux distributions (including Ubuntu) are not packaging them for Python 3. This means that to be polyglot you need to migrate to the new GTK bindings based on object introspection, python-GI.
Follow the general directions you’ll find here. That is, begin by making these three changes if your code is not already up to date with pygtk 2.24:
- widget.window should be widget.get_window()
- container.child should be container.get_child()
- widget.flags() & gtk.REALIZED should be widget.get_realized()
Here’s one the page doesn’t list:
- get_size() calls need to become get_geometry()[2:4]
You can and should test these in place before going further.
Next, apply the pygi-convert.sh script. As the directions say, this does a surprisingly good job of source conversion. However, there is one place it seriously falls down. If you had an expose_event handler, the signature it requires has changed incompatibly and needs to be fixed.
The name of the signal has changed from "expose_event" to "draw". Where it used to take a Gdk event object it now takes a Cairo context. If you were counting on that event to give you the expose-area size, you lose. What you need to do is add another handler for the allocation event. That is, your code needs to go from a setup something like this:
def __init__(self):
    GObject.GObject.__init__(self)
    self.connect('expose_event', self.expose_event)

def expose_event(self, _unused, event, _empty=None):
    self.cr = self.window.cairo_create()
    self.cr.rectangle(
        event.area.x,
        event.area.y,
        event.area.width,
        event.area.height
    )
    self.cr.clip()
to something like this:
def __init__(self):
    GObject.GObject.__init__(self)
    self.connect('size-allocate', self.on_size_allocate)
    self.width = self.height = 0
    self.connect('draw', self.draw)

def on_size_allocate(self, _unused, allocation):
    self.width = allocation.width
    self.height = allocation.height

def draw(self, _unused, _ctx):
    self.cr = self.get_window().cairo_create()
    self.cr.rectangle(0, 0, self.width, self.height)
    self.cr.clip()
Another, minor point is that while the pygi-convert.sh script will automatically set up a Gtk import for you, it doesn’t do likewise for Gdk. So if you have, say, gtk.gdk.color_parse() calls in your code, they’ll mutate to Gdk.color_parse() but you need to add "from gi.repository import Gdk" to the import list yourself.
All these changes can be tested before you move to Python 3.
Stay away from the email package
If you’re writing a systems programming tool, it probably doesn’t need to use Python’s email package; but you should be aware that the Python 3 version of this package insists on transforming your data in various ways to make it compliant with the email-related RFCs. The Python 2.7 version of this package, while it has a similar module layout, does not transform your data in the same way as the Python 3 version; so code that ran happily under Python 2.7 can break in various hard-to-debug ways under Python 3. If for some reason you need to use this package, proceed with extreme caution when trying to port your project to Python 3.
Be aware that you may need to roll your own RNG
Given the same seed value, the Mersenne Twister implementations in Python 2 and Python 3 will yield different sequences of pseudorandom numbers.
This means that if you have a program that relies on reproducibility of random-number sequences from a seed, you will need to roll your own random number generator.
Steps to a working forward-port
Make your changes testable
First, have a decent regression- and functional-test suite for your program. If you don’t have one, write one now. This may sound like a step you can skip, but if you do…count on it that the Dread God Finagle and his mad prophet Murphy will cause you much more pain later in the process than you think you’re avoiding now.
Run through the porting-issues checklist
This procedure assumes you are starting with your shebang line as
#!/usr/bin/env python2
so you nail down which version you are testing with.
The first step is to run through the porting-issues checklist in the previous section. Doing these changes while the code is still running under Python 2 should not cause any issues under Python 3, but to be extra sure you should finish this step by temporarily changing your shebang line to #!/usr/bin/env python3 and running your tests.
Fix up imports
Run 2to3 on your program, apply the patch it generates, and (this is important) partially revert what it does so the result runs correctly under Python 2.
For example, if your program uses the Python 2 ConfigParser library, 2to3 is going to change this to the Python 3 name configparser. What you need to do is yank that out and add this code snippet just after your general imports:
try:
    import configparser
except ImportError:
    import ConfigParser as configparser
Apply a similar pattern to all the other simple library name changes; try to import the Python 3 version, and if that fails substitute in the Python 2 version.
A similar piece of magic autoadapts to the name change between Python 2 raw_input and Python 3 input. The only difference here is that Python 2 also defines input as a builtin, so to avoid colliding with it, you have to pick your own name that will point to the right function in both Python versions, and write, for example
try:
    my_input = raw_input
except NameError:
    my_input = input
and then replace all your calls to raw_input with calls to my_input.
As another example, the intern builtin function in Python 2 becomes sys.intern in Python 3. So:
if not hasattr(sys, 'intern'):
    sys.intern = intern
It’s good practice to make names look like Python 3’s where possible, as in the above example; this will minimize code churn if you ever decide to leave Python 2 support behind. However, in some cases, the spelling you have to use to keep your code running on both Python 2 and Python 3 will look like the Python 2 spelling instead of the Python 3 one. For example:
try:
    xrange
except NameError:
    xrange = range
The xrange builtin exists in Python 2 and has the same behavior as the range builtin in Python 3; but you can’t use the range spelling because range also exists as a builtin in Python 2, but it returns a list, not an opaque, immutable sequence. A similar strategy can be used with generators:
if not hasattr(itertools, 'imap'):
    itertools.imap = map
if not hasattr(itertools, 'izip'):
    itertools.izip = zip
if not hasattr(itertools, 'ifilterfalse'):
    itertools.ifilterfalse = itertools.filterfalse
map and zip are builtins in Python 2 as well as Python 3, but in Python 2 they return lists, not generators. So you have to use the itertools spelling to get the generator behavior in both versions.
Sometimes you need to be a bit more fine-grained. For example, in Python 2 getstatusoutput() is a method of the commands library module; in Python 3 there is no commands and the method moves to subprocess. You can get around this by writing
# Warning: In some Python 3 versions getstatusoutput() returns
# status incorrectly so that a nonzero exit looks like the subprocess
# was signaled! (Observed under 3.4.3; Debian bug #764848)
try:
    from subprocess import getstatusoutput
except ImportError:
    from commands import getstatusoutput
and then replacing all your commands.getstatusoutput calls with getstatusoutput.
Warning: In some Python 3 versions getstatusoutput() returns status incorrectly so that a nonzero exit looks like the subprocess was signaled! (Observed under 3.4.3; Debian bug #764848) It is likely this will not affect your program unless you are trying to distinguish between these cases.
Your objective at this stage is not yet to move fully to 3, so it’s possible you might need to back out some 2to3 patchbands that make incompatible changes relating to strings and unicode. Save these, you’ll want them for a later stage.
At the end of this step, you should have a kind of amphibian - a working 2.7 program, passing your regression tests, that does Python 3 imports when run under 3 and does binary (implicitly byte-buffer) I/O. However, this amphibian probably will not run correctly under Python 3.
The reason you wanted this as a separate step is so that you can do the next bit - actually making it run under 3 - with the serious string-vs.-unicode issues separated from the syntax tweaks and import munging that 2to3 does for you.
Fix up iterator methods
In Python 3, the next method of iterators is renamed to __next__. 2to3 does this renaming. For compatibility with Python 2, you should add a method alias
next = __next__
immediately after each transformed __next__ definition, e.g.
def next(self):
    return example_iterator(self)
should become
def __next__(self):
    return example_iterator(self)
next = __next__
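Putting it together, a complete polyglot iterator class might look like this (Countdown is a made-up example, not from the motivating projects):

```python
class Countdown:
    "Iterate from n down to 1; works under both Python 2 and Python 3."
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return self

    def __next__(self):
        if self.n <= 0:
            raise StopIteration
        value = self.n
        self.n -= 1
        return value

    next = __next__    # Python 2 compatibility alias
```

Python 3's iteration protocol calls __next__; Python 2's calls next; the class-body alias satisfies both.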
Alternative: use a helper library
six
Six is a library that helps you write Python 2 / Python 3 compatible code.
It can take care of fixing the imports, using something like
from six.moves import input
It also allows you to use metaclasses both in Python 2 and Python 3, even if the syntax between the two is completely different.
Alternatives to six
- pies is an alternative to six you may want to consider. See the pies README on github for the details.
- python-future is also interesting, since it contains tools that, contrary to 2to3, will generate Python 2 / Python 3 compatible code directly.
Fix up string/unicode mixing
Now it’s time to tweak your shebang line to #!/usr/bin/env python3
and make that work.
This is going to consist mostly of adding encode() and decode() calls to change data between string and unicode types. This is the heavy lifting in your Python 3 port. Because of the ASCII-compatible, 0x80..0xff-preserving encoding you’ve chosen, these will be no-ops under Python 2.
The art here is in doing as little work as possible. Your encode() and decode() calls should intercept your binary I/O close to where it happens, so the bulk of your code is just seeing Unicode strings.
This is also the stage at which you may need to tag some literals with a b prefix for byte-buffer. Beware: if you have a lot of these, it may mean you have not put encode/decode calls near enough to the natural choke points where your binary I/O is happening.
You may have to fix up string-to-byte concatenations as well. Again, you’ll minimize effort by moving these conversions as close to the I/O source or sink of the data as possible so that the interior computations are always done with Unicode strings.
Here is an error message you may see during conversion:
- TypeError: str does not support the buffer interface
You handed a string to a function that was expecting a byte-buffer object. A common example of this is passing a string to the write method of a file object you opened in binary mode. To fix this, encode the string value to latin-1.
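A minimal reproduction and fix, using io.BytesIO to stand in for a file opened in binary mode:

```python
import io

out = io.BytesIO()          # behaves like a file opened with "wb"
message = "caf\xe9\n"

# out.write(message) would raise a TypeError under Python 3, because
# a binary stream needs a byte-buffer. Encode at the point of I/O:
out.write(message.encode("latin-1"))

written = out.getvalue()
```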
It is worth noting that the above strategy, using encode() and decode() calls with no other checking, relies on two key properties of Python 2’s handling of byte-buffer and Unicode strings:
- In string operations, you can mix byte strings and Unicode strings, and Python 2 will silently convert between them whenever it needs to. This allows the same code to run on both Python 2 and Python 3 without having to worry about the fact that under Python 2, a single operation might be mixing byte strings and Unicode strings (for example, calling format or using the % operator with a string literal as the format and data strings that are actually Unicode).
- The str and unicode objects both have encode and decode methods (unlike Python 3, where only str has encode and only bytes has decode); a str encodes to itself, and a unicode decodes to itself. This allows you to avoid doing explicit isinstance checks to make sure you don’t call encode or decode on the wrong type of object; since your code will be mixing byte strings and Unicode strings, you won’t always be able to keep track of which type is being operated on at a particular point in your code.
If the above makes you nervous, however, there is a trick that avoids having to use Unicode at all under Python 2. Consider this snippet:
# Any encoding that preserves 0x80...0xff through round-tripping from byte
# streams to Unicode and back would do; latin-1 is the best known of these.
import io
import sys

binary_encoding = "latin-1"

if str is bytes:  # Python 2
    polystr = str
    polybytes = bytes
    polyord = ord
    polychr = str
else:  # Python 3
    def polystr(o):
        if isinstance(o, str):
            return o
        if isinstance(o, bytes):
            return str(o, encoding=binary_encoding)
        raise ValueError

    def polybytes(o):
        if isinstance(o, bytes):
            return o
        if isinstance(o, str):
            return bytes(o, encoding=binary_encoding)
        raise ValueError

    def polyord(c):
        "Polymorphic ord() function"
        if isinstance(c, str):
            return ord(c)
        else:
            return c

    def polychr(c):
        "Polymorphic chr() function"
        if isinstance(c, int):
            return chr(c)
        else:
            return c

    def polystream(stream):
        "Standard input/output wrapper factory function"
        # This ensures that the encoding of standard output and standard
        # error on Python 3 matches the binary encoding we use to turn
        # bytes to Unicode in polystr above.
        # newline="\n" ensures that Python 3 won't mangle line breaks;
        # line_buffering=True ensures that interactive command sessions
        # work as expected.
        return io.TextIOWrapper(stream.buffer,
                                encoding=binary_encoding,
                                newline="\n",
                                line_buffering=True)

    sys.stdin = polystream(sys.stdin)
    sys.stdout = polystream(sys.stdout)
    sys.stderr = polystream(sys.stderr)
Under Python 2, str and bytes both refer to the same type object, so all that happens is that polystr and polybytes are aliased to that type object. But under Python 3, the polystr and polybytes functions are the equivalent of the encode and decode calls described above. So if you use polystr whenever you want to decode incoming data, and polybytes whenever you want to encode outgoing data, then under Python 2 your code will be using byte strings everywhere; it will only do Unicode conversions under Python 3. The only thing you need to decide is what to do if these functions receive an argument that isn’t a string at all. The above functions raise an exception, which is probably what you want if you want to make sure the functions only get used for the specific purpose of string data conversion. But there might be use cases where it makes sense to do something else.
(There are also polyord() and polychr() functions in this wrapper; polyord() prevents the lossage that otherwise happens when calling ord() on an element of a byte buffer in Python 3, while polychr() prevents problems going in the opposite direction.)
Another item in this code snippet is worth noting: under Python 3, when constructing the alternate I/O streams (note that the above snippet doesn’t do anything to them under Python 2), you have to set the newline parameter to "\n", as shown, or you will have problems with the way Python handles line breaks in your data. There are actually two issues here. The first is newline translation: by default, Python 3 opens text files in “universal newlines” mode, in which it automatically translates all non-Unix newline markers it finds (i.e., DOS-style "\r\n" newlines and Mac-style "\r" newlines) into its chosen newline marker for internal operations, which is the Unix newline, "\n". Once the translation is done, there’s no way to recover the original newlines. Obviously you don’t want this default behavior.
You can stop Python from translating line breaks when reading files by passing any value for the newline parameter except None. However, when writing files, Python will translate newlines if you pass anything but a blank string '' or "\n" as the newline parameter. If you pass None, or accept Python’s default behavior, any "\n" characters will get translated to the system default line separator, os.linesep, on writing. If you pass the DOS or Mac newline, any "\n" characters will get translated to that newline. This behavior is rather counterintuitive; you might think that, if your data has all DOS newlines, you would want to tell Python that by passing newline="\r\n" when writing a file. In fact, what that will do is make Python translate every "\r\n" to "\r\r\n" when writing the file! This is because, when writing, Python just looks at the "\n", interprets it as a newline (since that’s its internal newline character, as above), and translates it to "\r\n".
Further, there is the second issue, which is string operations. If you pass Python anything but "\n" as the newline parameter to a file you open for reading, then line-related operations on that file, such as readlines() or for line in file, will break lines at markers other than "\n". However, once the incoming data from that file is stored as a Unicode string, any line-related operations on that string, such as splitlines(), will only break lines at "\n". So using anything other than "\n" as the newline parameter creates a mismatch between the way Python processes text files and the way it processes text strings.
You could conceivably try to work through all this, but it’s much better to just avoid the problem by using "\n" as the newline parameter for all files, and accepting that all your program’s internal data will use "\n" as the newline marker, in accordance with Python’s internal data model.
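The write-side translation surprise is easy to demonstrate with io.StringIO, which takes the same newline parameter as open(). This is just an illustrative sketch, not part of the porting recipe itself:

```python
import io

# newline="\r\n" translates each "\n" written to "\r\n" -- so data
# that already contains DOS newlines comes out doubled as "\r\r\n".
buf = io.StringIO(newline="\r\n")
buf.write("line\r\n")
assert buf.getvalue() == "line\r\r\n"

# newline="" (or "\n") suppresses translation on write, so the data
# passes through unchanged.
buf = io.StringIO(newline="")
buf.write("line\r\n")
assert buf.getvalue() == "line\r\n"
```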
Finally, if your program is going to be used interactively, you will want to set line_buffering=True, as shown in the code above, so that interactive sessions will work as expected.
Fix sort() calls
Remember that in Python 2 the sort() method takes an optional two-argument comparison function, but in Python 3 it instead takes a "sort key" function, which must return a sort value for each element and be passed in as the value of the keyword argument key=.
If you’re sorting numbers or strings or anything else for which there is a strict ordering in Python, your key function can just be the identity, lambda x: x. (This is actually the default behavior if you don’t specify a key.)
If you’re sorting on tuples and want the usual behavior (earliest tuple member is most significant for sort value), remember that Python uses a stable sort, so sorting on each tuple member in turn, least significant first, will give the desired effect.
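For instance, this illustrative snippet (not from SRC or reposurgeon) sorts pairs with the first member most significant by sorting on the least significant member first:

```python
# Sort by the second tuple member, then by the first. Because
# Python's sort is stable, the later (primary-key) pass preserves
# the order established by the earlier (secondary-key) pass.
pairs = [(2, "b"), (1, "b"), (2, "a"), (1, "a")]
pairs.sort(key=lambda p: p[1])   # secondary key first
pairs.sort(key=lambda p: p[0])   # primary key last
assert pairs == [(1, "a"), (1, "b"), (2, "a"), (2, "b")]
```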
Let’s say you already had a sortvalue() function for the objects you’re sorting. Then polyglot code will, at worst, look like this:
try:
    sequence.sort(lambda x, y: sortvalue(x) - sortvalue(y))
except TypeError:
    sequence.sort(key=sortvalue)
Actually, since Python 2.4, the sort() method accepts a named key= argument, so sequence.sort(key=sortvalue) alone will do.
Fix dictionary views
Let’s say you have some code like this.
my_dict = { "a" : 1 }
keys = my_dict.keys()
By default, when you run 2to3, your code will be changed to:
my_dict = { "a" : 1 }
keys = list(my_dict.keys())
This is because in Python 3, keys() returns a dictionary view, which is different from the list you get in Python 2, and is also different from the iterator you get with iterkeys() in Python 2.
But in most cases, you just want to iterate over the keys, so we recommend using 2to3 with --nofix=dict.
Be careful, though: code will blow up if you have something like:
my_dict = { "a" : 1 }
keys = my_dict.keys()
keys.sort()
That’s because dictionary views do not have a sort() method.
Instead, write something like:
my_dict = { "a" : 1 }
keys = my_dict.keys()
keys = sorted(keys)
Another gotcha is when you change the dictionary:
for key in my_dict.keys():
    if something(key):
        del my_dict[key]
Here there’s no choice but to convert to a list:
for key in list(my_dict.keys()):
    if something(key):
        del my_dict[key]
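Under Python 3 the live-view version actually fails at runtime. This small sketch shows both the failure and the list() fix; something() here is just a hypothetical stand-in predicate:

```python
def something(key):
    "Hypothetical stand-in predicate: delete every key."
    return True

my_dict = {"a": 1, "b": 2}
try:
    for key in my_dict.keys():   # iterating a live view...
        if something(key):
            del my_dict[key]     # ...while mutating the dictionary
except RuntimeError:
    pass  # Python 3 raises "dictionary changed size during iteration"
else:
    raise AssertionError("expected a RuntimeError on Python 3")

# Snapshotting the keys with list() makes the deletion safe.
my_dict = {"a": 1, "b": 2}
for key in list(my_dict.keys()):
    if something(key):
        del my_dict[key]
assert my_dict == {}
```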
Fix up exception classes
In some older (pre-2.6) versions of Python 2, in order for a class to be thrown or caught as an exception value, it had to inherit from exceptions.Exception, and the exceptions module had to be imported to support that.
In Python 3, the exceptions module does not exist. Instead, exception classes must inherit from a builtin class called BaseException.
In Python 2 versions 2.6 and later, it was possible to inherit from BaseException as in 3, but the older way of inheriting from exceptions.Exception also worked.
If you have exceptions in your imports, remove it. Then make your exception classes inherit from BaseException.
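Following that advice, a hypothetical polyglot exception class looks like this; it works unchanged under Python 2.7 and Python 3 because BaseException is a builtin in both:

```python
# Hypothetical example class: inheriting from the builtin
# BaseException means no 'import exceptions' is needed under
# either Python version.
class RecoverableError(BaseException):
    def __init__(self, msg):
        BaseException.__init__(self, msg)
        self.msg = msg

# The 'except ... as e' syntax is valid in both 2.7 and 3.
try:
    raise RecoverableError("malformed metadata")
except RecoverableError as e:
    caught = e.msg
assert caught == "malformed metadata"
```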
Wrapping up
Now you can hack the shebang line so it says #!/usr/bin/env python, confident that the program will work whether the environment defaults to Python 2 or Python 3.
Final step: tweak your regression tests so that they run twice, once under Python 2 and once under Python 3. This way you’ll avoid unpleasant surprises during later modifications.
The setpython script can be some help with this:
#!/bin/sh
#
# setpython - create a local link from 'python' to a specified version
#
# This script is used to redirect the 'python' in a shebang line to
# a specified version when running regression tests.
if [ -z "$1" ]
then
    ls -l python
elif [ $1 = "python" ]
then
    rm -f ./python
elif [ $1 = python2 -o $1 = python3 ]
then
    set -- `whereis $1`
    shift
    case $1 in
    */bin/*) ln -sf $1 ./python; echo "python -> $1";;
    *) echo "setpython: no python binary" >&2;;
    esac
else
    echo "setpython: unrecognized python version" >&2
fi
This helps you redirect a #!/usr/bin/env python shebang to your choice of a Python 2 or 3 version without actually modifying the Python code. To use it, ensure that the $PATH seen by your test scripts includes . (the current directory), so that ./python will intercept the python invocation in the shebang.
Worked examples
In the SRC repository, look at Peter Donis’s commits on 2016-02-16 and 2016-02-17. A second series by ESR (but based on Peter’s theory) adds the high-byte-preserving Latin-1 encoding for sys error streams, on 2016-02-21 just following the 1.10 tag.
Peter did most of this port while we were still working out how to systematize the process and exploit Latin-1, so the sequence of steps does not exactly match the ideal version described in the previous section.
In the reposurgeon repository, the commits beginning with “Remove some Python 2 functions that aren’t in Python 3.” on 2016-02-23 and ending with “Cleanup changes for the Python 3 port.” on 2016-02-24 are effectively all of the Python 3 port changes. There was a small amount of preparation before these and some tying up of loose ends afterward, but they show the major porting steps you will have to do pretty well.
In the deheader repository, you can see an example of how much smaller the polyglot changes get when you don’t have binary data to worry about. Look at the four changesets following the 1.3 tag. The only serious issue here was that an iteration over a dictionary had to be replaced by an iteration over its sorted keys, because the dictionary traversal order in 2 and 3 is different. For safety, the sort used in directory entries was also strengthened a little.
In the GPSD repository, the xgpsspeed test client is a worked example of the Gtk2 to Gtk3 port procedure. Most of the work is done in "Eliminate use of event argument in the xgpsspeed draw handler." (2016-03-25T00:51:50) and "xgpsspeed successfully ported to python-gi" (2016-03-25T01:03:35); the second one is where pygi-convert.sh was applied. A bit of cleanup follows.
References
Improving this HOWTO
If, when you apply this method, you solve a problem we haven’t covered, please tell us about it so we can improve the HOWTO.
The master of this document is hosted on gitlab at
git@gitlab.com:esr/practical-python-porting.git
The Python code snippets for autoadapting to Python 2 or 3 are available as standalone code files in that repository.
Change history
For fine-grained changes, look in the repository history.
- 1.0: Initial release.
- 1.1: Explain how to fix up sort calls and exception-class declarations.
- 1.2: Add a checklist item on functional print. Revised the section on exception classes. Substantial new material on division, exceptions, dictionary views, and other topics by Dimitri Merejkowsky.
- 1.3: Add deheader as a simpler example without binary-data problems.
- 1.4: Add a warning about buggy behavior of getstatusoutput() in Python 3.
- 1.5: Add advice to import float division unconditionally.
- 1.6: Details on porting from pygtk/gobject for Gtk2 to python-GI for Gtk3.
- 1.7: Added "Fix up iterator methods".
- 1.8: Added polyord() and polychr() to polystr-inclusion.py.
- 1.9: How to fix up raise calls. Simpler sort porting.
- 1.10: Replacing use of string_escape codec.
- 1.11: Pickle file opens must be binary. What to do about reduce and StringIO.
- 1.12: Reference PEP394. Rolling your own RNG may be necessary.