Problem Statement
Python is not uncommonly used for systems programs - which, for purposes of this HOWTO, we will define as programs that need to be able to handle binary data as strings without caring what its encoding is, except that ASCII characters in the data must be recognizable for purposes like string-matching when parsing textual protocols and output logs.
We wrote this HOWTO from experience on two motivating examples: SRC and reposurgeon. SRC is a lightweight version control system for single-developer projects which needs to be indifferent to the encoding of the file content and metadata it manages. Reposurgeon is an editor for version-control histories which, similarly, needs to be indifferent to how repository metadata and content is encoded.
The Python 2 to 3 language transition is rough on code like this. The main problem is the change from Python 2 strings containing uninterpreted bytes to Python 3 strings containing Unicode code points. What was simple in Python 2 becomes trickier in 3; if you’re not careful when forward-porting to 3, your code may throw sporadic and difficult-to-track encoding errors on binary data that was innocuous in Python 2 - or, worse, it may silently corrupt that data. Fortunately, this is a solvable problem, and we’ll show you how.
A subsidiary goal of the strategy we’ll demonstrate is to support forward-porting your programs so that they are ‘polyglot’ - run under both Python 3.3 and later and under legacy Python 2.7 installations (though not necessarily 2.6 or earlier).
You should not try targeting Python between 3.0 and 3.2. The 3.0 developers had not yet restored enough syntactic compatibility with 2.x in those early versions for this to be practical; among other problems, you can’t use the u"foo" prefix on your Unicode string literals. Fortunately, older distributions carrying only Python <= 3.2 defaulted "python" to Python 2, so writing code compatible with Python 2 solves the problem.
Why this is difficult
The trickiness comes from the fact that internally, Python 3 strings are sequences of Unicode code points. Data, coming in from your environment, is a byte stream. As long as all of it is true ASCII (bytes 0x00..0x7f) Python knows to turn this into a Unicode code point corresponding to ASCII - no harm, no foul. On output, this is reversed without fuss.
Thus, Python programs operating on pure ASCII text typically don’t need to be modified much from 2 to 3. For these, the few tricky bits will be changes in library names. The 2to3 translator does a pretty good job of mapping these over. Later in this HOWTO we’ll show you a technique for writing your imports that works under either 2 or 3.
Another issue that can affect even programs handling only ASCII is that the meaning of opening a file in binary mode changes in Python 3. In Python 2, binary mode (“rb”, “wb”) is identical to non-binary mode (“r”, “w”) under Unix, but changes how line endings are processed under Windows. In Python 3, something more serious happens; reads from that file object return byte-buffer objects rather than strings.
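You can see this mode difference directly. Here is a minimal sketch using a throwaway temporary file; the file contents are arbitrary:

```python
import os
import tempfile

# Create a scratch file holding some known bytes.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(b"hello\n")

# Text mode decodes the bytes into a str.
with open(path, "r") as f:
    text = f.read()

# Binary mode hands back a raw byte-buffer.
with open(path, "rb") as f:
    raw = f.read()

os.remove(path)
```

Under Python 2 both reads would yield the same str; under Python 3, `text` is a `str` and `raw` is a `bytes` object, and the two cannot be concatenated.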
(A note on terminology: to avoid ambiguities in the use of the term bytes, in this document we’ll use byte-buffer to name the Python 3 object that is a sequence of integers in the range 0..255. We will use byte string for the Python 2 character sequence object.)
You’ll actually want binary opens in order to prevent hidden encode/decode attempts from messing with your binary data. But it is likely to trigger mysterious errors when a naive lift of your Python 2 code tries to combine byte-buffer objects with strings, something Python 3 won’t let it do.
Everything changes when the binary data in your I/O meets Python 3 strings. Because Python 3 wants to turn input into a Unicode object, it needs a defined encoding for the input bytes. Its default input encoding is ‘ascii’ - which will throw a UnicodeDecodeError if it sees an input byte with the high bit set. Program fall down go boom.
There are analogous problems on output. Python can’t write non-ASCII Unicode characters without encoding them to some concrete byte-stream representation. Python’s default output encoding is also ‘ascii’, and it will throw a UnicodeEncodeError if asked to write a character outside the ASCII range.
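Both failure modes can be seen directly, without involving file streams, by decoding and encoding explicitly (the byte and code-point values here are just illustrative):

```python
# A byte with the high bit set cannot be decoded as ASCII...
data = b"caf\xe9"              # 'café' in Latin-1
try:
    data.decode("ascii")
    decode_ok = True
except UnicodeDecodeError:
    decode_ok = False

# ...and a non-ASCII code point cannot be encoded to ASCII.
try:
    "caf\xe9".encode("ascii")
    encode_ok = True
except UnicodeEncodeError:
    encode_ok = False
```

The same exceptions surface, often far from their cause, when implicit decoding or encoding happens inside file reads and writes.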
When in doubt about what encoding is set for a file object, you can ask it. File objects have a read-only ‘encoding’ member; it defaults to None, which means the Python default encoding, ‘ascii’, is in effect. (This is not specified in the Python documentation, but is in the official Unicode HOWTO.)
Note, however, that the ASCII default applies only to disk files. The default sys.stdin/sys.stdout/sys.stderr streams, attached to terminals, have a different default encoding which is probably (but not necessarily) ‘utf-8’. Unless you fix this, binary I/O via shell pipes can have unexpected (and wrong!) results.
Other file-like objects, such as pipes returned by the subprocess module, give you byte streams.
Finally, Python has a separate file system encoding property that applies not to file streams but file names - which are internally Unicode in Python 3 but have to be encoded/decoded to byte representations in your operating system’s file system. This too is usually utf-8, and you almost certainly should not mess with it.
What doesn’t work
If you think like a systems programmer, your first reaction to grokking this problem is more or less “OK, I’ll re-do my Python code to use byte-buffer objects everywhere and avoid the whole encoding/decoding mess.”
This might be theoretically possible. In practice it’s too difficult for programs above trivial size.
First, every single string literal anywhere in the program would have to have a b in front of it. That’s not a huge issue in principle, but it’s a PITA. Second, you would have to avoid using sys.stdin, sys.stdout, or sys.stderr as they are; you would have to either use the underlying binary buffers directly, or construct your own binary streams (by, for example, calling os.dup on each of the standard file descriptors and using os.read and os.write directly). The risk with either of these approaches is that some other code in the Python standard library might try to use the built-in standard streams.
And there are still other issues. The Python 3 byte-buffer object is not a drop-in replacement for the Python 2 str object. The biggest issue is that it doesn’t support string formatting with either the % operator or the format method. We don’t know what drove that design decision, but it certainly isn’t helpful. (Python 3.5 later restored %-formatting for byte-buffers via PEP 461, but the format method remains absent.) Also, many parts of the Python 3 standard library expect Unicode strings and either refuse to work with byte-buffers or don’t do what you would expect with them.
Even if you could make your code work this way, you’d be going against the grain of how Python 3 is designed to work, writing visually ugly code cluttered with conversions that would cause subtle maintainability problems.
What does work
The reason that things are much easier with all ASCII data is that practically every Unicode encoding in existence maps bytes 0x00..0x7f to the corresponding code points, so byte strings and Unicode strings that contain the same all-ASCII data are basically equivalent, even semantically. What usually trips people up with non-ASCII data is that the semantic meaning of bytes in the range 0x80..0xff changes from one encoding to another.
But, thinking like a systems programmer again, for many purposes the semantic meaning of bytes 0x80..0xff doesn’t matter. All that matters is that those bytes are preserved unchanged by whatever operations are done. Typical operations like tokenizing strings, looking for markers indicating particular types of data, etc. only need to care about the meaning of bytes in the range 0x00..0x7f; bytes in the range 0x80..0xff are just along for the ride.
So the trick for beating Python 3 strings into submission is to put in encoding and decoding calls where you need to, choosing a single-byte encoding that doesn’t mutate 0x80..0xff. There are many of these; most of the Latin-{1..6} sequence (aka ISO-8859-1..10) has this property. What you do not want to do is pick utf-8 or any of the multibyte Asian encodings. Latin-1 will do fine; in fact it has an advantage over the others in memory consumption, which we’ll describe below.
This is not an entirely frictionless approach. The standard I/O streams still have to be replaced with ones that specify the encoding as Latin-1 (both coming in and going out); but now they’re not binary streams, they’re TextIOWrapper instances, just like the usual standard I/O streams, so they don’t raise the same issues as binary streams would. And since all internal data is Unicode, all the standard library APIs that expect Unicode are happy.
Also, storing data as Unicode internally does incur at least some penalty in memory usage per object. On a 64-bit Python build, our testing indicates a constant penalty of 40 bytes per object, plus a possible additional overhead depending on the length and content of the string.
(The advantage of Latin-1 as an encoding, over all the others that don’t mutate bytes in the range 0x80..0xff, is that it is the one with the least memory overhead compared to byte strings: just the constant 40-byte overhead, without any additional penalty. This is because Latin-1 is the only encoding that maps bytes 0x80..0xff to Unicode code points 128..255; this allows Python’s internal storage of Unicode strings to stay in its most memory efficient mode, where it only uses one byte of storage for each code point. All other encodings map at least some bytes to code points 256 or higher, which means Python has to use at least 2 bytes per code point internally for at least a portion of the string.)
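This storage effect can be observed with sys.getsizeof. The exact byte counts vary across Python builds, but the relative ordering should hold on any CPython 3.3+ that uses the PEP 393 flexible string representation:

```python
import sys

# Strings whose code points all fit in 0x00..0xff are stored one byte
# per character; any code point above 0xff forces at least two bytes
# per character for the whole string.
narrow = "\xff" * 1000      # highest Latin-1 code point
wide = "\u0100" * 1000      # first code point beyond Latin-1

# sys.getsizeof(wide) will come out roughly twice sys.getsizeof(narrow),
# plus the constant per-object overhead in each case.
```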
Nevertheless, the additional storage overhead implies that for extremely large datasets it may be preferable to run under Python 2 rather than 3. For this reason, and so you don’t get trivial bug reports from trailing-edge environments, it is desirable to port your code in such a way that it continues to run under Python 2.
Fortunately, compared to the effort required to put in the encoding/decoding calls, maintaining Python 2.7 compatibility is actually fairly easy. We’ll give some concrete guidance on it a bit later on.
Steps in the 2-to-3 transition
Before we do that, we should dispose of the obvious question: given all these problems, why port to Python 3 at all? The answer is simple: On January 1st, 2020, Python 2 will cease being maintained.
When Python 2 will start disappearing from common deployment environments after that date is unpredictable. There will be two steps to this process. In the first, "python" called from the command line or in a hashbang line will invoke Python 3 rather than Python 2, but "python2" will still invoke Python 2. In the second, python2 will disappear entirely.
The only guidance from the Python crew is PEP 394 — The "python" Command on Unix-Like Systems, the abstract of which reads as follows:
- python2 will refer to some version of Python 2.x.
- python3 will refer to some version of Python 3.x.
- for the time being, all distributions should ensure that python, if installed, refers to the same target as python2, unless the user deliberately overrides this or a virtual environment is active.
- however, end users should be aware that python refers to python3 on at least Arch Linux (that change is what prompted the creation of this PEP) so python should be used in the shebang line only for scripts that are source compatible with both Python 2 and 3.
- in preparation for an eventual change in the default version of Python, Python 2 only scripts should either be updated to be source compatible with Python 3 or else to use python2 in the shebang line.
There you have it. Use our techniques to achieve the polyglot state recommended in the last paragraph.
The second safest thing you can do is move to pure Python 3 now and put "python3" in all your hashbang lines. You’ll have a problem if someone (even inadvertently) feeds your script to Python 2 via a bare "python" command, though, so staying polyglot is still best.
Nobody knows when "for the time being" will be over.
Checklist of porting issues
This is a checklist of 2-to-3 porting transformations you can do while your code is still under 2.x.
Use functional print and float-valued division
Use functional Python-3-style print, and float-valued division.
from __future__ import print_function, division
2to3 will try to move your code to functional print. Beware that if you apply 2to3 patches against your sources more than once, it’s going to try to add more parens than you want.
2to3 will not try to fix up your division operations. Anywhere an API counts on a result being integral you need to change to //. An important case of this is array indices; however, these are easy to spot because Python will throw a TypeError when you try to index with a float.
This import is a good idea even if your program does not presently contain print operations or divisions. In the future, someone hacking on it might introduce them and cause subtle breakage under whichever major version of Python isn’t immediately being tested.
Fix up raise calls
In Python 3, the argument of an exception raise must be an object instance of a class derived from Exception. Python 2 is more relaxed. If your Python 2 code is raising something like a tuple (a not uncommon thing in older code) you’ll need to create a trivial wrapper class to hold that data and throw it instead. That is, something like
raise (e1, e2, e3)
needs to become
raise MyException(e1, e2, e3)
where MyException is something like this:
class MyException(Exception):
    def __init__(self, e1, e2, e3):
        self.e1 = e1
        self.e2 = e2
        self.e3 = e3
Of course, catcher code that previously looked somewhat like this
except (e1, e2, e3):
    print("Exiting: error state is %s %s %s" % (e1, e2, e3))
    sys.exit(1)
must now look like this:
except MyException as err:
    print("Exiting: error state is %s %s %s" % (err.e1, err.e2, err.e3))
    sys.exit(1)
Your file opens will want the b (binary) flag
This is a no-op under 2.x in Unix; under Windows it has implications for newline handling. As previously noted, the effect under Unix is that binary I/O returns and requires byte-buffers rather than Unicode.
Most importantly, this will prevent encode/decode errors or data mangling when you do I/O on binary data.
Also note a gotcha: pickles are binary data. Change your pickle file opens to "rb" and "wb", too.
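For instance, a minimal sketch of a round trip through a binary-mode pickle file (using a throwaway temporary file; the record contents are arbitrary):

```python
import os
import pickle
import tempfile

record = {"name": "alice", "count": 3}

fd, path = tempfile.mkstemp()
os.close(fd)

# Pickles are binary data: open with "wb", not "w".
with open(path, "wb") as f:
    pickle.dump(record, f)

# ...and read them back with "rb", not "r".
with open(path, "rb") as f:
    restored = pickle.load(f)

os.remove(path)
```

Under Python 3, opening the pickle file in text mode instead would fail with encoding or type errors.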
Encode regular expression patterns to ASCII
In regular expressions, the semantics of some character classes (\w, \W, \b, \B, \d, \D, \s and \S) change when the string to be searched is Unicode. In Python 3 you can defeat this by passing in the re.ASCII flag, but that flag does not exist in Python 2.
One change you can’t disable that way is the interpretation of \u and \U escapes, which are always interpreted in Unicode patterns and never in byte-buffer patterns.
To address all these issues, encode() the Unicode RE literal to a byte-buffer before matching or compiling, and also be sure that the data being evaluated is a byte-buffer (encoding it if necessary). Doing this keeps the semantics of your program the same across the transition, and is good practice even though systems programs are unlikely to do the sort of multi-lingual processing for which the change is an issue.
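A small illustration of the semantic difference, using an illustrative Latin-1-encodable word:

```python
import re

data = "caf\xe9"        # 'café'; \xe9 is a word character in Unicode

# Against a Unicode string, \w matches the accented character too.
unicode_match = re.search(r"\w+", data).group()

# Encoded to byte-buffers, \w reverts to ASCII-only semantics, so
# the match stops before the non-ASCII byte.
byte_match = re.search(r"\w+".encode("latin-1"),
                       data.encode("latin-1")).group()
```

Encoding both the pattern and the data keeps the byte-oriented semantics your Python 2 code was written against.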
Fix up string and bytes constructors
The str() and bytes() constructors do different things in Python 2 and Python 3. In Python 2 the bytes type is a synonym for str; in Python 3 it is not.
Under Python 2, bytes(n) behaves like str(n), returning a string representation of the integer; under Python 3 it returns an n-byte zero-filled byte-buffer. This is less likely to trip you up than the next problem…
Under Python 2, passing a byte-buffer object to str() just gives you the object back. Under Python 3 you get a string representation of the byte-buffer: thus str(b'23') yields "b'23'", not '23'.
The insidious thing about the Python 3 stringization difference is that it also applies to the implicit stringization applied by % on strings and the format method. Suspect this is happening if your regression tests fail with extraneous instances of b appearing under Python 3.
If you were coding in Python 2 it is likely you weren’t using byte-buffer objects at all (and if you were they are identical to strings), but this problem can still pop up in Python 3 because input from binary file opens and subprocess pipes comes in as byte-buffers.
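These differences are easy to demonstrate under Python 3:

```python
# str() of a byte-buffer yields its repr, b-prefix and all.
stringized = str(b"23")

# The same thing happens with implicit stringization via %s.
formatted = "%s" % b"23"

# And bytes(n) makes an n-byte zero-filled buffer, not "n".
zeros = bytes(3)
```

Under Python 2, all three of these would have produced the plain byte strings '23', '23', and '3' respectively.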
Fix up byte-buffer indexing
Because a Python 2 object returned by bytes() is just a string, indexing it yields a one-character string. A Python 3 byte-buffer, on the other hand, is a sequence of integers, and indexing it yields an integer.
This can cause various obscure errors if, for example, you index a string innocently read from a binary file or subprocess and try to combine it with another string via + or %. Under Python 2 this will work; under Python 3 you might get an error from trying to concatenate an integer with a string, or a % operation might produce an integer literal where you expected a character.
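For example, under Python 3:

```python
buf = b"abc"

# Indexing a byte-buffer yields an integer (the byte value of 'a').
first = buf[0]

# Slicing, by contrast, yields a byte-buffer in both dialects, so a
# one-element slice is a safe polyglot way to "index" a single byte.
first_slice = buf[0:1]
```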
Fix up integer division instances
In Python 2, x / y is truncating division that yields an integer. In Python 3 it’s float-valued. In both dialects // is truncating division. Change your / divisions to //.
You can also use this:
from __future__ import division
In this case, / will always be float-valued, even on Python 2.
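To illustrate the semantics that apply under Python 3 (or under Python 2 with the division __future__ import):

```python
# / is float-valued "true division"; // is floor division, which
# rounds toward negative infinity rather than simply truncating.
float_q = 7 / 2
int_q = 7 // 2
neg_q = -7 // 2
```

Note that // floors rather than truncates, so -7 // 2 is -4, not -3; this matters if your Python 2 code does integer division on negative operands.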
Fix up exception handling
In Python 2 it’s still possible to do:
raise MyException, message
try:
    # ....
except MyException, e:
    # ...
    # Do something with e.message
The code can be rewritten as:
raise MyException(message)
try:
    # ....
except MyException as e:
    # ....
    # Do something with e.args
and will work the same in Python 2 and Python 3.
Fix up reduce instances
The reduce builtin is gone in Python 3. Import functools and use functools.reduce; this works in late versions of Python 2, as well.
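The usual polyglot spelling looks like this (a minimal sketch):

```python
try:
    from functools import reduce    # Python 3, and Python 2.6+
except ImportError:
    pass                            # very old Python 2: reduce is a builtin

# Sum a list by folding + over it, starting from 0.
total = reduce(lambda a, b: a + b, [1, 2, 3, 4], 0)
```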
Fix up StringIO instances
In Python 3, the StringIO module is gone - it’s imported from io instead. To run in both versions you need to do this:
try:
    from io import StringIO
except ImportError:
    from StringIO import StringIO
Eliminate use of the string_escape codec
Python 2’s built-in "string_escape" codec does not exist in Python 3. Instead of
r = s.encode('string_escape')
write
r = repr(s)[1:-1]
The second version works under both Python 2 and Python 3.
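For example (note that this simple slice assumes repr() chose ordinary single quotes, i.e. the string contains no awkward mix of quote characters):

```python
s = "a\tb\nc"

# repr() produces an escaped form wrapped in quotes; stripping the
# first and last character leaves just the escaped text.
escaped = repr(s)[1:-1]
```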
Port from Gtk2 to Gtk3
The pygtk bindings used by Python 2 were deprecated in 2011, and at least some Linux distributions (including Ubuntu) are not packaging them for Python 3. This means that to be polyglot you need to migrate to the new GTK bindings based on object introspection, python-GI.
Follow the general directions you’ll find here. That is, begin by making these three changes if your code is not already up to date with pygtk 2.24:
- widget.window should be widget.get_window()
- container.child should be container.get_child()
- widget.flags() & gtk.REALIZED should be widget.get_realized()
Here’s one the page doesn’t list:
- get_size() calls need to become get_geometry()[2:4]
You can and should test these in place before going further.
Next, apply the pygi-convert.sh script. As the directions say, this does a surprisingly good job of source conversion. However, there is one place it seriously falls down. If you had an expose_event handler, the signature it requires has changed incompatibly and needs to be fixed.
The name of the signal has changed from "expose_event" to "draw". Where it used to take a Gdk event object it now takes a Cairo context. If you were counting on that event to give you the expose-area size, you lose. What you need to do is add another handler for the allocation event. That is, your code needs to go from a setup something like this:
def __init__(self):
    GObject.GObject.__init__(self)
    self.connect('expose_event', self.expose_event)

def expose_event(self, _unused, event, _empty=None):
    self.cr = self.window.cairo_create()
    self.cr.rectangle(
        event.area.x,
        event.area.y,
        event.area.width,
        event.area.height
    )
    self.cr.clip()
to something like this:
def __init__(self):
    GObject.GObject.__init__(self)
    self.connect('size-allocate', self.on_size_allocate)
    self.width = self.height = 0
    self.connect('draw', self.draw)

def on_size_allocate(self, _unused, allocation):
    self.width = allocation.width
    self.height = allocation.height

def draw(self, _unused, _ctx):
    self.cr = self.get_window().cairo_create()
    self.cr.rectangle(0, 0, self.width, self.height)
    self.cr.clip()
Another, minor point is that while the pygi-convert.sh script will automatically set up a Gtk import for you, it doesn’t do likewise for Gdk. So if you have, say, gtk.gdk.color_parse() calls in your code, they’ll mutate to Gdk.color_parse() but you need to add "from gi.repository import Gdk" to the import list yourself.
All these changes can be tested before you move to Python 3.
Stay away from the email package
If you’re writing a systems programming tool, it probably doesn’t need to use Python’s email package; but you should be aware that the Python 3 version of this package insists on transforming your data in various ways to make it compliant with the email-related RFCs. The Python 2.7 version of this package, while it has a similar module layout, does not transform your data in the same way as the Python 3 version; so code that ran happily under Python 2.7 can break in various hard-to-debug ways under Python 3. If for some reason you need to use this package, proceed with extreme caution when trying to port your project to Python 3.
Be aware that you may need to roll your own RNG
Given the same seed value, the Mersenne Twister implementations in Python 2 and Python 3 will yield different sequences of pseudorandom numbers.
This means that if you have a program that relies on reproducibility of random-number sequences from a seed, you will need to roll your own random number generator.
Steps to a working forward-port
Make your changes testable
First, have a decent regression- and functional-test suite for your program. If you don’t have one, write one now. This may sound like a step you can skip, but if you do…count on it that the Dread God Finagle and his mad prophet Murphy will cause you much more pain later in the process than you think you’re avoiding now.
Run through the porting-issues checklist
This procedure assumes you are starting with your shebang line as
#!/usr/bin/env python2
so you nail down which version you are testing with.
The first step is to run through the porting-issues checklist in the previous section. Doing these changes while the code is still running under Python 2 should not cause any issues under Python 3, but to be extra sure you should finish this step by temporarily changing your shebang line to #!/usr/bin/env python3 and running your tests.
Fix up imports
Run 2to3 on your program, apply the patch it generates, and (this is important) partially revert what it does so the result runs correctly under Python 2.
For example, if your program uses the Python 2 ConfigParser library, 2to3 is going to change this to the Python 3 name configparser. What you need to do is yank that out and add this code snippet just after your general imports:
try:
    import configparser
except ImportError:
    import ConfigParser as configparser
Apply a similar pattern to all the other simple library name changes; try to import the Python 3 version, and if that fails substitute in the Python 2 version.
A similar piece of magic autoadapts to the name change between Python 2 raw_input and Python 3 input. The only difference here is that Python 2 also defines input as a builtin, so to avoid colliding with it, you have to pick your own name that will point to the right function in both Python versions, and write, for example
try:
    my_input = raw_input
except NameError:
    my_input = input
and then replace all your calls to raw_input with calls to my_input.
As another example, the intern builtin function in Python 2 becomes sys.intern in Python 3. So:
if not hasattr(sys, 'intern'):
    sys.intern = intern
It’s good practice to make names look like Python 3’s where possible, as in the above example; this will minimize code churn if you ever decide to leave Python 2 support behind. However, in some cases, the spelling you have to use to keep your code running on both Python 2 and Python 3 will look like the Python 2 spelling instead of the Python 3 one. For example:
try:
    xrange
except NameError:
    xrange = range
The xrange builtin exists in Python 2 and has the same behavior as the range builtin in Python 3; but you can’t use the range spelling because range also exists as a builtin in Python 2, but it returns a list, not an opaque, immutable sequence. A similar strategy can be used with generators:
if not hasattr(itertools, 'imap'):
    itertools.imap = map
if not hasattr(itertools, 'izip'):
    itertools.izip = zip
if not hasattr(itertools, 'ifilterfalse'):
    itertools.ifilterfalse = itertools.filterfalse
map and zip are builtins in Python 2 as well as Python 3, but in Python 2 they return lists, not generators. So you have to use the itertools spelling to get the generator behavior in both versions.
Sometimes you need to be a bit more fine-grained. For example, in Python 2 getstatusoutput() is a method of the commands library module; in Python 3 there is no commands and the method moves to subprocess. You can get around this by writing
# Warning: In some Python 3 versions getstatusoutput() returns
# status incorrectly so that a nonzero exit looks like the subprocess
# was signaled! (Observed under 3.4.3; Debian bug #764848)
try:
    from subprocess import getstatusoutput
except ImportError:
    from commands import getstatusoutput
and then replacing all your commands.getstatusoutput calls with getstatusoutput.
Warning: In some Python 3 versions getstatusoutput() returns status incorrectly so that a nonzero exit looks like the subprocess was signaled! (Observed under 3.4.3; Debian bug #764848) It is likely this will not affect your program unless you are trying to distinguish between these cases.
Your objective at this stage is not yet to move fully to 3, so it’s possible you might need to back out some 2to3 patchbands that make incompatible changes relating to strings and unicode. Save these, you’ll want them for a later stage.
At the end of this step, you should have a kind of amphibian - a working 2.7 program, passing your regression tests, that does Python 3 imports when run under 3 and does binary (implicitly byte-buffer) I/O. However, this amphibian probably will not run correctly under Python 3.
The reason you wanted this as a separate step is so that you can do the next bit - actually making it run under 3 - with the serious string-vs.-unicode issues separated from the syntax tweaks and import munging that 2to3 does for you.
Fix up iterator methods
In Python 3, the next method of iterators is renamed to __next__. 2to3 does this renaming. For compatibility with Python 2, you should add a method alias
next = __next__
immediately after each transformed __next__ definition, e.g.
def next(self):
    return example_iterator(self)
should become
def __next__(self):
    return example_iterator(self)
next = __next__
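Putting it together, a complete polyglot iterator class might look like this (Countdown is a made-up example, not from the motivating projects):

```python
class Countdown:
    "Iterate from n down to 1; works under both Python 2 and Python 3."
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return self

    def __next__(self):
        if self.n <= 0:
            raise StopIteration
        value = self.n
        self.n -= 1
        return value

    next = __next__    # Python 2 compatibility alias
```

Python 3's iteration protocol calls __next__; Python 2's calls next; the class-body alias satisfies both.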
Alternative: use a helper library
six
Six is a library that helps you write Python 2 / Python 3 compatible code.
It can take care of fixing the imports, using something like
from six.moves import input
It also allows you to use metaclasses both in Python 2 and Python 3, even if the syntax between the two is completely different.
Alternatives to six
- pies is an alternative to six you may want to consider. See the pies README on github for the details.
- python-future is also interesting, since it contains tools that, contrary to 2to3, will generate Python 2 / Python 3 compatible code directly.
Fix up string/unicode mixing
Now it’s time to tweak your shebang line to #!/usr/bin/env python3
and make that work.
This is going to consist mostly of adding encode() and decode() calls to change data between string and unicode types. This is the heavy lifting in your Python 3 port. Because of the ASCII-compatible, 0x80..0xff-preserving encoding you’ve chosen, these will be no-ops under Python 2.
The art here is in doing as little work as possible. Your encode() and decode() calls should intercept your binary I/O close to where it happens, so the bulk of your code is just seeing Unicode strings.
This is also the stage at which you may need to tag some literals with a b prefix for byte-buffer. Beware: if you have a lot of these, it may mean you have not put encode/decode calls near enough to the natural choke points where your binary I/O is happening.
You may have to fix up string-to-byte concatenations as well. Again, you’ll minimize effort by moving these conversions as close to the I/O source or sink of the data as possible so that the interior computations are always done with Unicode strings.
Here is an error message you may see during conversion:
- TypeError: str does not support the buffer interface
You handed a string to a function that was expecting a byte-buffer object. A common example of this is passing a string to the write method of a file object you opened in binary mode. To fix this, encode the string value to latin-1.
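A minimal reproduction and fix, using io.BytesIO to stand in for a file opened in binary mode:

```python
import io

out = io.BytesIO()          # behaves like a file opened with "wb"
message = "caf\xe9\n"

# out.write(message) would raise a TypeError under Python 3, because
# a binary stream needs a byte-buffer. Encode at the point of I/O:
out.write(message.encode("latin-1"))

written = out.getvalue()
```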
It is worth noting that the above strategy, using encode() and decode() calls with no other checking, relies on two key properties of Python 2’s handling of byte-buffer and Unicode strings:
- In string operations, you can mix byte strings and Unicode strings, and Python 2 will silently convert between them whenever it needs to. This allows the same code to run on both Python 2 and Python 3 without having to worry about the fact that under Python 2, a single operation might be mixing byte strings and Unicode strings (for example, calling format or using the % operator with a string literal as the format and data strings that are actually Unicode).
- The str and unicode objects both have encode and decode methods (unlike Python 3, where only str has encode and only bytes has decode); a str encodes to itself, and a unicode decodes to itself. This allows you to avoid doing explicit isinstance checks to make sure you don’t call encode or decode on the wrong type of object; since your code will be mixing byte strings and Unicode strings, you won’t always be able to keep track of which type is being operated on at a particular point in your code.
If the above makes you nervous, however, there is a trick that avoids having to use Unicode at all under Python 2. Consider this snippet:
# Any encoding that preserves 0x80...0xff through round-tripping from byte
# streams to Unicode and back would do; latin-1 is the best known of these.
import io
import sys

binary_encoding = "latin-1"

if str is bytes:  # Python 2
    polystr = str
    polybytes = bytes
    polyord = ord
    polychr = str
else:  # Python 3
    def polystr(o):
        if isinstance(o, str):
            return o
        if isinstance(o, bytes):
            return str(o, encoding=binary_encoding)
        raise ValueError

    def polybytes(o):
        if isinstance(o, bytes):
            return o
        if isinstance(o, str):
            return bytes(o, encoding=binary_encoding)
        raise ValueError

    def polyord(c):
        "Polymorphic ord() function"
        if isinstance(c, str):
            return ord(c)
        else:
            return c

    def polychr(c):
        "Polymorphic chr() function"
        if isinstance(c, int):
            return chr(c)
        else:
            return c

    def polystream(stream):
        "Standard input/output wrapper factory function"
        # This ensures that the encoding of standard output and standard
        # error on Python 3 matches the binary encoding we use to turn
        # bytes to Unicode in polystr above.
        # newline="\n" ensures that Python 3 won't mangle line breaks;
        # line_buffering=True ensures that interactive command sessions
        # work as expected.
        return io.TextIOWrapper(stream.buffer,
                                encoding=binary_encoding,
                                newline="\n",
                                line_buffering=True)

    sys.stdin = polystream(sys.stdin)
    sys.stdout = polystream(sys.stdout)
    sys.stderr = polystream(sys.stderr)
Under Python 2, str and bytes both refer to the same type object, so all that happens is that polystr and polybytes are aliased to that type object. But under Python 3, the polystr and polybytes functions are the equivalent of the encode and decode calls described above. So if you use polystr whenever you want to decode incoming data, and polybytes whenever you want to encode outgoing data, then under Python 2 your code will be using byte strings everywhere; it will only do Unicode conversions under Python 3. The only thing you need to decide is what to do if these functions receive an argument that isn’t a string at all. The above functions raise an exception, which is probably what you want if you want to make sure the functions only get used for the specific purpose of string data conversion. But there might be use cases where it makes sense to do something else.
(There are also polyord() and polychr() functions in this wrapper; polyord() prevents the lossage that otherwise happens when calling ord() on an element of a byte buffer in Python 3, while polychr() prevents problems going in the opposite direction.)
Another item in this code snippet is worth noting: under Python 3, when constructing the alternate I/O streams (note that the above snippet doesn’t do anything to them under Python 2), you have to set the newline parameter to "\n", as shown, or you will have problems with the way Python handles line breaks in your data. There are actually two issues here. The first is newline translation: by default, Python 3 opens text files in “universal newlines” mode, in which it automatically translates all non-Unix newline markers it finds (i.e., DOS-style "\r\n" newlines and Mac-style "\r" newlines) into its chosen newline marker for internal operations, which is the Unix newline, "\n". Once the translation is done, there’s no way to recover the original newlines. Obviously you don’t want this default behavior.
You can stop Python from translating line breaks when reading files by passing any value for the newline parameter except None. However, when writing files, Python will translate newlines if you pass anything but a blank string '' or "\n" as the newline parameter. If you pass None, or accept Python’s default behavior, any "\n" characters will get translated to the system default line separator, os.linesep, on writing. If you pass the DOS or Mac newline, any "\n" characters will get translated to that newline. This behavior is rather counterintuitive; you might think that, if your data has all DOS newlines, you would want to tell Python that by passing newline="\r\n" when writing a file. In fact, what that will do is make Python translate every "\r\n" to "\r\r\n" when writing the file! This is because, when writing, Python just looks at the "\n", interprets it as a newline (since that’s its internal newline character, as above), and translates it to "\r\n".
Further, there is the second issue, which is string operations. If you pass Python anything but "\n" as the newline parameter to a file you open for reading, then line-related operations on that file, such as readlines() or for line in file, will break lines at markers other than "\n". However, once the incoming data from that file is stored as a Unicode string, any line-related operations on that string, such as splitlines(), will only break lines at "\n". So using anything other than "\n" as the newline parameter creates a mismatch between the way Python processes text files and the way it processes text strings.
You could conceivably try to work through all this, but it’s much better to just avoid the problem by using "\n" as the newline parameter for all files, and accepting that all your program’s internal data will use "\n" as the newline marker, in accordance with Python’s internal data model.
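The write-side translation surprise is easy to demonstrate with io.StringIO, which takes the same newline parameter as open(). This is just an illustrative sketch, not part of the porting recipe itself:

```python
import io

# newline="\r\n" translates each "\n" written to "\r\n" -- so data
# that already contains DOS newlines comes out doubled as "\r\r\n".
buf = io.StringIO(newline="\r\n")
buf.write("line\r\n")
assert buf.getvalue() == "line\r\r\n"

# newline="" (or "\n") suppresses translation on write, so the data
# passes through unchanged.
buf = io.StringIO(newline="")
buf.write("line\r\n")
assert buf.getvalue() == "line\r\n"
```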
Finally, if your program is going to be used interactively, you will want to set line_buffering=True, as shown in the code above, so that interactive sessions will work as expected.
Fix sort() calls
Remember that in Python 2 the sort() method takes an optional two-argument comparison function, but in Python 3 it instead takes a "sort key" function, which must return a sort value for each element and be passed in as the value of the keyword argument key=.
If you’re sorting numbers or strings or anything else for which there is a strict ordering in Python, your key function can just be the identity, lambda x: x. (This is actually the default behavior if you don’t specify a key.)
If you’re sorting on tuples and want the usual behavior (earliest tuple member is most significant for sort value), remember that Python uses a stable sort, so sorting on each tuple member in turn, least significant first, will give the desired effect.
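For instance, this illustrative snippet (not from SRC or reposurgeon) sorts pairs with the first member most significant by sorting on the least significant member first:

```python
# Sort by the second tuple member, then by the first. Because
# Python's sort is stable, the later (primary-key) pass preserves
# the order established by the earlier (secondary-key) pass.
pairs = [(2, "b"), (1, "b"), (2, "a"), (1, "a")]
pairs.sort(key=lambda p: p[1])   # secondary key first
pairs.sort(key=lambda p: p[0])   # primary key last
assert pairs == [(1, "a"), (1, "b"), (2, "a"), (2, "b")]
```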
Let’s say you already had a sortvalue() function for the objects you’re sorting. Then polyglot code will, at worst, look like this:
try:
    sequence.sort(lambda x, y: sortvalue(x) - sortvalue(y))
except TypeError:
    sequence.sort(key=sortvalue)
Actually, since Python 2.4, the sort() method accepts a named key= argument, so sequence.sort(key=sortvalue) alone will do.
Fix dictionary views
Let’s say you have some code like this.
my_dict = { "a" : 1 }
keys = my_dict.keys()
By default, when you run 2to3, your code will be changed to:
my_dict = { "a" : 1 }
keys = list(my_dict.keys())
This is because in Python 3, keys() returns a dictionary view, which is different from the list you get in Python 2, and is also different from the iterator you get with iterkeys() in Python 2.
But in most cases, you just want to iterate over the keys, so we recommend using 2to3 with --nofix=dict.
Be careful, though: code will blow up if you have something like:
my_dict = { "a" : 1 }
keys = my_dict.keys()
keys.sort()
That’s because dictionary views do not have a sort() method.
Instead, write something like:
my_dict = { "a" : 1 }
keys = my_dict.keys()
keys = sorted(keys)
Another gotcha is when you change the dictionary:
for key in my_dict.keys():
    if something(key):
        del my_dict[key]
Here there’s no choice but to convert to a list:
for key in list(my_dict.keys()):
    if something(key):
        del my_dict[key]
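Under Python 3 the live-view version actually fails at runtime. This small sketch shows both the failure and the list() fix; something() here is just a hypothetical stand-in predicate:

```python
def something(key):
    "Hypothetical stand-in predicate: delete every key."
    return True

my_dict = {"a": 1, "b": 2}
try:
    for key in my_dict.keys():   # iterating a live view...
        if something(key):
            del my_dict[key]     # ...while mutating the dictionary
except RuntimeError:
    pass  # Python 3 raises "dictionary changed size during iteration"
else:
    raise AssertionError("expected a RuntimeError on Python 3")

# Snapshotting the keys with list() makes the deletion safe.
my_dict = {"a": 1, "b": 2}
for key in list(my_dict.keys()):
    if something(key):
        del my_dict[key]
assert my_dict == {}
```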
Fix up exception classes
In some older (pre-2.6) versions of Python 2, in order for a class to be thrown or caught as an exception value, it had to inherit from exceptions.Exception, and the exceptions module had to be imported to support that.
In Python 3, the exceptions module does not exist. Instead, exception classes must inherit from a builtin class called BaseException.
In Python 2 versions 2.6 and later, it was possible to inherit from BaseException as in 3, but the older way of inheriting from exceptions.Exception also worked.
If you have exceptions in your imports, remove it. Then make your exception classes inherit from BaseException.
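Following that advice, a hypothetical polyglot exception class looks like this; it works unchanged under Python 2.7 and Python 3 because BaseException is a builtin in both:

```python
# Hypothetical example class: inheriting from the builtin
# BaseException means no 'import exceptions' is needed under
# either Python version.
class RecoverableError(BaseException):
    def __init__(self, msg):
        BaseException.__init__(self, msg)
        self.msg = msg

# The 'except ... as e' syntax is valid in both 2.7 and 3.
try:
    raise RecoverableError("malformed metadata")
except RecoverableError as e:
    caught = e.msg
assert caught == "malformed metadata"
```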
Wrapping up
Now you can hack the shebang line so it says #!/usr/bin/env python, confident that the program will work whether the environment defaults to Python 2 or Python 3.
Final step: tweak your regression tests so that they run twice, once under Python 2 and once under Python 3. This way you’ll avoid unpleasant surprises during later modifications.
The setpython script can be some help with this:
#!/bin/sh
#
# setpython - create a local link from 'python' to a specified version
#
# This script is used to redirect the 'python' in a shebang line to
# a specified version when running regression tests.
if [ -z "$1" ]
then
    ls -l python
elif [ $1 = "python" ]
then
    rm -f ./python
elif [ $1 = python2 -o $1 = python3 ]
then
    set -- `whereis $1`
    shift
    case $1 in
    */bin/*) ln -sf $1 ./python; echo "python -> $1";;
    *) echo "setpython: no python binary" >&2;;
    esac
else
    echo "setpython: unrecognized python version" >&2
fi
This helps you redirect a #!/usr/bin/env python shebang to your choice of a Python 2 or 3 version without actually modifying the Python code. To use it, ensure that the $PATH seen by your test scripts includes . (the current directory), so that ./python will intercept the python invocation in the shebang.
Worked examples
In the SRC repository, look at Peter Donis’s commits on 2016-02-16 and 2016-02-17. A second series by ESR (but based on Peter’s theory) adds the high-byte-preserving Latin-1 encoding for sys error streams, on 2016-02-21 just following the 1.10 tag.
Peter did most of this port while we were still working out how to systematize the process and exploit Latin-1, so the sequence of steps does not exactly match the ideal version described in the previous section.
In the reposurgeon repository, the commits beginning with “Remove some Python 2 functions that aren’t in Python 3.” on 2016-02-23 and ending with “Cleanup changes for the Python 3 port.” on 2016-02-24 are effectively all of the Python 3 port changes. There was a small amount of preparation before these and some tying up of loose ends afterward, but they show the major porting steps you will have to do pretty well.
In the deheader repository, you can see an example of how much smaller the polyglot changes get when you don’t have binary data to worry about. Look at the four changesets following the 1.3 tag. The only serious issue here was that an iteration over a dictionary had to be replaced by an iteration over its sorted keys, because the dictionary traversal order in 2 and 3 is different. For safety, the sort used in directory entries was also strengthened a little.
In the GPSD repository, the xgpsspeed test client is a worked example of the Gtk2 to Gtk3 port procedure. Most of the work is done in "Eliminate use of event argument in the xgpsspeed draw handler." (2016-03-25T00:51:50) and "xgpsspeed successfully ported to python-gi" (2016-03-25T01:03:35); the second one is where pygi-convert.sh was applied. A bit of cleanup follows.
References
Improving this HOWTO
If, when you apply this method, you solve a problem we haven’t covered, please tell us about it so we can improve the HOWTO.
The master of this document is hosted on gitlab at
git@gitlab.com:esr/practical-python-porting.git
The Python code snippets for autoadapting to Python 2 or 3 are available as standalone code files in that repository.
Change history
For fine-grained changes, look in the repository history.
- 1.0: Initial release.
- 1.1: Explain how to fix up sort calls and exception-class declarations.
- 1.2: Add a checklist item on functional print. Revised the section on exception classes. Substantial new material on division, exceptions, dictionary views, and other topics by Dimitri Merejkowsky.
- 1.3: Add deheader as a simpler example without binary-data problems.
- 1.4: Add a warning about buggy behavior of getstatusoutput() in Python 3.
- 1.5: Add advice to import float division unconditionally.
- 1.6: Details on porting from pygtk/gobject for Gtk2 to python-GI for Gtk3.
- 1.7: Added "Fix up iterator methods".
- 1.8: Added polyord() and polychr() to polystr-inclusion.py.
- 1.9: How to fix up raise calls. Simpler sort porting.
- 1.10: Replacing use of string_escape codec.
- 1.11: Pickle file opens must be binary. What to do about reduce and StringIO.
- 1.12: Reference PEP394. Rolling your own RNG may be necessary.