Surprising Python

I haven’t written anything on Python here in a good while.  But that doesn’t mean I haven’t been busy wrestling with it.  I’ll need to take a look at my Perforce changelists over the last months and take stock.  In the meantime, I’d like to rant a bit about a most curious peculiarity of Python that I came across  a while back.

Continue reading

Idioms for proxy function interfaces

At PyCon 2013 I saw a presentation, with a common function signature:

def call_later(when, function, *args):

This got me thinking about some guidelines I wrote recently on our internal tech blog about how to write such proxy functions. The current recommendation I have is for a different signature, for the reason I shall now explain:

Let’s say that you have a function that calls another function for some reason. You start with something like this:

def mywrapper(func, *args, **kwargs):
   return func(*args, **args)

At some point though, you add another higher level wrapper:

def mybigwrapper(func, *args, **kwargs):
   return mywraper(func, *args, **args)

This is ok, until someone notices that this is rather slow. The reason is, that arguments are constantly being packed and unpacked. Unnecessarily so, because no one is really looking at them. So a clever software engineer comes up with a solution:

def mywrapper(func, *args, **kwargs):
   return mywrapper_without_the_stars(args, args)

def mywrapper_without_the_stars(func, args, kwargs):
   return func(*args, **args)

def mybigwrapper(func, *args, **kwargs):
   return mywraper_without_the_stars(func, args, args)

What has happened? Yes, we have created a set of functions that do not take variable arguments, but rather just take the argument tuple and keyword dict. When you nest a number of those, there is no argument packing and unpacking going on and they are all passed through verbatim. We then have a thin layer outside that does the argument packing, for api backwards compatibility.

But there is a lesson here: Perhaps it is not such a good idea to do this style of interface in the first place. Why didn’t we just write:

def mywrapper(func, args=(), kwargs={}):
   return func(*args, **args)

to begin with? In my opinion, this is actually a much better interface. To illustrate, lets say that we want to wrap a call to myfunc(1,2,3). Compare these two styles:

return mywrapper(myfunc, 1, 2, 3)


return mywrapper(myfunc, (1, 2, 3))

In the former case, we are mixing the callable (myfunc) and its arguments (1, 2, 3) into one big list. This doesn’t really make the distinction that “myfunc” is the callable and “1” is its first argument, but rather they look semantically to be equivalent, as if they were all just a chunk of arguments. In my opinion it is much clearer, when using this sort of proxy functions, to make a distinction between the callable and its arguments.
Therefore, this is currently the recommended way within CCP to write such wrappers. They take the argument tuple (and keyword dict) as a non-variable argument to the function.  Variable argument lists are only used in two cases:

  1. When writing a function where that is appropriate, such as logging functions
  2. When writing wrapper functions that emulat other function’s signature.

But recently, I have been thinking even more about this because passing around “args” and “kwargs” everywhere seems unnecessarily clunky. And we arrive at the thesis of this blog post:

Wrapper functions should be written and used like this:

# wrapper takes an argument-less callable
def mywrapper(func):
   return func()

# call myfunc with default args
a = mywrapper(myfunc)

# call myfunc with some arguments
a = mywrapper(lambda : myfunc(1, 2, 3))

# call myfunc with something from this context
def call():
    return myfunc(foo, bar)
a = mywrapper(call)

In other words: How about using Python’s powerful lambda and closure semantics to add those arguments if and when they are needed, rather than to write layer upon layer of functions that manually carry around argument tuples and keyword dicts?


After a long hiatus, the Cosmic Percolator is back in action.  Now it is time to rant about all things Python, I think.  Let’s start with this here, which came out from work I did last year.

Stackless has had an “atomic” feature for a long time. In this post I am going to explain its purpose and how I reacently extended it to make working with OS threads easier.


In Stackless python, scheduling it cooperative.  This means that a tasklet is normally uninterrupted until it explicitly does something that would cause another one to run, like sending a message over a channel.  This allows one to write logic in stackless without worrying too much about synchronization.

However, there is an important exception to this: It is possible to run stackless tasklets throught the watchdog and this will interrupt a running tasklet if it exceeds a pre-determined number of executed opcodes:

while True:
    interrupted =
    if interrupted:
        print interrupted, "has been running quite a bit!"
        break # Ok, nothing runnable anymore

This code may cause a tasklet to be interrupted at an arbitrary point (actually during a tick interval, the same point that yields the GIL) and cause a switch to the main tasklet.

Of course, not all code uses this execution mode, but never the less, it has always been considered a good idea to be aware of this.  For this reason, an atomic mode has been supported which would inhibit this involuntary switching in sensitive areas:

oldvalue = stackless.getcurrent().set_atomic(1)
    myglobalvariable1 += 1
    myglobalvariable2 += 2

The above is then optionally wrapped in a context manager for readability:

def atomic()
    oldv = stackless.getcurrent().set_atomic(1)
        yield None

the atomic state is a property of each tasklet and so even when there is voluntary switching performed while a non-zero atomic state is in effect, it has no effect on other tasklets.  Its only effect is to inhibit involuntary switching of the tasklet on which it is set.

A Concrete Example

To better illustrate its use, lets take a look at the implementation of the Semaphore from stacklesslib (stacklesslib.locks.Semaphore):

class Semaphore(LockMixin):
    def __init__(self, value=1):
        if value < 0:
            raise ValueError
        self._value = value
        self._chan =

    def acquire(self, blocking=True, timeout=None):
        with atomic():
            # Low contention logic: There is no explicit handoff to a target,
            # rather, each tasklet gets its own chance at acquiring the semaphore.
            got_it = self._try_acquire()
            if got_it or not blocking:
                return got_it

            wait_until = None
            while True:
                if timeout is not None:
                    # Adjust time.  We may have multiple wakeups since we are a
                    # low-contention lock.
                    if wait_until is None:
                        wait_until = elapsed_time() + timeout
                        timeout = wait_until - elapsed_time()
                        if timeout < 0:
                            return False
                    lock_channel_wait(self._chan, timeout)
                if self._try_acquire():
                    return True

    def _try_acquire(self):
        if self._value > 0:
            self._value -= 1
            return True
        return False

This code illustrates how the atomic state is incremented (via a context manager) and kept non-zero while we are doing potentially sensitive things, in this case, doing logic based on self._value. Since this is code that is used for implementing a Semaphore, which itself forms the basis of other stacklesslib.locks objects such as CriticalSection and Condition objects, this is the only way we have to ensure atomicity.


It is worth noting that using the atomic property has largely been confined to such library code as the above. Most stackless programs indeed do not run the watchdog in interruptible mode, or they use the so-called soft-interrupt mode which breaks the scheduler only at the aforementioned voluntary switch points.

However, in the last two years or so, I have been increasingly using Stackless Python in conjunction with OS threads.  All the stackless constructs, such as channels and tasklets work with threads, with the caveat that synchronized rendezvous isn’t possible between tasklets of different threads.  A channel.send() where the recipient is a tasklet from a different thread from the sender will always cause the target to become runnable in that thread, rather than to cause immediate switching.

Using threads has many benefits.  For one, it simplifies certain IO operations.  Handing a job to a tasklet on a different thread won’t block the main thread.  And using the usual tasklet communication channels to talk uniformly to all tasklets, whether they belong to this thread or another, makes the architecture uniform and elegant.

The locking constructs in stacklesslib also all make use of non-immediate scheduling.  While we use the object to wait, we make no assumptions about immediate execution when a target is woken up.  This makes them usable for synchronization between tasklets of different threads.

Or, this is what I thought, until I started getting strange errors and realized that tasklet.atomic wasn’t inhibiting involuntary switching between threads!


You see, Python internally can arbitrarily stop executing a particular thread and start running another.  This is called yielding the GIL and it happens at the same part in the evaluation loop as that involuntary breaking of a running tasklet would have been performed.  And stackless’ atomic property din’t affect this behaviour.  If the python evaluation loop detects that another thread is runnable and waiting to execute python code, it may arbitrariliy yield the GIL to that thread and wait to reacquire the GIL again.

When using the above lock to synchronize tasklets from two threads, we would suddenly have a race condition, because the atomic context manager would no longer prevent two tasklets from making simultaneous modifications to self._value, if those tasklets came belonged to different threads.

A Conundrum

So, how to fix this?  An obvious first avenue to explore would be to use one of the threading locks in addition to the atomic flag.  For the sake of argument, let’s illustrate with a much simplified lock:

class SimpleLock(object):
    def __init__(self):
        self._chan =
        self._chan.preference = 0 # no preference, receiver is made runnable
        self._state = 0

    def acquire(self):
        # oppertunistic lock, without explicit handoff.
        with atomic():
            while True:
                if self._state == 0:
                    self._state = 1:
    def release():
        with atomic():
            self._state == 0
            if self._chan.balance():
                self._chan.send(None) # Wake up someone who is waiting

While this lock will work nicely with tasklets on the same thread. But when we try to use it for locking between two threads, the atomicity of changing self._state and examining self._chan.balance() won’t be maintained.

We can try to fix this with a proper thread lock:

class SimpleLockedLock(object):
    def __init__(self):
        self._chan =
        self._chan.preference = 0 # no preference, receiver is made runnable
        self._state = 0
        self._lock = threading.Lock()

    def acquire(self):
        # oppertunistic lock, without explicit handoff.
        with atomic():
            while True:
                with self._lock:
                    if self._state == 0:
                        self._state = 1:
    def release():
        with atomic():
            with self._lock:
                self._state == 0
                if self._chan.balance():
                    self._chan.send(None) # Wake up someone who is waiting

This version is more cumbersome, of course, but the problem is, that it doesn’t really fix the issue. There is still a race condition in acquire(), between relesing self._lock and calling self._chan.receive().

Even if we were to modify self.chan.receive() to take a lock and atomically release it before blocking, and reaquire it before returning, that would be a very unsatisfying solution.

thankfully, since we needed to go and modify Stackless Python anyway, there was a much simpler solution.

Fixing Atomic

You see, Python is GIL synchronized.  In the same way that only one tasklet of a particular thread is executing at the same time,  then regular cPython is has the GIL property that only one of the processes thread is runinng python code at a time.  So, at any one time, only one tasklet of one thread is running python code.

So, if atomic can inhibit involuntary switching between tasklets of the same threads, can’t we just extend it to inhibit involuntary switching between threads as well?  Jessörry Bob, it turns out we can.

This is the fix (ceval.c:1166, python 2.7):

/* Do periodic things.  Doing this every time through
the loop would add too much overhead, so we do it
only every Nth instruction.  We also do it if
``pendingcalls_to_do'' is set, i.e. when an asynchronous
event needs attention (e.g. a signal handler or
async I/O handler); see Py_AddPendingCall() and
Py_MakePendingCalls() above. */
/* don't do periodic things when in atomic mode */
if (--_Py_Ticker < 0 && !tstate->st.current->flags.atomic) {
if (--_Py_Ticker < 0) {

That’s it! Stackless’ atomic flag has been extended to also stop the involuntary yielding of the GIL from happening.  Of course voluntary yielding, such as that which is done when performing blocking system calls, is still possible, much like voluntary switching between tasklets is also possible.  But when the tasklet’s atomic value is non-zero, this guarantees that no unexpected switch to another tasklet, be it on this thread or another, happens.

This fix, dear reader, was sufficient to make sure that all the locking constructs in stacklesslib worked for all tasklets.

So, what about cPython?

It is worth noting that the locks in stacklesslib.locks can be used to replace the locks in threading.locks:  If your program is just a regular threaded python program, then it will run correctly with the locks from stacklesslib.locks replacing the ones in threading.locks.  This includes, Semaphore, Lock, RLock, Condition, Barrier, Event and so on.  and all of them are now written in Python-land using regular Python constructs and made to work by the grace of the extended tasklet.atomic property.

Which brings me to ask the question: Why doesn’t cPython have the thread.atomic property?

I have seen countless questions on the python-dev mailing lists about whether this or that operation is atomic or not.  Regularly one sees implementation changes to for example list and dict operations to add a new requirement that an operation be atomic wrt. thread switches.

Wouldn’t it be nice if the programmer himself could just say: “Ah, I’d like to make sure that my updating this container here will be atomic when seen from the other threads.  Let’s just use the thread.atomic flag for that.”

For cPython, this would be a perfect light-weight atomic primitive.  It would be very useful to synchronize access to small blocks of code like this.  For other implementations of Python, those that are truly GIL free, a thread.atomic property could be implemented with a single system global threading.RLock. Provided that we add the caveat to a thread.atomic that it should be used by all agents accessing that data, we would now have a system for mutual access that wold work very cheaply on cPython and also work (via a global lock) on other implementations.

Let’s add thread.atomic to cPython

The reasons I am enthusiastic about seeing an “atomic” flag as part of cPython are twofold:

  1. It would fill the role of a lightweight synchronization primitive that people are requesting where a true Lock is considered too expensive, and where it makes no sense to have a per-instance lock object.
  2. More importantly, it will allow Stackless functionality to be added to cPython as a pure extension module, and it will allow such inter-thread operations to be added to Greenlet-based programs in the same way as we have solved the problem for Stackless Python.
  3. And thirdly?  Because Debbie Harry says so:

 Update, 23.03.2013:

Emulating an “atomic” flag in an truly multithreaded environment with a lock is not as simple as I first though.  The cool thing about “atomic” is that it still allows the thread to block, e.g. on an IO operation, without affecting other threads.  For an atomic-like lock to work, such a lock would need to be automatically yielded and re-acquired when blocking, bringing us back to a condition-variable-like model.  Since the whole purpose of “atomic” is to be lightweight in a GIL-like environment, forcing it to be backwards compatible with a truly multi-threaded solution is counter-productive.  So, “atomic” as a GIL only feature is the only thing that makes sense, for now.  Unless I manage to dream up an alternative.

Blog moved

So! My previous blog (at disappeared.  It has taken this long for me to get things back up and running.  The previous blog was on a private server run by an external party and it was compromised by one of those sneaky internet maladies that infect sites like these.

I’m in the process of salvaging all the posts from there and get them running here.  This is a somewhat painstaking process.  I was provided with some files in a folder and I have had to re-learn unix (Its been 10 years since I used that regularly), learn about Apache, WordPress, mysql, multi-site installs and all kinds of things.  10 years ago, they didn’t have sudo.  I’m not sure that I like it.

Anyway, I hope to finish this soon and start blogging again about my adventures in Python. PyCon 2013 is coming up and I have, as always, some ideas to put forward and Swiss army knives to grind.

Zombieframes. A gratuitous optimization?

Examing a recent crash case, I stumbled across this code in frameobject.c:

PyFrameObject *
PyFrame_New(PyThreadState *tstate, PyCodeObject *code, PyObject *globals,
PyObject *locals)
if (code->co_zombieframe != NULL) {
f = code->co_zombieframe;
code->co_zombieframe = NULL;
_Py_NewReference((PyObject *)f);
assert(f->f_code == code);

Intrigued by the name, I examined the header where it is defined, code.h:

void *co_zombieframe; /* for optimization only (see frameobject.c) */
} PyCodeObject;

It turns out that for every PyCodeObject object that has been executed, a PyFrameObject of a suitable size is cached and kept with the code object. Now, caching is fine and good, but this cache is unbounded. Every code object has the potential to hang on to a frame, which may then never be released.
Further, there is a separate freelist cache for PyFrameObjects already, in case a frame is not found on the code object:

if (free_list == NULL) {
f = PyObject_GC_NewVar(PyFrameObject, &PyFrame_Type,
if (f == NULL) {
return NULL;
else {
assert(numfree > 0);
f = free_list;
free_list = free_list->f_back;

Always concious about memory these days, I tried disabling this in version 3.3 and running the pybench test. I was not able to see any conclusive difference in execution speed.


Disabling the zombieframe on the PS3 shaved off some 50k on startup.  Not the jackpot, but still, small things add up.

* using CPython 3.3.0a3+ (default, May 23 2012, 20:02:34) [MSC v.1600 64 bit (AMD64)]
* disabled garbage collection
* system check interval set to maximum: 2147483647
* using timer: time.perf_counter
* timer: resolution=2.9680909446810176e-07, implementation=QueryPerformanceCounter()

Benchmark: nozombie

Rounds: 10
Warp: 10
Timer: time.perf_counter

Machine Details:
Platform ID: Windows-7-6.1.7601-SP1
Processor: Intel64 Family 6 Model 26 Stepping 5, GenuineIntel

Implementation: CPython
Executable: D:pydevhgcpython2pcbuildamd64python.exe
Version: 3.3.0a3+
Compiler: MSC v.1600 64 bit (AMD64)
Bits: 64bit
Build: May 23 2012 20:02:34 (#default)
Unicode: UCS4

Comparing with: zombie

Rounds: 10
Warp: 10
Timer: time.perf_counter

Machine Details:
Platform ID: Windows-7-6.1.7601-SP1
Processor: Intel64 Family 6 Model 26 Stepping 5, GenuineIntel

Implementation: CPython
Executable: D:pydevhgcpython2pcbuildamd64python.exe
Version: 3.3.0a3+
Compiler: MSC v.1600 64 bit (AMD64)
Bits: 64bit
Build: May 23 2012 20:00:42 (#default)
Unicode: UCS4

Test minimum run-time average run-time
this other diff this other diff
BuiltinFunctionCalls: 51ms 52ms -3.3% 52ms 53ms -2.0%
BuiltinMethodLookup: 33ms 33ms +0.0% 34ms 34ms +0.8%
CompareFloats: 50ms 50ms +0.1% 50ms 50ms +0.4%
CompareFloatsIntegers: 99ms 98ms +0.8% 99ms 99ms +0.6%
CompareIntegers: 77ms 77ms -0.5% 77ms 77ms -0.3%
CompareInternedStrings: 60ms 60ms +0.0% 61ms 61ms -0.1%
CompareLongs: 46ms 45ms +1.5% 46ms 45ms +1.2%
CompareStrings: 61ms 59ms +3.6% 61ms 59ms +3.6%
ComplexPythonFunctionCalls: 60ms 58ms +3.3% 60ms 58ms +3.2%
ConcatStrings: 48ms 47ms +2.4% 48ms 47ms +2.1%
CreateInstances: 58ms 57ms +1.3% 59ms 58ms +1.3%
CreateNewInstances: 43ms 43ms +1.1% 44ms 44ms +1.1%
CreateStringsWithConcat: 79ms 79ms -0.3% 79ms 79ms -0.1%
DictCreation: 71ms 71ms +0.4% 72ms 72ms +1.0%
DictWithFloatKeys: 72ms 70ms +2.1% 72ms 71ms +1.8%
DictWithIntegerKeys: 46ms 46ms +0.7% 46ms 46ms +0.4%
DictWithStringKeys: 41ms 41ms +0.0% 41ms 41ms -0.1%
ForLoops: 35ms 37ms -4.0% 35ms 37ms -4.0%
IfThenElse: 64ms 64ms -0.1% 64ms 64ms -0.4%
ListSlicing: 49ms 50ms -1.0% 53ms 53ms -0.8%
NestedForLoops: 54ms 51ms +6.7% 55ms 51ms +6.7%
NestedListComprehensions: 54ms 54ms -0.7% 54ms 55ms -2.2%
NormalClassAttribute: 94ms 94ms +0.1% 94ms 94ms +0.1%
NormalInstanceAttribute: 54ms 54ms +0.3% 54ms 54ms +0.2%
PythonFunctionCalls: 58ms 57ms +0.8% 58ms 58ms +0.6%
PythonMethodCalls: 65ms 61ms +6.3% 66ms 62ms +5.9%
Recursion: 84ms 85ms -1.0% 85ms 85ms -0.9%
SecondImport: 74ms 76ms -2.5% 74ms 77ms -3.5%
SecondPackageImport: 75ms 78ms -3.8% 76ms 79ms -3.9%
SecondSubmoduleImport: 163ms 169ms -3.4% 164ms 170ms -3.3%
SimpleComplexArithmetic: 43ms 43ms +1.0% 43ms 43ms +1.0%
SimpleDictManipulation: 80ms 78ms +2.2% 81ms 79ms +2.4%
SimpleFloatArithmetic: 42ms 42ms +0.1% 42ms 42ms -0.0%
SimpleIntFloatArithmetic: 52ms 53ms -1.2% 52ms 53ms -1.1%
SimpleIntegerArithmetic: 52ms 52ms -0.7% 52ms 53ms -0.8%
SimpleListComprehensions: 45ms 45ms -0.2% 45ms 45ms +0.3%
SimpleListManipulation: 44ms 46ms -4.0% 44ms 46ms -3.9%
SimpleLongArithmetic: 32ms 32ms -0.9% 32ms 32ms -0.1%
SmallLists: 58ms 57ms +1.2% 58ms 67ms -12.8%
SmallTuples: 64ms 65ms -0.5% 65ms 65ms -0.2%
SpecialClassAttribute: 148ms 149ms -0.8% 149ms 150ms -1.0%
SpecialInstanceAttribute: 54ms 54ms +0.2% 54ms 54ms +0.0%
StringMappings: 120ms 117ms +2.5% 120ms 117ms +2.5%
StringPredicates: 62ms 62ms +0.9% 62ms 62ms +1.0%
StringSlicing: 69ms 68ms +1.6% 69ms 68ms +2.1%
TryExcept: 37ms 37ms +0.0% 37ms 37ms +0.5%
TryFinally: 40ms 37ms +6.7% 40ms 37ms +6.5%
TryRaiseExcept: 19ms 20ms -1.0% 20ms 20ms -0.4%
TupleSlicing: 65ms 65ms +0.5% 66ms 65ms +1.2%
WithFinally: 57ms 56ms +1.9% 57ms 56ms +2.1%
WithRaiseExcept: 53ms 53ms +0.3% 54ms 54ms -0.8%
Totals: 3154ms 3145ms +0.3% 3176ms 3177ms -0.0%

(this=nozombie, other=zombie)

I’m going to remove this weird, unbounded cache from the python interpreter we use on the PS3.

Killing a Stackless bug

What follows is an account of how I found and fixed an insidious bug in Stackless Python which has been there for years.  It’s one of those war stories.  Perhaps a bit long winded and technical and full of exaggerations as such stories tend to be.


Some weeks ago, because of a problem in the client library we are using, I had to switch the http library we are using on the PS3 from using non-blocking IO to blocking. Previously, we were were issuing all the non-blocking calls, the “select” and the tasklet blocking / scheduling on the main thread. This is similar to how gevent and other such libraries do things. Switching to blocking calls, however, meant doing things on worker threads.

The approach we took was to implement a small pool of pyton workers which could execute arbitrary jobs. A new utility function, stacklesslib.util.call_async() then performed the asynchronous call by dispatching it to a worker thread. The idea of an call_async() is to have a different tasklet execute the callable while the caller blocks on a channel. The return value, or error, is then propagated to the originating tasklet using that channel. Stackless channels can be used to communicate between threads too. And synchronizing threads in stackless is even more conveninent than regular Python because there is stackless.atomic, which not only prevents involuntary scheduling of tasklets, it also prevents automatic yielding of the GIL (cPython folks, take note!)

This worked well, and has been running for some time. The drawback to this approach, of course, is that we now need to keep python threads around, consuming stack space. And Python needs a lot of stack.

The problem

The only problem was, that there appeared to be a bug present. One of our developers complained that sometimes, during long downloads, the http download function would return None, rather than the expected string chunk.

Now, this problem was hard to reproduce. It required a specific setup and geolocation was also an issue. This developer is in California, using servers in London. Hence, there ensued a somewhat prolonged interaction (hindered by badly overlapping time-zones) where I would provide him with modified .py files with instrumentation, and he would provide me with logs. We quickly determined, to my dismay, that apparently, sometimes a string was turning into None, while in transit trough a channel.send() to a channel.receive(). This was most distressing. Particularly because the channel in question was transporting data between threads and this particular functionality of stackless has not been as heavily used as the rest.

Tracking it down

So, I suspected a race condition of some sorts. But a careful review of the channel code and the scheduling code presented no obvious candidates. Also, the somehwat unpopular GIL was being used throughout, which if done correctly ensures that things work as expected.

To cut a long story short, by a lucky coincidence I managed to reproduce a different manifestation of the problem. In some cases, a simple interaction with a local HTTP server would cause this to happen.

When a channel sends data between tasklets, it is temporarily stored on the target tasklet’s “tempval” attribute. When the target wakes up, this is then taken and returned as the result from the “receive()” call. I was able to establish that after sending the data, the target tasklet did indeed hold the correct string value in its “tempval” attribute. I then needed to find out where and why it was disappearing from that place.

By adding instrumentation code to the stackless core, I established that this was happening in the last line of the following snippet:

PyObject *
    PyThreadState *ts = PyThreadState_GET();
    PyObject *retval;

    if ( (ts->st.main == NULL) && initialize_main_and_current()) {
        ts->frame = NULL;
        return NULL;

    TASKLET_CLAIMVAL(ts->st.current, &retval);

By setting a breakpoint, I was able to see that I was in the top level part of the “continue” bit of the “stack spilling” code

Stack spilling is a feature of stackless where the stack slicing mechanism is used to recycle a deep callstack. When it detects that the stack has grown beyond a certain limit, it is stored away, and a hard switch is done to the top again, where it continues its downwards crawl. This can help conserve stack address space, particularly on threads where the stack cannot grow dynamically.

So, something wrong with stack spilling, then.  But even so, this was unexpected. Why was stack spilling happening when data was being transmitted across a channel? Stack spilling normally occurs only when nesting regular .py code and other such things.

By setting a breakpoint at the right place, where the stack spilling code was being invoked, I finally arrived at this callstack:

Type Function
PyObject* slp_eval_frame_newstack(PyFrameObject* f, int exc, PyObject* retval)
PyObject* PyEval_EvalFrameEx_slp(PyFrameObject* f, int throwflag, PyObject* retval)
PyObject* slp_frame_dispatch(PyFrameObject* f, PyFrameObject* stopframe, int exc, PyObject* retval)
PyObject* PyEval_EvalCodeEx(PyCodeObject* co, PyObject* globals, PyObject* locals, PyObject** args, int argcount, PyObject** kws, int kwcount, PyObject** defs, int defcount, PyObject* closure)
PyObject* function_call(PyObject* func, PyObject* arg, PyObject* kw)
PyObject* PyObject_Call(PyObject* func, PyObject* arg, PyObject* kw)
PyObject* PyObject_CallFunctionObjArgs(PyObject* callable)
void PyObject_ClearWeakRefs(PyObject* object)
void tasklet_dealloc(PyTaskletObject* t)
void subtype_dealloc(PyObject* self)
int slp_transfer(PyCStackObject** cstprev, PyCStackObject* cst, PyTaskletObject* prev)
PyObject* slp_schedule_task(PyTaskletObject* prev, PyTaskletObject* next, int stackless, int* did_switch)
PyObject* generic_channel_action(PyChannelObject* self, PyObject* arg, int dir, int stackless)
PyObject* impl_channel_receive(PyChannelObject* self)
PyObject* call_function(PyObject*** pp_stack, int oparg)

Notice the “subtype_dealloc”. This callstack indicates that in the channel receive code, after the hard switch back to the target tasklet, a Py_DECREF was causing side effects, which again caused stack spilling to occur. The place was this, in slp_transfer():

/* release any objects that needed to wait until after the switch. */

This is code that does cleanup after tasklet switch, such as releasing the last remaining reference of the previous tasklet.

So, the bug was clear then. It was twofold:

  1. A Py_CLEAR() after switching was not careful enough to store the current tasklet’s “tempval” out of harms way of any side-effects a Py_DECREF() might cause, and
  2. Stack slicing itself, when it happened, clobbered the current tasklet’s “tempval”

The bug was subsequently fixed by repairing stack spilling and spiriting “tempval” away during the Py_CLEAR() call.

Post mortem

The inter-thread communication turned out to be a red herring. The problem was caused by an unfortunate juxtaposition of channel communication, tasklet deletion, and stack spilling.
But why had we not seen this before? I think it is largely due to the fact that stack spilling only rarely comes into play on regular platforms. On the PS3, we deliberately set the threshold low to conserve memory space. This is also not the first stack-spilling related bug we have seen on the PS3, but the first one for two years. Hopefully it will be the last.

Since this morning, the fix is in the stackless repository at