Optimizing Python Condition Variables with Telemetry

Working in a games company often allows you to work with the coolest toys.  One of these is a product caled Telemetry from Rad Game Tools.
Continue reading

Advertisements

Zombieframes. A gratuitous optimization?

Examing a recent crash case, I stumbled across this code in frameobject.c:

PyFrameObject *
PyFrame_New(PyThreadState *tstate, PyCodeObject *code, PyObject *globals,
PyObject *locals)
...
if (code->co_zombieframe != NULL) {
f = code->co_zombieframe;
code->co_zombieframe = NULL;
_Py_NewReference((PyObject *)f);
assert(f->f_code == code);
}

Intrigued by the name, I examined the header where it is defined, code.h:

...
void *co_zombieframe; /* for optimization only (see frameobject.c) */
...
} PyCodeObject;

It turns out that for every PyCodeObject object that has been executed, a PyFrameObject of a suitable size is cached and kept with the code object. Now, caching is fine and good, but this cache is unbounded. Every code object has the potential to hang on to a frame, which may then never be released.
Further, there is a separate freelist cache for PyFrameObjects already, in case a frame is not found on the code object:

if (free_list == NULL) {
f = PyObject_GC_NewVar(PyFrameObject, &PyFrame_Type,
extras);
if (f == NULL) {
Py_DECREF(builtins);
return NULL;
}
}
else {
assert(numfree > 0);
--numfree;
f = free_list;
free_list = free_list->f_back;
...

Always concious about memory these days, I tried disabling this in version 3.3 and running the pybench test. I was not able to see any conclusive difference in execution speed.

Update:

Disabling the zombieframe on the PS3 shaved off some 50k on startup.  Not the jackpot, but still, small things add up.

——————————————————————————-
PYBENCH 2.1
——————————————————————————-
* using CPython 3.3.0a3+ (default, May 23 2012, 20:02:34) [MSC v.1600 64 bit (AMD64)]
* disabled garbage collection
* system check interval set to maximum: 2147483647
* using timer: time.perf_counter
* timer: resolution=2.9680909446810176e-07, implementation=QueryPerformanceCounter()

——————————————————————————-
Benchmark: nozombie
——————————————————————————-

Rounds: 10
Warp: 10
Timer: time.perf_counter

Machine Details:
Platform ID: Windows-7-6.1.7601-SP1
Processor: Intel64 Family 6 Model 26 Stepping 5, GenuineIntel

Python:
Implementation: CPython
Executable: D:pydevhgcpython2pcbuildamd64python.exe
Version: 3.3.0a3+
Compiler: MSC v.1600 64 bit (AMD64)
Bits: 64bit
Build: May 23 2012 20:02:34 (#default)
Unicode: UCS4

——————————————————————————-
Comparing with: zombie
——————————————————————————-

Rounds: 10
Warp: 10
Timer: time.perf_counter

Machine Details:
Platform ID: Windows-7-6.1.7601-SP1
Processor: Intel64 Family 6 Model 26 Stepping 5, GenuineIntel

Python:
Implementation: CPython
Executable: D:pydevhgcpython2pcbuildamd64python.exe
Version: 3.3.0a3+
Compiler: MSC v.1600 64 bit (AMD64)
Bits: 64bit
Build: May 23 2012 20:00:42 (#default)
Unicode: UCS4

Test minimum run-time average run-time
this other diff this other diff
——————————————————————————-
BuiltinFunctionCalls: 51ms 52ms -3.3% 52ms 53ms -2.0%
BuiltinMethodLookup: 33ms 33ms +0.0% 34ms 34ms +0.8%
CompareFloats: 50ms 50ms +0.1% 50ms 50ms +0.4%
CompareFloatsIntegers: 99ms 98ms +0.8% 99ms 99ms +0.6%
CompareIntegers: 77ms 77ms -0.5% 77ms 77ms -0.3%
CompareInternedStrings: 60ms 60ms +0.0% 61ms 61ms -0.1%
CompareLongs: 46ms 45ms +1.5% 46ms 45ms +1.2%
CompareStrings: 61ms 59ms +3.6% 61ms 59ms +3.6%
ComplexPythonFunctionCalls: 60ms 58ms +3.3% 60ms 58ms +3.2%
ConcatStrings: 48ms 47ms +2.4% 48ms 47ms +2.1%
CreateInstances: 58ms 57ms +1.3% 59ms 58ms +1.3%
CreateNewInstances: 43ms 43ms +1.1% 44ms 44ms +1.1%
CreateStringsWithConcat: 79ms 79ms -0.3% 79ms 79ms -0.1%
DictCreation: 71ms 71ms +0.4% 72ms 72ms +1.0%
DictWithFloatKeys: 72ms 70ms +2.1% 72ms 71ms +1.8%
DictWithIntegerKeys: 46ms 46ms +0.7% 46ms 46ms +0.4%
DictWithStringKeys: 41ms 41ms +0.0% 41ms 41ms -0.1%
ForLoops: 35ms 37ms -4.0% 35ms 37ms -4.0%
IfThenElse: 64ms 64ms -0.1% 64ms 64ms -0.4%
ListSlicing: 49ms 50ms -1.0% 53ms 53ms -0.8%
NestedForLoops: 54ms 51ms +6.7% 55ms 51ms +6.7%
NestedListComprehensions: 54ms 54ms -0.7% 54ms 55ms -2.2%
NormalClassAttribute: 94ms 94ms +0.1% 94ms 94ms +0.1%
NormalInstanceAttribute: 54ms 54ms +0.3% 54ms 54ms +0.2%
PythonFunctionCalls: 58ms 57ms +0.8% 58ms 58ms +0.6%
PythonMethodCalls: 65ms 61ms +6.3% 66ms 62ms +5.9%
Recursion: 84ms 85ms -1.0% 85ms 85ms -0.9%
SecondImport: 74ms 76ms -2.5% 74ms 77ms -3.5%
SecondPackageImport: 75ms 78ms -3.8% 76ms 79ms -3.9%
SecondSubmoduleImport: 163ms 169ms -3.4% 164ms 170ms -3.3%
SimpleComplexArithmetic: 43ms 43ms +1.0% 43ms 43ms +1.0%
SimpleDictManipulation: 80ms 78ms +2.2% 81ms 79ms +2.4%
SimpleFloatArithmetic: 42ms 42ms +0.1% 42ms 42ms -0.0%
SimpleIntFloatArithmetic: 52ms 53ms -1.2% 52ms 53ms -1.1%
SimpleIntegerArithmetic: 52ms 52ms -0.7% 52ms 53ms -0.8%
SimpleListComprehensions: 45ms 45ms -0.2% 45ms 45ms +0.3%
SimpleListManipulation: 44ms 46ms -4.0% 44ms 46ms -3.9%
SimpleLongArithmetic: 32ms 32ms -0.9% 32ms 32ms -0.1%
SmallLists: 58ms 57ms +1.2% 58ms 67ms -12.8%
SmallTuples: 64ms 65ms -0.5% 65ms 65ms -0.2%
SpecialClassAttribute: 148ms 149ms -0.8% 149ms 150ms -1.0%
SpecialInstanceAttribute: 54ms 54ms +0.2% 54ms 54ms +0.0%
StringMappings: 120ms 117ms +2.5% 120ms 117ms +2.5%
StringPredicates: 62ms 62ms +0.9% 62ms 62ms +1.0%
StringSlicing: 69ms 68ms +1.6% 69ms 68ms +2.1%
TryExcept: 37ms 37ms +0.0% 37ms 37ms +0.5%
TryFinally: 40ms 37ms +6.7% 40ms 37ms +6.5%
TryRaiseExcept: 19ms 20ms -1.0% 20ms 20ms -0.4%
TupleSlicing: 65ms 65ms +0.5% 66ms 65ms +1.2%
WithFinally: 57ms 56ms +1.9% 57ms 56ms +2.1%
WithRaiseExcept: 53ms 53ms +0.3% 54ms 54ms -0.8%
——————————————————————————-
Totals: 3154ms 3145ms +0.3% 3176ms 3177ms -0.0%

(this=nozombie, other=zombie)

I’m going to remove this weird, unbounded cache from the python interpreter we use on the PS3.

Killing a Stackless bug

What follows is an account of how I found and fixed an insidious bug in Stackless Python which has been there for years.  It’s one of those war stories.  Perhaps a bit long winded and technical and full of exaggerations as such stories tend to be.

Background

Some weeks ago, because of a problem in the client library we are using, I had to switch the http library we are using on the PS3 from using non-blocking IO to blocking. Previously, we were were issuing all the non-blocking calls, the “select” and the tasklet blocking / scheduling on the main thread. This is similar to how gevent and other such libraries do things. Switching to blocking calls, however, meant doing things on worker threads.

The approach we took was to implement a small pool of pyton workers which could execute arbitrary jobs. A new utility function, stacklesslib.util.call_async() then performed the asynchronous call by dispatching it to a worker thread. The idea of an call_async() is to have a different tasklet execute the callable while the caller blocks on a channel. The return value, or error, is then propagated to the originating tasklet using that channel. Stackless channels can be used to communicate between threads too. And synchronizing threads in stackless is even more conveninent than regular Python because there is stackless.atomic, which not only prevents involuntary scheduling of tasklets, it also prevents automatic yielding of the GIL (cPython folks, take note!)

This worked well, and has been running for some time. The drawback to this approach, of course, is that we now need to keep python threads around, consuming stack space. And Python needs a lot of stack.

The problem

The only problem was, that there appeared to be a bug present. One of our developers complained that sometimes, during long downloads, the http download function would return None, rather than the expected string chunk.

Now, this problem was hard to reproduce. It required a specific setup and geolocation was also an issue. This developer is in California, using servers in London. Hence, there ensued a somewhat prolonged interaction (hindered by badly overlapping time-zones) where I would provide him with modified .py files with instrumentation, and he would provide me with logs. We quickly determined, to my dismay, that apparently, sometimes a string was turning into None, while in transit trough a channel.send() to a channel.receive(). This was most distressing. Particularly because the channel in question was transporting data between threads and this particular functionality of stackless has not been as heavily used as the rest.

Tracking it down

So, I suspected a race condition of some sorts. But a careful review of the channel code and the scheduling code presented no obvious candidates. Also, the somehwat unpopular GIL was being used throughout, which if done correctly ensures that things work as expected.

To cut a long story short, by a lucky coincidence I managed to reproduce a different manifestation of the problem. In some cases, a simple interaction with a local HTTP server would cause this to happen.

When a channel sends data between tasklets, it is temporarily stored on the target tasklet’s “tempval” attribute. When the target wakes up, this is then taken and returned as the result from the “receive()” call. I was able to establish that after sending the data, the target tasklet did indeed hold the correct string value in its “tempval” attribute. I then needed to find out where and why it was disappearing from that place.

By adding instrumentation code to the stackless core, I established that this was happening in the last line of the following snippet:

PyObject *
slp_run_tasklet(void)
{
    PyThreadState *ts = PyThreadState_GET();
    PyObject *retval;

    if ( (ts->st.main == NULL) && initialize_main_and_current()) {
        ts->frame = NULL;
        return NULL;
    }

    TASKLET_CLAIMVAL(ts->st.current, &retval);

By setting a breakpoint, I was able to see that I was in the top level part of the “continue” bit of the “stack spilling” code

Stack spilling is a feature of stackless where the stack slicing mechanism is used to recycle a deep callstack. When it detects that the stack has grown beyond a certain limit, it is stored away, and a hard switch is done to the top again, where it continues its downwards crawl. This can help conserve stack address space, particularly on threads where the stack cannot grow dynamically.

So, something wrong with stack spilling, then.  But even so, this was unexpected. Why was stack spilling happening when data was being transmitted across a channel? Stack spilling normally occurs only when nesting regular .py code and other such things.

By setting a breakpoint at the right place, where the stack spilling code was being invoked, I finally arrived at this callstack:

Type Function
PyObject* slp_eval_frame_newstack(PyFrameObject* f, int exc, PyObject* retval)
PyObject* PyEval_EvalFrameEx_slp(PyFrameObject* f, int throwflag, PyObject* retval)
PyObject* slp_frame_dispatch(PyFrameObject* f, PyFrameObject* stopframe, int exc, PyObject* retval)
PyObject* PyEval_EvalCodeEx(PyCodeObject* co, PyObject* globals, PyObject* locals, PyObject** args, int argcount, PyObject** kws, int kwcount, PyObject** defs, int defcount, PyObject* closure)
PyObject* function_call(PyObject* func, PyObject* arg, PyObject* kw)
PyObject* PyObject_Call(PyObject* func, PyObject* arg, PyObject* kw)
PyObject* PyObject_CallFunctionObjArgs(PyObject* callable)
void PyObject_ClearWeakRefs(PyObject* object)
void tasklet_dealloc(PyTaskletObject* t)
void subtype_dealloc(PyObject* self)
int slp_transfer(PyCStackObject** cstprev, PyCStackObject* cst, PyTaskletObject* prev)
PyObject* slp_schedule_task(PyTaskletObject* prev, PyTaskletObject* next, int stackless, int* did_switch)
PyObject* generic_channel_action(PyChannelObject* self, PyObject* arg, int dir, int stackless)
PyObject* impl_channel_receive(PyChannelObject* self)
PyObject* call_function(PyObject*** pp_stack, int oparg)

Notice the “subtype_dealloc”. This callstack indicates that in the channel receive code, after the hard switch back to the target tasklet, a Py_DECREF was causing side effects, which again caused stack spilling to occur. The place was this, in slp_transfer():

/* release any objects that needed to wait until after the switch. */
Py_CLEAR(ts->st.del_post_switch);

This is code that does cleanup after tasklet switch, such as releasing the last remaining reference of the previous tasklet.

So, the bug was clear then. It was twofold:

  1. A Py_CLEAR() after switching was not careful enough to store the current tasklet’s “tempval” out of harms way of any side-effects a Py_DECREF() might cause, and
  2. Stack slicing itself, when it happened, clobbered the current tasklet’s “tempval”

The bug was subsequently fixed by repairing stack spilling and spiriting “tempval” away during the Py_CLEAR() call.

Post mortem

The inter-thread communication turned out to be a red herring. The problem was caused by an unfortunate juxtaposition of channel communication, tasklet deletion, and stack spilling.
But why had we not seen this before? I think it is largely due to the fact that stack spilling only rarely comes into play on regular platforms. On the PS3, we deliberately set the threshold low to conserve memory space. This is also not the first stack-spilling related bug we have seen on the PS3, but the first one for two years. Hopefully it will be the last.

Since this morning, the fix is in the stackless repository at http://hg.python.org/stackless