Mutable default arguments are your friend.

In a recent comment to an elderly post of mine, I was asked about the following code:

def mywrapper(func, args=(), kwargs={}):
    ...

The commenter though that I should have made a special mention about using dict as a default argument, “because it’s such a common gotcha.”

My response is twofold:

  1. This particular case is idiomatic, and widely used for functions that call other functions.
  2. I actually don’t think mutable default arguments are a problem and that they don’t deserve all the stigma they are getting.

I want to expand on point 2 a bit here.

Continue reading

PythonPlus

Time to write a little bit about this little project of mine.

tl;dr

Multithreading more responsive in a Python 2.7.  30% more requests per second.  Satisfaction guaranteed!

Introduction

After leaving my old job at CCP Games last year, I had the urge to try to collect some of the stuff that we had done for Python 2.7 over there and make it available to the world.  So I started this little fork off 2.7.

The idea was to have a place to add “improvements” to vanilla (as opposed to Stackless) 2.7, so that they could be kept separately and in sync with CPython 2.7.

Thus far, what I’ve been mostly focusing on is modernizing thread support.  (for a full list of changes, see the whatsnew file).

When we were working on DUST 514 for the Playstation I had to make certain improvements to make networking work more efficiently on that platform.  We were interfacing stackless python with the native http api of the PS3, and had to use blocking worker threads.  Marshaling from those threads to tasklets was causing unnecessary latency.

We ended up doing a lot of experiments with condition variables, in the end, providing native C implementations to minimize GIL thrashing and reducing wakeup latency to the minimum.

In PythonPlus I have done this and some other stuff in order to improve threading performance.

The threading related changes cover among other things:

  1. Adding timeout parameters to blocking calls as in the 3.x api.
  2. Adding a native threading.Condition object
  3. Improving the GIL

Adding a native Condition object aims to reduce the thread thrashing that is otherwise associated with condition variables, since a lot lof locking and context switching needs to happen for a thread to wake up with the normal .py version of those constructs.  To do this, however, the internal non-recursive locks need to be implemented using a lock and a condition variable themselves, rather than using native semaphore objects.

Changing the lock types used required the GIL to be visited, since the behaviour of the old GIL was just a random side effect of the choice of internal locks.  This also allowed me to address the old Beazley problem while at it.

The GIL change is minor.  It is simply a separate function, and when a CPU bound thread wishes to yield the GIL to another thread, it calls a new api function, _PyThread_yield_GIL().  Threads that are trying to re-aquire the GIL after unlocking them, are considered to be IO threads and have priority for the GIL when a CPU thread yields it.  But if no such thread is present, then the GIL won’t actually be yielded 99 out of every 100 yields.  This minimizes unnecessary thrashing among CPU threads, while allowing IO threads to quickly get their foot in when required.

Performance

I quickly got this all up and running, but then I had to prove it to be actually better than regular 2.7.  To do this, I set up two test scenarios:

  1. Tools/plus/giltest.py – a test platform to measure performance of concurrent cpu threads as well as the performance of pairs of producer/consumer threads synchronized either with threading.Condition or threading.Lock
  2. Tools/plus/testserver.py – a multithreaded webserver using a pool of thread and socketserver.py, being exercised by ab.

On windows, I found it easy to see improvements.  I got the GIL to behave better and I got the web server to increase throughput.  producer/consumer pairs using Condition variables got a real performance boost and those IO threads got a priority boost over regular CPU bound threads as expected.

However, my virtual linux box was more disappointing.  Tests showed that just replacing the native non-recursive lock which was based on the posix sem_t object with a construct using pthread_mutex_t and pthread_cond_t, slowed down execution.

Fixing linux

I decided that there ought ot be no good reason for a pthread_cond_t to be so much slower than a sem_t, so I decided to write my own condition object using a sem_t.  To make a long story short, it worked.  My emulated condition variable (written using a pthread_mutex_t and a sem_t) is faster than a pthread_condition_t. At least on my dual core virtual box.  Go figure.

The making of this new condition variable is a topic for a blog post on its own.  I doggedly refused to look up other implementations of condition variables based on semaphores, and wanted to come up with a solution on my own that did not violate the more subtle promises that the protocol makes.  Along the way, I was guided by failing unittests of the threading.Barrier class, which relies on the underlying threading.Condition to be true to its promise.  I was actually stuck on this problem for a few months, but after a recent eureka moment I think I succeeded.

The results

So, this has been some months in the making.  I set up the header files so that various aspects of my patch could be switched on or off, and a macro especially for performance testing then sets these in a sensible way.

giltest.py

First, the results of the giltest.py file, with various settings of the macro and on windows and linux:

giltest

Some notes on this are in order.

  1. e is “efficiency”, the cpu throughput of two concurrent cpu threads (incrementing a variable) compared to just one thread.
  2. prod/con is a pair of producer/consumer threads using a threading.Lock primitive, and the column shows the number of transactions in a time-frame (one second)
  3. The green bit shows why a GIL improvement was necessary since IO threads just couldn’t get any priority over a cpu thread.  This column is showing prod/con transactions in the presence of a cpu thread.
  4. In the end, the improvements on linux are modest.  Maybe it is because of my virtual machine.  But the Beazley problem is fixed, and IO responsiveness picks up.  On windows it is more pronounced.
  5. The final column is a pair of producer/consumer threads synchronized using a threading.Condition object.  Notice on windows how performance picks up almost threefold, ending up being some 60% of a pair that’s synchronized with a threading.Lock.

 testserver.py

Now for more real-world like results.  Here the aim was to show that running many requests in parallel was better handled using the new system.  Again, improvements on linux are harder to gauge.  In fact, my initial attempts were so disappointing on linux that I almost scrapped the project.  But when I thought to rewrite the condition variable, things changed.

testserver

  1. Notice how performance picks up with “emulated condvar” on linux (green boxes) (on windows, it is always emulated)
  2. p=1 and p=10 are the number of parallel requests that are made.  “ab” is single threaded, it shoots off n requests and then waits for them all to finish before doing the next batch, so this is perhaps not completely real-world.
  3. On linux, rps (requests per second) go up for the multi-threaded case, both when we add the new GIL (better IO responsiveness) and when we add the native threading.Condition.  Combined, it improves 30%.
  4. On windows, we see the same, except that the biggest improvement is when we modify the locks (orange boxes).
  5. On windows, we achieve better throughput with multithreading.  I.e. multiple requests now work better than single requests, whereas on linux, multiple requests performed worse.

Conclusion

These tests were performed on a dual core laptop, running windows 7.  The linux tests were done in a virtual ubuntu machine on the same laptop, using two cpus.  I’m sure that the virtual nature has its effect on the results, and so, caveat emptor.

Overall, we get 30% improvement in responsiveness when there are multiple threads doing requests using this new threading code in Python Plus.  For real world applications serving web pages, that ought to matter.

On windows, the native implementation of threading.Condition provides a staggering 167% boost in performance of two threads doing rendezvous using a condition variable.

While optimizing the linux case, I uncovered how pthread_cond_t is curiously inefficient.  A “greedy” implementation of a condition variable using the posix sem_t showed dramatic improvement on my virtual machine.  I haven’t replicated this on native linux, but I suspect that the implementors of the pthread library are using explicit scheduling, whereas we rely on the presumably greedy scheduling semantics of the semaphore primitive.  But perhaps a separate blog post on this is in order, after some more research.

Fun stuff.

Blog moved

So! My previous blog (at blogs.ccpgames.com) disappeared.  It has taken this long for me to get things back up and running.  The previous blog was on a private server run by an external party and it was compromised by one of those sneaky internet maladies that infect sites like these.

I’m in the process of salvaging all the posts from there and get them running here.  This is a somewhat painstaking process.  I was provided with some files in a folder and I have had to re-learn unix (Its been 10 years since I used that regularly), learn about Apache, WordPress, mysql, multi-site installs and all kinds of things.  10 years ago, they didn’t have sudo.  I’m not sure that I like it.

Anyway, I hope to finish this soon and start blogging again about my adventures in Python. PyCon 2013 is coming up and I have, as always, some ideas to put forward and Swiss army knives to grind.

Zombieframes. A gratuitous optimization?

Examing a recent crash case, I stumbled across this code in frameobject.c:

PyFrameObject *
PyFrame_New(PyThreadState *tstate, PyCodeObject *code, PyObject *globals,
PyObject *locals)
...
if (code->co_zombieframe != NULL) {
f = code->co_zombieframe;
code->co_zombieframe = NULL;
_Py_NewReference((PyObject *)f);
assert(f->f_code == code);
}

Intrigued by the name, I examined the header where it is defined, code.h:

...
void *co_zombieframe; /* for optimization only (see frameobject.c) */
...
} PyCodeObject;

It turns out that for every PyCodeObject object that has been executed, a PyFrameObject of a suitable size is cached and kept with the code object. Now, caching is fine and good, but this cache is unbounded. Every code object has the potential to hang on to a frame, which may then never be released.
Further, there is a separate freelist cache for PyFrameObjects already, in case a frame is not found on the code object:

if (free_list == NULL) {
f = PyObject_GC_NewVar(PyFrameObject, &PyFrame_Type,
extras);
if (f == NULL) {
Py_DECREF(builtins);
return NULL;
}
}
else {
assert(numfree > 0);
--numfree;
f = free_list;
free_list = free_list->f_back;
...

Always concious about memory these days, I tried disabling this in version 3.3 and running the pybench test. I was not able to see any conclusive difference in execution speed.

Update:

Disabling the zombieframe on the PS3 shaved off some 50k on startup.  Not the jackpot, but still, small things add up.

——————————————————————————-
PYBENCH 2.1
——————————————————————————-
* using CPython 3.3.0a3+ (default, May 23 2012, 20:02:34) [MSC v.1600 64 bit (AMD64)]
* disabled garbage collection
* system check interval set to maximum: 2147483647
* using timer: time.perf_counter
* timer: resolution=2.9680909446810176e-07, implementation=QueryPerformanceCounter()

——————————————————————————-
Benchmark: nozombie
——————————————————————————-

Rounds: 10
Warp: 10
Timer: time.perf_counter

Machine Details:
Platform ID: Windows-7-6.1.7601-SP1
Processor: Intel64 Family 6 Model 26 Stepping 5, GenuineIntel

Python:
Implementation: CPython
Executable: D:pydevhgcpython2pcbuildamd64python.exe
Version: 3.3.0a3+
Compiler: MSC v.1600 64 bit (AMD64)
Bits: 64bit
Build: May 23 2012 20:02:34 (#default)
Unicode: UCS4

——————————————————————————-
Comparing with: zombie
——————————————————————————-

Rounds: 10
Warp: 10
Timer: time.perf_counter

Machine Details:
Platform ID: Windows-7-6.1.7601-SP1
Processor: Intel64 Family 6 Model 26 Stepping 5, GenuineIntel

Python:
Implementation: CPython
Executable: D:pydevhgcpython2pcbuildamd64python.exe
Version: 3.3.0a3+
Compiler: MSC v.1600 64 bit (AMD64)
Bits: 64bit
Build: May 23 2012 20:00:42 (#default)
Unicode: UCS4

Test minimum run-time average run-time
this other diff this other diff
——————————————————————————-
BuiltinFunctionCalls: 51ms 52ms -3.3% 52ms 53ms -2.0%
BuiltinMethodLookup: 33ms 33ms +0.0% 34ms 34ms +0.8%
CompareFloats: 50ms 50ms +0.1% 50ms 50ms +0.4%
CompareFloatsIntegers: 99ms 98ms +0.8% 99ms 99ms +0.6%
CompareIntegers: 77ms 77ms -0.5% 77ms 77ms -0.3%
CompareInternedStrings: 60ms 60ms +0.0% 61ms 61ms -0.1%
CompareLongs: 46ms 45ms +1.5% 46ms 45ms +1.2%
CompareStrings: 61ms 59ms +3.6% 61ms 59ms +3.6%
ComplexPythonFunctionCalls: 60ms 58ms +3.3% 60ms 58ms +3.2%
ConcatStrings: 48ms 47ms +2.4% 48ms 47ms +2.1%
CreateInstances: 58ms 57ms +1.3% 59ms 58ms +1.3%
CreateNewInstances: 43ms 43ms +1.1% 44ms 44ms +1.1%
CreateStringsWithConcat: 79ms 79ms -0.3% 79ms 79ms -0.1%
DictCreation: 71ms 71ms +0.4% 72ms 72ms +1.0%
DictWithFloatKeys: 72ms 70ms +2.1% 72ms 71ms +1.8%
DictWithIntegerKeys: 46ms 46ms +0.7% 46ms 46ms +0.4%
DictWithStringKeys: 41ms 41ms +0.0% 41ms 41ms -0.1%
ForLoops: 35ms 37ms -4.0% 35ms 37ms -4.0%
IfThenElse: 64ms 64ms -0.1% 64ms 64ms -0.4%
ListSlicing: 49ms 50ms -1.0% 53ms 53ms -0.8%
NestedForLoops: 54ms 51ms +6.7% 55ms 51ms +6.7%
NestedListComprehensions: 54ms 54ms -0.7% 54ms 55ms -2.2%
NormalClassAttribute: 94ms 94ms +0.1% 94ms 94ms +0.1%
NormalInstanceAttribute: 54ms 54ms +0.3% 54ms 54ms +0.2%
PythonFunctionCalls: 58ms 57ms +0.8% 58ms 58ms +0.6%
PythonMethodCalls: 65ms 61ms +6.3% 66ms 62ms +5.9%
Recursion: 84ms 85ms -1.0% 85ms 85ms -0.9%
SecondImport: 74ms 76ms -2.5% 74ms 77ms -3.5%
SecondPackageImport: 75ms 78ms -3.8% 76ms 79ms -3.9%
SecondSubmoduleImport: 163ms 169ms -3.4% 164ms 170ms -3.3%
SimpleComplexArithmetic: 43ms 43ms +1.0% 43ms 43ms +1.0%
SimpleDictManipulation: 80ms 78ms +2.2% 81ms 79ms +2.4%
SimpleFloatArithmetic: 42ms 42ms +0.1% 42ms 42ms -0.0%
SimpleIntFloatArithmetic: 52ms 53ms -1.2% 52ms 53ms -1.1%
SimpleIntegerArithmetic: 52ms 52ms -0.7% 52ms 53ms -0.8%
SimpleListComprehensions: 45ms 45ms -0.2% 45ms 45ms +0.3%
SimpleListManipulation: 44ms 46ms -4.0% 44ms 46ms -3.9%
SimpleLongArithmetic: 32ms 32ms -0.9% 32ms 32ms -0.1%
SmallLists: 58ms 57ms +1.2% 58ms 67ms -12.8%
SmallTuples: 64ms 65ms -0.5% 65ms 65ms -0.2%
SpecialClassAttribute: 148ms 149ms -0.8% 149ms 150ms -1.0%
SpecialInstanceAttribute: 54ms 54ms +0.2% 54ms 54ms +0.0%
StringMappings: 120ms 117ms +2.5% 120ms 117ms +2.5%
StringPredicates: 62ms 62ms +0.9% 62ms 62ms +1.0%
StringSlicing: 69ms 68ms +1.6% 69ms 68ms +2.1%
TryExcept: 37ms 37ms +0.0% 37ms 37ms +0.5%
TryFinally: 40ms 37ms +6.7% 40ms 37ms +6.5%
TryRaiseExcept: 19ms 20ms -1.0% 20ms 20ms -0.4%
TupleSlicing: 65ms 65ms +0.5% 66ms 65ms +1.2%
WithFinally: 57ms 56ms +1.9% 57ms 56ms +2.1%
WithRaiseExcept: 53ms 53ms +0.3% 54ms 54ms -0.8%
——————————————————————————-
Totals: 3154ms 3145ms +0.3% 3176ms 3177ms -0.0%

(this=nozombie, other=zombie)

I’m going to remove this weird, unbounded cache from the python interpreter we use on the PS3.

Clearing weakrefs

I just had this problem which would have been elegantly solved with the ability to manually clear weak references pointing to an object. I am (for technical reasons) recycling an object, so instead of killing it and re-creating it, I re-initialize it. But that leaves old weak references in place. How nice wouldn’t it be to be able to call “myobject.clear_weakrefs()”?