Python 3000

I just found a document I wrote back in 2007 about the, then proposed, Python3000 project. This later became Python 3.0. I thought it might be amusing to re-print it here. Some of the musings in there later found its way into python albeit much later.

I may have posted this to python-dev at the time, or just left it in my outbox. I can’t remember. At the time, I had become a bit frustrated with the way things were being run in the organization so I may have not communicated this at all.

So, here is the stuff: Enjoy.

Issues and points about Python 3000

We at CCP think it is important to use the breaking change that is Python 3000 to do some more radical reforms, which otherwise are probably not going to happen ever.  Since things are changing anyway, why not change them a bit more and make everything better?

  • C API is inconsistent. Need to adapt a consistency, such as always returning a new reference by indirection (borrow idea from COM)
    //create a new integer. A new object returned in *result
    bool PyInt_FromLong(PyObject **result, long value);
    //look up stuff in the dict. Return a borrowed reference
    PyObject *PyDict_GetItem(PyObject *key);
  • Python3000 Should really move into C99 (or even C++?) land. After all, platforms are being dropped and every one should provide C99 support (gcc3.0 and beyond). What I want so see is not class-based implementation of Python, but rather language features such as:
    1. the inline keyword. Get rid of macros where they aren’t needed (C99)
    2. in-code declaration of variables (C99).
    3. the bool keyword (C99)
    4. Possibly use std c++ library features where useful, such as for strings/unicode and some basic STL containers (C++).
    5. There are toolkits like Boost.python and swig for making it easier to build C extensions.  Why doesn’t python do such things itself for builtin types, to reduce the size of the code and increase reliability?  Why are such layers needed in the first place?
  • It should be easier to inject custom memory handlers into python. Currently this is a nightmare because of the three different memory APIs used (object alloc, malloc, pymalloc). This would be made much easier using inline functions, see above.
  • Too often Exceptions are used for control flow. Even in the C api. For example, the PyObject_HasAttr() API is implemented through a PyObject_GetAttr() which then clears the generated exception. Creating these exceptions is expensive. The PyObject_HasAttr() should really be implemented properly as a query function which optionally can raise an exception:
    //look up an attribute.  If "hard" is false, no exception is raised if it is not found.
    bool PyObject_GetAttr(PyObject **result, PyObject *object, PyObject *attrib, bool hard);
  • (Possibly already addressed?) Fix exception handling semantics. Python has no idea of being “in an exception handling context.” Therefore, the exception last caught sits in the TLS even after it has been handled. Exception handling context should be separately stackable. When one is executing the exception handler for exception A, that exception is current. We can then raise and catch a different exception during the handling of A, say B. We can handle B, then after that except: clause, B is popped and we get back to handling A. When exiting the except: clause for A, A is popped an no exception is current. This achieves two things:
    1. the exception A doesn’t linger within the interpreter any longer than necessary, keeping huge resources and stack frames alive.
    2. It is possible to query the language if it is in an exception handling context. The presence of a current exception would indicate this.
  • We want to help guide Python 3000 to be future safe. Make a sensible api. Think about threads. Think about C++ and boost. Think about hard/soft stack switching and IO.
    1. Make the API so that the code inherently uses a C stack free internal implementation. Just an execution engine. Stackless.
    2. consider using Boost python, or conversely, make Boost benefit from Python. This might mean that the whole thing ought to be written in C++ anyway. At the very least, consult with the maker of Boost Python.  Currently too much boilerplate in the implementation of Python and also when creating extensions and types – repeating the same macros over and over is not fun. 
    3. The point is, soft switching could be a step to unify generators/co-routines in Python with stackless. Why not use the same kind of continuation-like mechanism for both?
    4. Hard switching is needed to do tricks from C code. This is why you can’t write a generator function in C
  • Multi threading! All the processor growth the next few years will be horizontal, with added cpus. To leverage that, games have to be more and more threaded. Several areas of Python don’t jive with threading:
    1. Reference counting (atomic lock-and-increment could help)
    2. global constructs such as the linked list of objects, which brings us to:
    3. garbage collection.
    4. Various modules have their own nifty little recycling mechanism for objecst, such as string, tuple, int, etc etc. They maintain lists of empty objects for usage and these are not threadsafe.
    5. The interpreter has a GIL. Maybe it is okay, but lots of stuff in python can have the side effect of running python code.
    6. We are mainly thinking about parallelizing stuff from the C API side.
    7. Worker threads need to be able to manipulate python constructs, and create python constructs for consumption of other threads.

Mutable default arguments are your friend.

In a recent comment to an elderly post of mine, I was asked about the following code:

def mywrapper(func, args=(), kwargs={}):
    ...

The commenter though that I should have made a special mention about using dict as a default argument, “because it’s such a common gotcha.”

My response is twofold:

  1. This particular case is idiomatic, and widely used for functions that call other functions.
  2. I actually don’t think mutable default arguments are a problem and that they don’t deserve all the stigma they are getting.

I want to expand on point 2 a bit here.

Continue reading

PythonPlus

Time to write a little bit about this little project of mine.

tl;dr

Multithreading more responsive in a Python 2.7.  30% more requests per second.  Satisfaction guaranteed!

Introduction

After leaving my old job at CCP Games last year, I had the urge to try to collect some of the stuff that we had done for Python 2.7 over there and make it available to the world.  So I started this little fork off 2.7.

The idea was to have a place to add “improvements” to vanilla (as opposed to Stackless) 2.7, so that they could be kept separately and in sync with CPython 2.7.

Thus far, what I’ve been mostly focusing on is modernizing thread support.  (for a full list of changes, see the whatsnew file).

When we were working on DUST 514 for the Playstation I had to make certain improvements to make networking work more efficiently on that platform.  We were interfacing stackless python with the native http api of the PS3, and had to use blocking worker threads.  Marshaling from those threads to tasklets was causing unnecessary latency.

We ended up doing a lot of experiments with condition variables, in the end, providing native C implementations to minimize GIL thrashing and reducing wakeup latency to the minimum.

In PythonPlus I have done this and some other stuff in order to improve threading performance.

The threading related changes cover among other things:

  1. Adding timeout parameters to blocking calls as in the 3.x api.
  2. Adding a native threading.Condition object
  3. Improving the GIL

Adding a native Condition object aims to reduce the thread thrashing that is otherwise associated with condition variables, since a lot lof locking and context switching needs to happen for a thread to wake up with the normal .py version of those constructs.  To do this, however, the internal non-recursive locks need to be implemented using a lock and a condition variable themselves, rather than using native semaphore objects.

Changing the lock types used required the GIL to be visited, since the behaviour of the old GIL was just a random side effect of the choice of internal locks.  This also allowed me to address the old Beazley problem while at it.

The GIL change is minor.  It is simply a separate function, and when a CPU bound thread wishes to yield the GIL to another thread, it calls a new api function, _PyThread_yield_GIL().  Threads that are trying to re-aquire the GIL after unlocking them, are considered to be IO threads and have priority for the GIL when a CPU thread yields it.  But if no such thread is present, then the GIL won’t actually be yielded 99 out of every 100 yields.  This minimizes unnecessary thrashing among CPU threads, while allowing IO threads to quickly get their foot in when required.

Performance

I quickly got this all up and running, but then I had to prove it to be actually better than regular 2.7.  To do this, I set up two test scenarios:

  1. Tools/plus/giltest.py – a test platform to measure performance of concurrent cpu threads as well as the performance of pairs of producer/consumer threads synchronized either with threading.Condition or threading.Lock
  2. Tools/plus/testserver.py – a multithreaded webserver using a pool of thread and socketserver.py, being exercised by ab.

On windows, I found it easy to see improvements.  I got the GIL to behave better and I got the web server to increase throughput.  producer/consumer pairs using Condition variables got a real performance boost and those IO threads got a priority boost over regular CPU bound threads as expected.

However, my virtual linux box was more disappointing.  Tests showed that just replacing the native non-recursive lock which was based on the posix sem_t object with a construct using pthread_mutex_t and pthread_cond_t, slowed down execution.

Fixing linux

I decided that there ought ot be no good reason for a pthread_cond_t to be so much slower than a sem_t, so I decided to write my own condition object using a sem_t.  To make a long story short, it worked.  My emulated condition variable (written using a pthread_mutex_t and a sem_t) is faster than a pthread_condition_t. At least on my dual core virtual box.  Go figure.

The making of this new condition variable is a topic for a blog post on its own.  I doggedly refused to look up other implementations of condition variables based on semaphores, and wanted to come up with a solution on my own that did not violate the more subtle promises that the protocol makes.  Along the way, I was guided by failing unittests of the threading.Barrier class, which relies on the underlying threading.Condition to be true to its promise.  I was actually stuck on this problem for a few months, but after a recent eureka moment I think I succeeded.

The results

So, this has been some months in the making.  I set up the header files so that various aspects of my patch could be switched on or off, and a macro especially for performance testing then sets these in a sensible way.

giltest.py

First, the results of the giltest.py file, with various settings of the macro and on windows and linux:

giltest

Some notes on this are in order.

  1. e is “efficiency”, the cpu throughput of two concurrent cpu threads (incrementing a variable) compared to just one thread.
  2. prod/con is a pair of producer/consumer threads using a threading.Lock primitive, and the column shows the number of transactions in a time-frame (one second)
  3. The green bit shows why a GIL improvement was necessary since IO threads just couldn’t get any priority over a cpu thread.  This column is showing prod/con transactions in the presence of a cpu thread.
  4. In the end, the improvements on linux are modest.  Maybe it is because of my virtual machine.  But the Beazley problem is fixed, and IO responsiveness picks up.  On windows it is more pronounced.
  5. The final column is a pair of producer/consumer threads synchronized using a threading.Condition object.  Notice on windows how performance picks up almost threefold, ending up being some 60% of a pair that’s synchronized with a threading.Lock.

 testserver.py

Now for more real-world like results.  Here the aim was to show that running many requests in parallel was better handled using the new system.  Again, improvements on linux are harder to gauge.  In fact, my initial attempts were so disappointing on linux that I almost scrapped the project.  But when I thought to rewrite the condition variable, things changed.

testserver

  1. Notice how performance picks up with “emulated condvar” on linux (green boxes) (on windows, it is always emulated)
  2. p=1 and p=10 are the number of parallel requests that are made.  “ab” is single threaded, it shoots off n requests and then waits for them all to finish before doing the next batch, so this is perhaps not completely real-world.
  3. On linux, rps (requests per second) go up for the multi-threaded case, both when we add the new GIL (better IO responsiveness) and when we add the native threading.Condition.  Combined, it improves 30%.
  4. On windows, we see the same, except that the biggest improvement is when we modify the locks (orange boxes).
  5. On windows, we achieve better throughput with multithreading.  I.e. multiple requests now work better than single requests, whereas on linux, multiple requests performed worse.

Conclusion

These tests were performed on a dual core laptop, running windows 7.  The linux tests were done in a virtual ubuntu machine on the same laptop, using two cpus.  I’m sure that the virtual nature has its effect on the results, and so, caveat emptor.

Overall, we get 30% improvement in responsiveness when there are multiple threads doing requests using this new threading code in Python Plus.  For real world applications serving web pages, that ought to matter.

On windows, the native implementation of threading.Condition provides a staggering 167% boost in performance of two threads doing rendezvous using a condition variable.

While optimizing the linux case, I uncovered how pthread_cond_t is curiously inefficient.  A “greedy” implementation of a condition variable using the posix sem_t showed dramatic improvement on my virtual machine.  I haven’t replicated this on native linux, but I suspect that the implementors of the pthread library are using explicit scheduling, whereas we rely on the presumably greedy scheduling semantics of the semaphore primitive.  But perhaps a separate blog post on this is in order, after some more research.

Fun stuff.

Blog moved

So! My previous blog (at blogs.ccpgames.com) disappeared.  It has taken this long for me to get things back up and running.  The previous blog was on a private server run by an external party and it was compromised by one of those sneaky internet maladies that infect sites like these.

I’m in the process of salvaging all the posts from there and get them running here.  This is a somewhat painstaking process.  I was provided with some files in a folder and I have had to re-learn unix (Its been 10 years since I used that regularly), learn about Apache, WordPress, mysql, multi-site installs and all kinds of things.  10 years ago, they didn’t have sudo.  I’m not sure that I like it.

Anyway, I hope to finish this soon and start blogging again about my adventures in Python. PyCon 2013 is coming up and I have, as always, some ideas to put forward and Swiss army knives to grind.

Zombieframes. A gratuitous optimization?

Examing a recent crash case, I stumbled across this code in frameobject.c:

PyFrameObject *
PyFrame_New(PyThreadState *tstate, PyCodeObject *code, PyObject *globals,
PyObject *locals)
...
if (code->co_zombieframe != NULL) {
f = code->co_zombieframe;
code->co_zombieframe = NULL;
_Py_NewReference((PyObject *)f);
assert(f->f_code == code);
}

Intrigued by the name, I examined the header where it is defined, code.h:

...
void *co_zombieframe; /* for optimization only (see frameobject.c) */
...
} PyCodeObject;

It turns out that for every PyCodeObject object that has been executed, a PyFrameObject of a suitable size is cached and kept with the code object. Now, caching is fine and good, but this cache is unbounded. Every code object has the potential to hang on to a frame, which may then never be released.
Further, there is a separate freelist cache for PyFrameObjects already, in case a frame is not found on the code object:

if (free_list == NULL) {
f = PyObject_GC_NewVar(PyFrameObject, &PyFrame_Type,
extras);
if (f == NULL) {
Py_DECREF(builtins);
return NULL;
}
}
else {
assert(numfree > 0);
--numfree;
f = free_list;
free_list = free_list->f_back;
...

Always concious about memory these days, I tried disabling this in version 3.3 and running the pybench test. I was not able to see any conclusive difference in execution speed.

Update:

Disabling the zombieframe on the PS3 shaved off some 50k on startup.  Not the jackpot, but still, small things add up.

——————————————————————————-
PYBENCH 2.1
——————————————————————————-
* using CPython 3.3.0a3+ (default, May 23 2012, 20:02:34) [MSC v.1600 64 bit (AMD64)]
* disabled garbage collection
* system check interval set to maximum: 2147483647
* using timer: time.perf_counter
* timer: resolution=2.9680909446810176e-07, implementation=QueryPerformanceCounter()

——————————————————————————-
Benchmark: nozombie
——————————————————————————-

Rounds: 10
Warp: 10
Timer: time.perf_counter

Machine Details:
Platform ID: Windows-7-6.1.7601-SP1
Processor: Intel64 Family 6 Model 26 Stepping 5, GenuineIntel

Python:
Implementation: CPython
Executable: D:pydevhgcpython2pcbuildamd64python.exe
Version: 3.3.0a3+
Compiler: MSC v.1600 64 bit (AMD64)
Bits: 64bit
Build: May 23 2012 20:02:34 (#default)
Unicode: UCS4

——————————————————————————-
Comparing with: zombie
——————————————————————————-

Rounds: 10
Warp: 10
Timer: time.perf_counter

Machine Details:
Platform ID: Windows-7-6.1.7601-SP1
Processor: Intel64 Family 6 Model 26 Stepping 5, GenuineIntel

Python:
Implementation: CPython
Executable: D:pydevhgcpython2pcbuildamd64python.exe
Version: 3.3.0a3+
Compiler: MSC v.1600 64 bit (AMD64)
Bits: 64bit
Build: May 23 2012 20:00:42 (#default)
Unicode: UCS4

Test minimum run-time average run-time
this other diff this other diff
——————————————————————————-
BuiltinFunctionCalls: 51ms 52ms -3.3% 52ms 53ms -2.0%
BuiltinMethodLookup: 33ms 33ms +0.0% 34ms 34ms +0.8%
CompareFloats: 50ms 50ms +0.1% 50ms 50ms +0.4%
CompareFloatsIntegers: 99ms 98ms +0.8% 99ms 99ms +0.6%
CompareIntegers: 77ms 77ms -0.5% 77ms 77ms -0.3%
CompareInternedStrings: 60ms 60ms +0.0% 61ms 61ms -0.1%
CompareLongs: 46ms 45ms +1.5% 46ms 45ms +1.2%
CompareStrings: 61ms 59ms +3.6% 61ms 59ms +3.6%
ComplexPythonFunctionCalls: 60ms 58ms +3.3% 60ms 58ms +3.2%
ConcatStrings: 48ms 47ms +2.4% 48ms 47ms +2.1%
CreateInstances: 58ms 57ms +1.3% 59ms 58ms +1.3%
CreateNewInstances: 43ms 43ms +1.1% 44ms 44ms +1.1%
CreateStringsWithConcat: 79ms 79ms -0.3% 79ms 79ms -0.1%
DictCreation: 71ms 71ms +0.4% 72ms 72ms +1.0%
DictWithFloatKeys: 72ms 70ms +2.1% 72ms 71ms +1.8%
DictWithIntegerKeys: 46ms 46ms +0.7% 46ms 46ms +0.4%
DictWithStringKeys: 41ms 41ms +0.0% 41ms 41ms -0.1%
ForLoops: 35ms 37ms -4.0% 35ms 37ms -4.0%
IfThenElse: 64ms 64ms -0.1% 64ms 64ms -0.4%
ListSlicing: 49ms 50ms -1.0% 53ms 53ms -0.8%
NestedForLoops: 54ms 51ms +6.7% 55ms 51ms +6.7%
NestedListComprehensions: 54ms 54ms -0.7% 54ms 55ms -2.2%
NormalClassAttribute: 94ms 94ms +0.1% 94ms 94ms +0.1%
NormalInstanceAttribute: 54ms 54ms +0.3% 54ms 54ms +0.2%
PythonFunctionCalls: 58ms 57ms +0.8% 58ms 58ms +0.6%
PythonMethodCalls: 65ms 61ms +6.3% 66ms 62ms +5.9%
Recursion: 84ms 85ms -1.0% 85ms 85ms -0.9%
SecondImport: 74ms 76ms -2.5% 74ms 77ms -3.5%
SecondPackageImport: 75ms 78ms -3.8% 76ms 79ms -3.9%
SecondSubmoduleImport: 163ms 169ms -3.4% 164ms 170ms -3.3%
SimpleComplexArithmetic: 43ms 43ms +1.0% 43ms 43ms +1.0%
SimpleDictManipulation: 80ms 78ms +2.2% 81ms 79ms +2.4%
SimpleFloatArithmetic: 42ms 42ms +0.1% 42ms 42ms -0.0%
SimpleIntFloatArithmetic: 52ms 53ms -1.2% 52ms 53ms -1.1%
SimpleIntegerArithmetic: 52ms 52ms -0.7% 52ms 53ms -0.8%
SimpleListComprehensions: 45ms 45ms -0.2% 45ms 45ms +0.3%
SimpleListManipulation: 44ms 46ms -4.0% 44ms 46ms -3.9%
SimpleLongArithmetic: 32ms 32ms -0.9% 32ms 32ms -0.1%
SmallLists: 58ms 57ms +1.2% 58ms 67ms -12.8%
SmallTuples: 64ms 65ms -0.5% 65ms 65ms -0.2%
SpecialClassAttribute: 148ms 149ms -0.8% 149ms 150ms -1.0%
SpecialInstanceAttribute: 54ms 54ms +0.2% 54ms 54ms +0.0%
StringMappings: 120ms 117ms +2.5% 120ms 117ms +2.5%
StringPredicates: 62ms 62ms +0.9% 62ms 62ms +1.0%
StringSlicing: 69ms 68ms +1.6% 69ms 68ms +2.1%
TryExcept: 37ms 37ms +0.0% 37ms 37ms +0.5%
TryFinally: 40ms 37ms +6.7% 40ms 37ms +6.5%
TryRaiseExcept: 19ms 20ms -1.0% 20ms 20ms -0.4%
TupleSlicing: 65ms 65ms +0.5% 66ms 65ms +1.2%
WithFinally: 57ms 56ms +1.9% 57ms 56ms +2.1%
WithRaiseExcept: 53ms 53ms +0.3% 54ms 54ms -0.8%
——————————————————————————-
Totals: 3154ms 3145ms +0.3% 3176ms 3177ms -0.0%

(this=nozombie, other=zombie)

I’m going to remove this weird, unbounded cache from the python interpreter we use on the PS3.

Clearing weakrefs

I just had this problem which would have been elegantly solved with the ability to manually clear weak references pointing to an object. I am (for technical reasons) recycling an object, so instead of killing it and re-creating it, I re-initialize it. But that leaves old weak references in place. How nice wouldn’t it be to be able to call “myobject.clear_weakrefs()”?

Temporary thread state overhead

When doing IO, it is sometimes useful for a worker thread to notify Python that something has happened. Previously we have just had the Python main thread “Poll” some external variable for that, but recently we have been experimenting with having the main thread just grab the GIL and perform python work itself.

This should be straightforward. Python has an api called PyGILState_Ensure() that can be called on any thread. If that thread doesn’t already have a Python thread state, it will create a temporary one. Such a thread is sometimes called an external thread.

On a server loaded to some 40% with IO, this is what happened when I turned on this feature:

process cpu

The dark gray area is main thread CPU, (initially at around 40%) and the rest is other threads.  Turning on the “ThreadWakeup” feature adds some 20% extra cpu work to the process.

When the main thread is not working, it is idle doing a MsgWaitForMultipleObjects() Windows system call (with the GIL unclaimed).  So the worker thread should have no problem acquiring the GIL.  Further, there is only ever one woker thread doing a PyGILState_Ensure()/PyGILState_Release() at the same time, and this is ensured using locking on the worker thread side.

Further tests seem to confirm that if the worker thread already owns a Python thread state, and uses that to aquire the GIL (using a PyEval_RestoreThread() call) this overhead goes away.

This was surprising to me, but it seems to indicate that it is very expensive to “acquire a thread state on demand” to claim the GIL.  This is very unfortunate, because it means that one cannot easily use arbitrary system threads to call into Python without significant overhead.  These might be threads from the Windows thread pool for example, threads that we have no control over and therefore cannot assign thread state to.

I will try to investigate this furter, to see where the overhead is coming from.  It could be the extra TLS calls made, or simply the cost of malloc()/free() involved.  Depending on the results, there are a few options:

  1. Keep a single thread state on the side for (the single) external thread that can claim the GIL at a time, ready and initialized.
  2. Allow an external thread to ‘borrow’ another thread state and not use its own.
  3. Streamline the stuff already present.

Update, oct. 6th 2011:
Enabling dynamic GIL with tread state caching did notthing to solve this issue.
I think the problem is likely to be that spin locking is in effect for the GIL. I’ll see what happens if I explicitly define the GIL to not use spin locking.

namedtuple and exec()

In our port of Python 2.7 to the PS3 console, we have deliberately removed the python compiler. This was mainly done to save on the code size, since on a console every byte is sacred.  An additional benefit is slight hardening against certain kinds of attacks, since evil constructs such as eval() and exec() now raise the NotImplementedError when used.

Program code is pre-compiled and put in .zip archives so there is no need for regular compilation on the console. The most serious problem we encountered though, was with the new namedtuple construct.

The namedtuple is implemented in the collections module by constructing a class declaration with string interpolation and then calling exec() on it. With exec() removed, a lot of the standard library turned out to fail on import.

Our initial fix was simply to replace the namedtuples with regular tuples:
[python]
def namedtuple(typename, field_names, verbose=False, rename=False):
return tuple
[/python]
This worked surprisingly well. The parts of the library we were using were still using namedtuples just like regular tuples and all was well.

Recently, however, we found that the urlparse module was making non-trivial use of it so something needed to be done.  My initial reflex was to dive in and reimplement it using a metaclass or some such. But then I thought of asking the internet.

It turns out that this exists as an issue in the Python bug tracker.  Someone else had come across this oddity in the standard library and submitted an alternative implementation.  This works perfectly for our purposes.

I know that there is nothing inherently evil about using exec in Python, but this particular case still doesn’t quite ring true to me:  If the best way to implement a class is by resorting to the meta-language, doesn’t that indicate some shortcoming in the language itself?