What is Stackless?

I sometimes get this question. And instead of starting a rant about microthreads, co-routines, tasklets and channels, I present the essential piece of code from the implementation:

The Code:

/*
    the frame dispatcher will execute frames and manage
    the frame stack until the "previous" frame reappears.
    The "Mario" code if you know that game :-)
 */

PyObject *
slp_frame_dispatch(PyFrameObject *f, PyFrameObject *stopframe, int exc, PyObject *retval)
{
    PyThreadState *ts = PyThreadState_GET();

    ++ts->st.nesting_level;

/*
    frame protocol:
    If a frame returns the Py_UnwindToken object, this
    indicates that a different frame will be run.
    Semantics of an appearing Py_UnwindToken:
    The true return value is in its tempval field.
    We always use the topmost tstate frame and bail
    out when we see the frame that issued the
    originating dispatcher call (which may be a NULL frame).
 */

    while (1) {
        retval = f->f_execute(f, exc, retval);
        if (STACKLESS_UNWINDING(retval))
            STACKLESS_UNPACK(retval);
        /* A soft switch is only complete here */
        Py_CLEAR(ts->st.del_post_switch);
        f = ts->frame;
        if (f == stopframe)
            break;
        exc = 0;
    }
    --ts->st.nesting_level;
    /* see whether we need to trigger a pending interrupt */
    /* note that an interrupt handler guarantees current to exist */
    if (ts->st.interrupt != NULL &&
        ts->st.current->flags.pending_irq)
        slp_check_pending_irq();
    return retval;
}

(This particular piece of code is taken from an experimental branch called stackless-tealet, selected for clarity)

What is it?

It is the frame execution code: a top-level loop that executes Python function frames. A “frame” holds the code and execution state of a Python function while it runs.

Why is it important?

It is important in the way it contrasts with regular CPython.

Regular CPython uses the C execution stack, mirroring the execution stack of the Python program that it is interpreting. When a Python function foo() calls a Python function bar(), this happens by a recursive invocation of the C function PyEval_EvalFrame(). This means that in order to reach a certain state of execution of a CPython program, the interpreter needs to be in a corresponding state of C recursion.

In Stackless Python, the C stack is decoupled from the Python stack as much as possible. The next frame to be executed is placed in ts->frame and the frame chain is executed in a loop.

This allows two important things:

  1. The state of execution of a Stackless Python program can be saved and restored easily. All that is required is the ability to pickle execution frames and other runtime structures (Stackless adds that pickling functionality). The recursion state of a Python program can be restored without the interpreter having to re-enter the same level of C recursion.
  2. Frames can be executed in any order. This allows many tasklets to be created, with code that switches between them: microthreads, if you will, or co-routines, if you prefer that term, but without forcing the use of the generator mechanism that CPython has (in fact, generators can be implemented more easily and elegantly on top of this system). A minimal example of tasklets and channels follows below.
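
Here is a minimal sketch of that idea, using the public stackless API (tasklet, channel, run); it is illustrative only and not taken from the Stackless sources:

import stackless

ch = stackless.channel()

def producer():
    for i in range(3):
        ch.send(i)          # blocks until a receiver is ready

def consumer():
    for _ in range(3):
        print ch.receive()  # blocks until a sender is ready

stackless.tasklet(producer)()
stackless.tasklet(consumer)()
stackless.run()             # run the scheduler until no tasklets remain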

That’s it!

Stackless Python is stackless because the C stack has been decoupled from the Python stack. Context switches become possible, as does the dynamic management of execution state.

Okay, there’s more:

  • Stack slicing: A clever way of switching context even when the C stack gets in the way
  • A framework of tasklets and channels to exploit execution context switching
  • A scheduler to keep everything running

Sadly, development and support for Stackless Python has slowed down in the last few years. It astonishes me, however, that the core idea of stacklessness still hasn’t been embraced by CPython.

Killing Considered Benign

I made a short presentation to my colleagues the other day about how we use the killing of tasklets as a clean and elegant way to tear down services and workers in a Stackless Python program.

My colleague Rob Galanakis wrote a short blog post on his impressions of it.

Here are the slides, for those interested.
Death to Tasklets

Atomic

After a long hiatus, the Cosmic Percolator is back in action.  Now it is time to rant about all things Python, I think.  Let’s start with this here, which came out from work I did last year.

Stackless has had an “atomic” feature for a long time. In this post I am going to explain its purpose and how I recently extended it to make working with OS threads easier.

Scheduling

In Stackless Python, scheduling is cooperative.  This means that a tasklet normally runs uninterrupted until it explicitly does something that would cause another one to run, like sending a message over a channel.  This allows one to write logic in Stackless without worrying too much about synchronization.

However, there is an important exception to this: it is possible to run Stackless tasklets through the watchdog, and this will interrupt a running tasklet if it exceeds a pre-determined number of executed opcodes:

while True:
    interrupted = stackless.run(100)
    if interrupted:
        print interrupted, "has been running quite a bit!"
        interrupted.insert()
    else:
        break # Ok, nothing runnable anymore

This code may cause a tasklet to be interrupted at an arbitrary point (actually at a tick-interval boundary, the same point at which the GIL is yielded) and cause a switch to the main tasklet.

Of course, not all code uses this execution mode, but nevertheless it has always been considered a good idea to be aware of it.  For this reason, an atomic mode is supported which inhibits this involuntary switching in sensitive areas:

oldvalue = stackless.getcurrent().set_atomic(1)
try:
    myglobalvariable1 += 1
    myglobalvariable2 += 2
finally:
    stackless.getcurrent().set_atomic(oldvalue)

The above is then optionally wrapped in a context manager for readability:

import contextlib

@contextlib.contextmanager
def atomic():
    oldv = stackless.getcurrent().set_atomic(1)
    try:
        yield None
    finally:
        stackless.getcurrent().set_atomic(oldv)

The atomic state is a property of each tasklet, so even when voluntary switching is performed while a non-zero atomic state is in effect, it has no effect on other tasklets.  Its only effect is to inhibit involuntary switching of the tasklet on which it is set.

A Concrete Example

To better illustrate its use, let’s take a look at the implementation of the Semaphore from stacklesslib (stacklesslib.locks.Semaphore):

class Semaphore(LockMixin):
    def __init__(self, value=1):
        if value < 0:
            raise ValueError
        self._value = value
        self._chan = stackless.channel()
        set_channel_pref(self._chan)

    def acquire(self, blocking=True, timeout=None):
        with atomic():
            # Low contention logic: There is no explicit handoff to a target,
            # rather, each tasklet gets its own chance at acquiring the semaphore.
            got_it = self._try_acquire()
            if got_it or not blocking:
                return got_it

            wait_until = None
            while True:
                if timeout is not None:
                    # Adjust time.  We may have multiple wakeups since we are a
                    # low-contention lock.
                    if wait_until is None:
                        wait_until = elapsed_time() + timeout
                    else:
                        timeout = wait_until - elapsed_time()
                        if timeout < 0:
                            return False
                try:
                    lock_channel_wait(self._chan, timeout)
                except:
                    self._safe_pump()
                    raise
                if self._try_acquire():
                    return True

    def _try_acquire(self):
        if self._value > 0:
            self._value -= 1
            return True
        return False

This code illustrates how the atomic state is set (via a context manager) and kept non-zero while we are doing potentially sensitive things, in this case logic based on self._value. Since this code is used to implement a Semaphore, which itself forms the basis of other stacklesslib.locks objects such as CriticalSection and Condition, this is the only way we have to ensure atomicity.

Threads

It is worth noting that use of the atomic property has largely been confined to library code such as the above.  Most Stackless programs indeed do not run the watchdog in interruptible mode, or they use the so-called soft-interrupt mode, which breaks into the scheduler only at the aforementioned voluntary switch points.

However, in the last two years or so, I have been increasingly using Stackless Python in conjunction with OS threads.  All the Stackless constructs, such as channels and tasklets, work with threads, with the caveat that a synchronized rendezvous isn’t possible between tasklets of different threads.  A channel.send() where the recipient is a tasklet on a different thread from the sender will always cause the target to become runnable on that thread, rather than cause immediate switching.
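
To illustrate that behaviour, here is a hedged sketch (not from the post); the busy loops that pump each thread’s scheduler are a simplification to keep the example self-contained:

import threading
import stackless

ch = stackless.channel()
finished = []

def worker_thread():
    def producer():
        ch.send("hello from another thread")  # no rendezvous: the receiver merely becomes runnable
    t = stackless.tasklet(producer)()
    while t.alive:
        stackless.run()     # keep this thread's scheduler going until the send has completed

def consumer():
    print ch.receive()
    finished.append(True)

stackless.tasklet(consumer)()
worker = threading.Thread(target=worker_thread)
worker.start()
while not finished:
    stackless.run()         # pump the main thread's scheduler until the consumer has run
worker.join()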

Using threads has many benefits.  For one, it simplifies certain IO operations.  Handing a job to a tasklet on a different thread won’t block the main thread.  And using the usual tasklet communication channels to talk uniformly to all tasklets, whether they belong to this thread or another, makes the architecture uniform and elegant.

The locking constructs in stacklesslib also all make use of non-immediate scheduling.  While we use the stackless.channel object to wait, we make no assumptions about immediate execution when a target is woken up.  This makes them usable for synchronization between tasklets of different threads.

Or, this is what I thought, until I started getting strange errors and realized that tasklet.atomic wasn’t inhibiting involuntary switching between threads!

The GIL

You see, Python can internally stop executing a particular thread at an arbitrary point and start running another.  This is called yielding the GIL, and it happens at the same place in the evaluation loop where the involuntary interruption of a running tasklet would have been performed.  And Stackless’ atomic property didn’t affect this behaviour.  If the Python evaluation loop detects that another thread is runnable and waiting to execute Python code, it may arbitrarily yield the GIL to that thread and then wait to reacquire it.

When using the above lock to synchronize tasklets from two threads, we would suddenly have a race condition, because the atomic context manager would no longer prevent two tasklets from making simultaneous modifications to self._value if those tasklets belonged to different threads.

A Conundrum

So, how to fix this?  An obvious first avenue to explore would be to use one of the threading locks in addition to the atomic flag.  For the sake of argument, let’s illustrate with a much simplified lock:

class SimpleLock(object):
    def __init__(self):
        self._chan = stackless.channel()
        self._chan.preference = 0 # no preference, receiver is made runnable
        self._state = 0

    def acquire(self):
        # opportunistic lock, without explicit handoff.
        with atomic():
            while True:
                if self._state == 0:
                    self._state = 1
                    return
                self._chan.receive()

    def release(self):
        with atomic():
            self._state = 0
            if self._chan.balance:  # non-zero balance: tasklets are blocked on the channel
                self._chan.send(None) # Wake up someone who is waiting

This lock will work nicely with tasklets on the same thread. But when we try to use it for locking between two threads, the atomicity of changing self._state and examining self._chan.balance won’t be maintained.

We can try to fix this with a proper thread lock:

class SimpleLockedLock(object):
    def __init__(self):
        self._chan = stackless.channel()
        self._chan.preference = 0 # no preference, receiver is made runnable
        self._state = 0
        self._lock = threading.Lock()

    def acquire(self):
        # opportunistic lock, without explicit handoff.
        with atomic():
            while True:
                with self._lock:
                    if self._state == 0:
                        self._state = 1
                        return
                self._chan.receive()

    def release(self):
        with atomic():
            with self._lock:
                self._state = 0
                if self._chan.balance:  # non-zero balance: tasklets are blocked on the channel
                    self._chan.send(None) # Wake up someone who is waiting

This version is more cumbersome, of course, but the real problem is that it doesn’t fix the issue. There is still a race condition in acquire(), between releasing self._lock and calling self._chan.receive().

Even if we were to modify self._chan.receive() to take a lock and atomically release it before blocking, and reacquire it before returning, that would be a very unsatisfying solution.

Thankfully, since we needed to go and modify Stackless Python anyway, there was a much simpler solution.

Fixing Atomic

You see, Python is GIL-synchronized.  In the same way that only one tasklet of a particular thread is executing at any given time, regular CPython has the property that only one of the process’s threads is running Python code at a time.  So, at any one time, only one tasklet of one thread is running Python code.

So, if atomic can inhibit involuntary switching between tasklets of the same thread, can’t we just extend it to inhibit involuntary switching between threads as well?  Jessörry Bob, it turns out we can.

This is the fix (ceval.c:1166, python 2.7):

/* Do periodic things.  Doing this every time through
the loop would add too much overhead, so we do it
only every Nth instruction.  We also do it if
``pendingcalls_to_do'' is set, i.e. when an asynchronous
event needs attention (e.g. a signal handler or
async I/O handler); see Py_AddPendingCall() and
Py_MakePendingCalls() above. */
#ifdef STACKLESS
/* don't do periodic things when in atomic mode */
if (--_Py_Ticker < 0 && !tstate->st.current->flags.atomic) {
#else
if (--_Py_Ticker < 0) {
#endif

That’s it! Stackless’ atomic flag has been extended to also stop the involuntary yielding of the GIL from happening.  Of course, voluntary yielding, such as happens when performing blocking system calls, is still possible, much like voluntary switching between tasklets is still possible.  But when a tasklet’s atomic value is non-zero, this guarantees that no unexpected switch to another tasklet, be it on this thread or another, can happen.

This fix, dear reader, was sufficient to make sure that all the locking constructs in stacklesslib worked for all tasklets.

So, what about cPython?

It is worth noting that the locks in stacklesslib.locks can be used to replace the ones in the threading module:  if your program is just a regular threaded Python program, it will run correctly with the locks from stacklesslib.locks replacing the ones from threading.  This includes Semaphore, Lock, RLock, Condition, Barrier, Event and so on.  And all of them are now written in Python-land using regular Python constructs, made to work by the grace of the extended tasklet.atomic property.
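
For illustration, here is a minimal sketch of such a drop-in use, assuming only the acquire()/release() interface shown for the Semaphore above:

import stacklesslib.locks

lock = stacklesslib.locks.Lock()
shared = []

def append_item(item):
    # works for tasklets on the same thread, tasklets on other threads, and plain threaded code
    lock.acquire()
    try:
        shared.append(item)
    finally:
        lock.release()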

Which brings me to ask the question: Why doesn’t cPython have the thread.atomic property?

I have seen countless questions on the python-dev mailing list about whether this or that operation is atomic or not.  Regularly one sees implementation changes to, for example, list and dict operations, adding a new requirement that an operation be atomic with respect to thread switches.

Wouldn’t it be nice if the programmer could just say: “Ah, I’d like to make sure that my update of this container here will be atomic as seen from the other threads.  Let’s just use the thread.atomic flag for that.”
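
Something like the following, purely hypothetical sketch; no thread.atomic or set_atomic exists in cPython today, the names are only meant to show the intended usage:

import thread
from contextlib import contextmanager

@contextmanager
def atomic():
    old = thread.set_atomic(1)     # hypothetical: inhibit involuntary GIL yields
    try:
        yield
    finally:
        thread.set_atomic(old)

counters = {}

def bump(key):
    with atomic():
        # no other thread could run Python code between the read and the write
        counters[key] = counters.get(key, 0) + 1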

For cPython, this would be a perfect lightweight atomic primitive.  It would be very useful for synchronizing access to small blocks of code like this.  For other implementations of Python, those that are truly GIL-free, a thread.atomic property could be implemented with a single global threading.RLock. Provided we add the caveat that thread.atomic must be used by all agents accessing the shared data, we would then have a mutual-exclusion mechanism that would work very cheaply on cPython and also work (via a global lock) on other implementations.

Let’s add thread.atomic to cPython

The reasons I am enthusiastic about seeing an “atomic” flag as part of cPython are twofold:

  1. It would fill the role of a lightweight synchronization primitive that people are requesting where a true Lock is considered too expensive, and where it makes no sense to have a per-instance lock object.
  2. More importantly, it will allow Stackless functionality to be added to cPython as a pure extension module, and it will allow such inter-thread operations to be added to Greenlet-based programs in the same way as we have solved the problem for Stackless Python.
  3. And thirdly?  Because Debbie Harry says so.

 Update, 23.03.2013:

Emulating an “atomic” flag in a truly multithreaded environment with a lock is not as simple as I first thought.  The cool thing about “atomic” is that it still allows the thread to block, e.g. on an IO operation, without affecting other threads.  For an atomic-like lock to work, such a lock would need to be automatically yielded and re-acquired when blocking, bringing us back to a condition-variable-like model.  Since the whole purpose of “atomic” is to be lightweight in a GIL-like environment, forcing it to be backwards compatible with a truly multi-threaded solution is counter-productive.  So, “atomic” as a GIL-only feature is the only thing that makes sense, for now.  Unless I manage to dream up an alternative.

Killing a Stackless bug

What follows is an account of how I found and fixed an insidious bug in Stackless Python which had been there for years.  It’s one of those war stories: perhaps a bit long-winded and technical, and full of exaggerations, as such stories tend to be.

Background

Some weeks ago, because of a problem in the client library we are using, I had to switch the http library we use on the PS3 from non-blocking IO to blocking. Previously, we were issuing all the non-blocking calls, the “select” and the tasklet blocking / scheduling on the main thread. This is similar to how gevent and other such libraries do things. Switching to blocking calls, however, meant doing things on worker threads.

The approach we took was to implement a small pool of Python workers which could execute arbitrary jobs. A new utility function, stacklesslib.util.call_async(), then performed the asynchronous call by dispatching it to a worker thread. The idea of call_async() is to have a different tasklet execute the callable while the caller blocks on a channel. The return value, or error, is then propagated to the originating tasklet using that channel. Stackless channels can be used to communicate between threads too. And synchronizing threads in Stackless is even more convenient than in regular Python, because there is stackless.atomic, which not only prevents involuntary scheduling of tasklets, it also prevents automatic yielding of the GIL (cPython folks, take note!)
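
The idea can be sketched roughly as follows; dispatch_to_worker_thread() is a hypothetical stand-in for the worker-pool dispatch, and this is not the actual stacklesslib implementation:

import sys
import stackless

def call_async(func, *args, **kwargs):
    # run func on a worker thread while the calling tasklet blocks on a channel
    chan = stackless.channel()
    def job():
        try:
            result = (True, func(*args, **kwargs))
        except Exception:
            result = (False, sys.exc_info())
        chan.send(result)
    dispatch_to_worker_thread(job)      # hypothetical worker-pool dispatch
    ok, value = chan.receive()          # the caller's tasklet blocks here
    if ok:
        return value
    raise value[0], value[1], value[2]  # re-raise with the original traceback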

This worked well, and has been running for some time. The drawback to this approach, of course, is that we now need to keep python threads around, consuming stack space. And Python needs a lot of stack.

The problem

The only problem was, that there appeared to be a bug present. One of our developers complained that sometimes, during long downloads, the http download function would return None, rather than the expected string chunk.

Now, this problem was hard to reproduce. It required a specific setup, and geolocation was also an issue. This developer is in California, using servers in London. Hence, there ensued a somewhat prolonged interaction (hindered by badly overlapping time-zones) where I would provide him with modified .py files with instrumentation, and he would provide me with logs. We quickly determined, to my dismay, that apparently, sometimes a string was turning into None while in transit through a channel.send() to a channel.receive(). This was most distressing, particularly because the channel in question was transporting data between threads, and this particular functionality of Stackless has not been as heavily used as the rest.

Tracking it down

So, I suspected a race condition of some sort. But a careful review of the channel code and the scheduling code presented no obvious candidates. Also, the somewhat unpopular GIL was being used throughout, which, if done correctly, ensures that things work as expected.

To cut a long story short, by a lucky coincidence I managed to reproduce a different manifestation of the problem. In some cases, a simple interaction with a local HTTP server would cause this to happen.

When a channel sends data between tasklets, it is temporarily stored on the target tasklet’s “tempval” attribute. When the target wakes up, this is then taken and returned as the result from the “receive()” call. I was able to establish that after sending the data, the target tasklet did indeed hold the correct string value in its “tempval” attribute. I then needed to find out where and why it was disappearing from that place.

By adding instrumentation code to the stackless core, I established that this was happening in the last line of the following snippet:

PyObject *
slp_run_tasklet(void)
{
    PyThreadState *ts = PyThreadState_GET();
    PyObject *retval;

    if ( (ts->st.main == NULL) && initialize_main_and_current()) {
        ts->frame = NULL;
        return NULL;
    }

    TASKLET_CLAIMVAL(ts->st.current, &retval);

By setting a breakpoint, I was able to see that I was in the top-level part of the “continue” bit of the “stack spilling” code.

Stack spilling is a feature of stackless where the stack slicing mechanism is used to recycle a deep callstack. When it detects that the stack has grown beyond a certain limit, it is stored away, and a hard switch is done to the top again, where it continues its downwards crawl. This can help conserve stack address space, particularly on threads where the stack cannot grow dynamically.

So, something wrong with stack spilling, then.  But even so, this was unexpected. Why was stack spilling happening when data was being transmitted across a channel? Stack spilling normally occurs only when nesting regular .py code and other such things.

By setting a breakpoint at the right place, where the stack spilling code was being invoked, I finally arrived at this callstack:

Type Function
PyObject* slp_eval_frame_newstack(PyFrameObject* f, int exc, PyObject* retval)
PyObject* PyEval_EvalFrameEx_slp(PyFrameObject* f, int throwflag, PyObject* retval)
PyObject* slp_frame_dispatch(PyFrameObject* f, PyFrameObject* stopframe, int exc, PyObject* retval)
PyObject* PyEval_EvalCodeEx(PyCodeObject* co, PyObject* globals, PyObject* locals, PyObject** args, int argcount, PyObject** kws, int kwcount, PyObject** defs, int defcount, PyObject* closure)
PyObject* function_call(PyObject* func, PyObject* arg, PyObject* kw)
PyObject* PyObject_Call(PyObject* func, PyObject* arg, PyObject* kw)
PyObject* PyObject_CallFunctionObjArgs(PyObject* callable)
void PyObject_ClearWeakRefs(PyObject* object)
void tasklet_dealloc(PyTaskletObject* t)
void subtype_dealloc(PyObject* self)
int slp_transfer(PyCStackObject** cstprev, PyCStackObject* cst, PyTaskletObject* prev)
PyObject* slp_schedule_task(PyTaskletObject* prev, PyTaskletObject* next, int stackless, int* did_switch)
PyObject* generic_channel_action(PyChannelObject* self, PyObject* arg, int dir, int stackless)
PyObject* impl_channel_receive(PyChannelObject* self)
PyObject* call_function(PyObject*** pp_stack, int oparg)

Notice the “subtype_dealloc”. This callstack indicates that in the channel receive code, after the hard switch back to the target tasklet, a Py_DECREF was causing side effects, which in turn caused stack spilling to occur. The place was this, in slp_transfer():

/* release any objects that needed to wait until after the switch. */
Py_CLEAR(ts->st.del_post_switch);

This is code that does cleanup after tasklet switch, such as releasing the last remaining reference of the previous tasklet.

So, the bug was clear then. It was twofold:

  1. A Py_CLEAR() after switching was not careful enough to store the current tasklet’s “tempval” out of harm’s way of any side effects a Py_DECREF() might cause, and
  2. Stack slicing itself, when it happened, clobbered the current tasklet’s “tempval”

The bug was subsequently fixed by repairing stack spilling and spiriting “tempval” away during the Py_CLEAR() call.

Post mortem

The inter-thread communication turned out to be a red herring. The problem was caused by an unfortunate juxtaposition of channel communication, tasklet deletion, and stack spilling.
But why had we not seen this before? I think it is largely due to the fact that stack spilling only rarely comes into play on regular platforms. On the PS3, we deliberately set the threshold low to conserve memory space. This is also not the first stack-spilling related bug we have seen on the PS3, but the first one for two years. Hopefully it will be the last.

Since this morning, the fix is in the stackless repository at http://hg.python.org/stackless

Evaluating Nagare

Introduction

A little known feature of EVE Online, disabled in the client but very much active in the server, is a web server.  This was added early in the development, before I started on the project back in 2003.  It is the main back-end access point to the game server, used for all kinds of management, status information and debugging.

Back then, Python was much less mature as a web serving platform.  Also, we initially wanted just very rudimentary functionality.  So we wrote our own web server.  It, and the site it presents, are collectively called ESP, for Eve Server Pages, and over the years it has grown in features and content.  The heaviest use it sees is as the dashboard for Game Managers, where everything a GM needs to do is done through HTML pages.  It is also one of the main tools for content authoring, where game designers access a special authoring server.  ESP presents a content authoring interface whose output then gets stored in the backend database.

Recently we have increasingly started to look for alternatives to our homegrown HTTP solution, though.  The main reasons are:

  1. We want to use a standard Python web framework with all the bells and whistles and support that such frameworks offer
  2. We want modern Web 2.0 features without having to write them ourselves
  3. We want something that our web developers can be familiar with already
  4. We want to share expertise between the ESP pages and other web projects run by CCP.  Confluence of synergies and all that.

Stackless Python

EVE Online is based on Stackless Python.  Embedded into the game engine is a locally patched version of Stackless Python 2.7.  We have been using Stackless since the very beginning; the existence of Stackless is the reason we chose Python as the scripting solution for our game.

Systems like the web server have always depended heavily on the use of Stackless.  Using it, we have been able to provide a blocking programming interface to IO which uses asynchronous IO behind the scenes.  This allows for a simple and intuitive programming interface with good performance and excellent scalability.  For some years now we have used the in-house-developed StacklessIO solution, which provides an alternative implementation of the socket module.  Because it also provides emulation of the threading module, using tasklets instead of threads, many off-the-shelf components simply work out of the box.  As an example, the standard library’s xmlrpc module, itself based on the socketserver module, just works without modification.
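
The general trick can be sketched like this; the stacklessio module name is purely illustrative here, since the actual StacklessIO package is CCP-internal:

import sys
import stacklessio.socket as tasklet_socket   # hypothetical tasklet-blocking socket module

# make every subsequent "import socket" pick up the replacement implementation
sys.modules["socket"] = tasklet_socket

import xmlrpclib   # standard library code now does its IO through the replacement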

Of course, the use of Stackless python is not limited to IO.  A lot of the complicated game logic takes advantage of its programming model, having numerous systems that run as their own tasklets to do different things.  As with IO, this allows for a more intuitive programming experience.  We can write code in an imperative manner where more traditional solutions would have to rely on event driven approaches, state machines and other patterns that are better left for computers than humans to understand.  This makes CCP very much a Stackless Python shop and we are likely to stay that way for quite a bit.

Nagare

It was therefore with a great deal of interest that we noticed the announcement of Nagare on the Stackless mailing list a few years ago.

Nagare promises a different approach to web development.  Instead of web applications that are in effect giant event handlers (responding to stateless HTTP requests), it allows the user to write web applications imperatively, much as one would write desktop applications.  Their web site has a very interesting demo portal with a number of applications demonstrating their paradigm, complete with running apps and source code.

This resonates well with me.  I am of the opinion that the programmer is the slowest part of software development.  Anything a development environment can do to let a programmer express himself in a familiar, straightforward manner is a net win.  This is why we use tasklet-blocking IO instead of writing a game based on something as befuddling as Twisted.  And for this reason I thought it worthwhile to see if a radical, forward-thinking approach to web development might be right for CCP.

Tasklet pickling

Nagare achieves its magic by using a little-known Stackless feature called tasklet pickling.  A tasklet that isn’t running can have its execution state pickled and stored as binary data.  The pickle contains the execution frames, local variables and other such things.

Stackless Python contains some extensions to allow pickling of various objects whose pickling isn’t supported in regular Python.  These include frames, modules and generators among other things.
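
As a rough illustration (a minimal sketch, not Nagare code), a tasklet that has paused itself can be pickled and later resumed:

import pickle
import stackless

def job():
    x = "some local state"
    stackless.schedule_remove()   # pause here: remove ourselves from the scheduler
    print x                       # resumes later with the local state intact

t = stackless.tasklet(job)()
t.run()                           # job runs until it pauses itself
data = pickle.dumps(t)            # the paused execution state: frames, locals and all
t2 = pickle.loads(data)
t2.insert()                       # schedule the revived tasklet
stackless.run()                   # it resumes and prints "some local state"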

When a Nagare web application reaches the point where interaction with the user is required, its state is pickled and stored with the Nagare server, and an HTTP response is sent back to the client.  When a subsequent request arrives, the tasklet is unpickled and its execution continues.  From the programmer’s point of view, a function call simply took a long time.  Behind the scenes, Stackless and Nagare were working their magic, making the inherently stateless HTTP protocol seem like smooth program flow to the application.

IO Model

In other ways, Nagare is a very traditional Python web framework.  It is based around WSGI and so uses whatever threading and IO model is provided to it by a WSGI server.

Application Model

Unlike some smaller frameworks, Nagare is designed and distributed as a complete web server.  One typically installs it as the single application of a virtualenv root, and then configures and runs it using a script called nagare_admin.  This is very convenient for someone simply running a web server, but it becomes less obvious how to use it as a component in a larger application.

Our tests

What we were interested in doing was to see if Nagare would work within the application that is EVE.  To do this we would have to:

  1. Extract Nagare and its dependencies as a set of packages that can be used with the importing framework that EVE uses.  As I have blogged about before, Python by default assumes a central package directory and doesn’t lend itself well to isolation by embedded applications.  We have done that with EVE, however, and would want Nagare to be an “install free” part of our application.
  2. Set up the necessary WSGI infrastructure within EVE for Nagare to run on
  3. Configure and run Nagare programmatically rather than by using the top level application scripts provided by the Nagare distribution.

I specifically didn’t intend to evaluate Nagare from the web developer’s point of view.  This is because I am not one of those and find the whole domain rather alien.  This would be a technical evaluation from an embedding point of view.

Extracting

My first attempt at setting up Nagare was to fetch the source from its repository.

 svn co svn://www.nagare.org/trunk/nagare

I then intended to fetch any dependencies in a similar manual manner.  However, I soon found that to be a long and painstaking process.  In the end I gave up and used the recommended approach:  Set up a virtualenv and use:

<NAGARE_HOME>\Scripts\easy_install.exe nagare

This turned out to install a lot of stuff.  This is the basic install and a total of 13 packages were installed in addition to Nagare itself, a total of almost 12Mb. The full install of Nagare increases this to a total of 22 packages and 22Mb.

The original plan was to take this and put it in a place where EVE imports its own files from.  But because of the number of files in question, we ended up keeping them in place and hacking EVE to import from <NAGARE_HOME>\Lib\site-packages.

Setting up WSGI

Nagare requires Paste and can make use of the Paste WSGI server.   Fortunately, EVE already has a built-in WSGI service, based on Paste.  It uses the standard socket module, monkeypatched to use StacklessIO tasklet-blocking sockets.  So this part is easy.

Fitting it together

This is where it finally got tricky.  Normally, Nagare is a stand-alone application, managed with config files.  A master script, nagare_admin, then reads those config files and assembles the various Nagare components into a working application.  Unfortunately, documentation about how to do this programmatically was lacking.

However, the good people of Nagare were able to help me out with the steps I needed to take, thus freeing me from having to reverse-engineer a lot of configuration code.  What I needed to do was to create a Publisher, a Session manager and the Nagare application I need to run.  For testing purposes I just wanted to run the admin app that comes with Nagare.

After configuring EVE’s importer to find Nagare and its dependencies in its default install location, the code I ended up with was this:

#Instantiate a Publisher (WSGI application)
from nagare.publishers.common import Publisher
p = Publisher()

#Register the publisher as a WSGI app with our WSGI server
sm.StartService('WSGIService').StartServer(p.urls, 8080) #our WSGI server

#instantiate a simple session manager
from nagare.sessions.memory_sessions import SessionsWithMemoryStates
sm = SessionsWithMemoryStates()

#import the app and configure it
from nagare.admin import serve, util, admin_app
app = admin_app.app
app.set_sessions_manager(sm)

#register the app with the publisher
p.register_application('admin', 'admin', app, app)

#register static resources
def lookup(r, path=r"D:\nagare\Lib\site-packages\nagare-0.3.0-py2.5.egg\static"):
    return serve.get_file_from_root(path, r)
p.register_static('nagare', lookup)

This gave the expected result: browsing to port 8080 brought up the Nagare admin application.

So, success!  EVE was serving a Nagare app from its backend.

Conclusion

These tests showed that Nagare does indeed work as a backend webserver for EVE.  In particular, the architecture of StacklessIO sockets allows most socket-based applications to just work out of the box.

Also, because Nagare is a Python package, it is inherently programmable.  So it is possible to configure it to be part of a larger application, rather than the stand-alone application it is primarily designed to be.  Using Nagare as a library rather than an application, however, wasn’t well documented, and I had to get some help from its friendly developers and read the source code to get it to work.

On the other hand, Nagare is a large application.  Not only is Nagare itself a substantial package, it also has a lot of external dependencies.  For an embedded application of Python, such as a computer game, this is a serious drawback.  We like to be very selective about what modules we make available within EVE.  The reasons range from the purely practical (space constraints, versioning hell, build-management complexity) to externally driven issues like security and licensing.

It is for this reason that we ultimately decided that Nagare wasn’t right for us as part of the EVE backend web interface.  The backend web interface started out as a minimal HTTP server and we want to keep it as slim as possible.  We are currently in the process of picking and choosing some standard WSGI components and writing special case code for our own use.  This does, however, mean that we miss out on the cool web programming paradigm that is Nagare within EVE.

_ssl modifications in 2.7

I recently pushed into action a plan that had been brewing for a long while: to make the SSL module in the standard library work with our StacklessIO socket library.

The problem with _ssl is that internally it just uses a native socket BIO.  It doesn’t allow the application to control the details of the communication.  This is unsuitable for anyone using a non-standard socket implementation.

I admit that I didn’t look around; otherwise I probably would have stumbled across pyOpenSSL.  I simply assumed that the standard library’s ssl was the only option, and that was clearly unsuitable since it does its own socket IO.

So, anyway, what I did was to add two different features to _ssl.sslwrap():

  1. Instead of only allowing a socket to be wrapped, one can pass in a Python object. This is assumed to be a PyBIO object: an object that implements read() and write() methods. The SSLContext object thus created will internally create a BIO pair object, and any “read” or “write” calls on it may invoke corresponding Python callbacks on the PyBIO object to satisfy the SSL context’s need to send or receive data (a sketch of such an object follows after this list).
    This feature can be used to send the data over whatever IO channel one chooses to implement in Python.
  2. Additionally, the user may wrap None. In this case, a “naked” SSLContext will be created, with a BIO pair. This is then suitable for use by any other C code that knows about OpenSSL, the layout of PySSLObject and how to pump data back and forth out of a BIO pair. To facilitate this, I export the _ssl API in the same way that _socket does (to _ssl), using a new function PySSLModule_ImportModuleAndAPI(void).
    The purpose of this feature is to be able to use standard Python code to manage certificates and set up the SSL context, then hand this context off to a custom transport layer written in C/C++.
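
Here is a rough sketch of what such a PyBIO object could look like; the read()/write() method names are from the description above, while the transport object and its methods are purely hypothetical:

class ChannelBIO(object):
    """A PyBIO-style object: moves the SSL engine's raw bytes over a custom transport."""
    def __init__(self, transport):
        self.transport = transport          # hypothetical custom transport object
    def write(self, data):
        # ciphertext produced by the SSL engine, to be sent to the peer
        return self.transport.send_bytes(data)
    def read(self, size):
        # ciphertext received from the peer, to be fed into the SSL engine
        return self.transport.recv_bytes(size)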

This is now complete. The changes were not large. I have written extensions to the test_ssl.py unit tests that wrap a regular socket in a PyBIO, and the entire thing works.

We are now using this in EVE to add SSL support to the backend webserver. We need to pipe data through the _ssl module and back into our code, where Stackless stack-swapping magic happens to emulate blocking IO.

The question that remains is this: I’d like to contribute this stuff back, but since Python 2.x is feature-frozen, there is no good place for it. Maybe I could push this into 3.x, but I haven’t looked. Maybe it does ssl differently.

So it remains, like many other goodies, part of the custom branch of stackless 2.7 that CCP uses. For the time being.

Winsock timeout / closesocket()

I was working on StacklessIO for Stackless Python. When running the test_socket unittest, I came across a single failure: testInsideTimeout would fail, with the server receiving an ECONNRESET when trying to write its “done!” string.

Normally this shouldn’t happen. Even though the client has closed the connection, an RST packet is only sent in response to the send() call, so at the time of calling send() all appears well and the call should succeed.

Investigating this further, it turned out to be due to the way I’m implementing timeout.

StacklessIO uses Winsock overlapped IO.  A request is issued, and when it finishes, a thread waiting on an IO completion port wakes up and causes the waiting tasklet to be resumed.  To time out, for example, a recv() call, I schedule a Windows timer as well.  If it fires before the request is done, the tasklet is woken up with a timeout error.  There appears to be no way in the API to cancel a pending IO request, so at this point the IO is still pending.

Anyway, this is all well and good, but where does the RST come from, then?  Well, when the timeout occurs, the tasklet wakes up and the socket is closed.  And calling closesocket() on a connection with pending IO has at least two effects, only one of which is documented:

  1. All pending IO is canceled with the WSA_OPERATION_ABORTED error.
  2. An RST is sent to the remote party

I’ve never seen the latter behaviour documented.  But apparently then, calling closesocket() when IO is pending is equivalent to an abortive close().

I’m not sure if this is significant.  If a socket call times out, the usual recommendation is to close the connection anyway since the connection may be in an undefined state due to race conditions.  But it is a bit annoying all the same.

selectmodule on PS3

As part of a game that we are developing on the Sony PS3, we are porting Stackless Python 2.7 to run on it.  What would be more natural?

Python is to do what it does best: act as a supervising puppetmaster, running high-level logic that you don’t want to code in something as primitive as C++.

While there are many issues that we have come across and addressed (most are minor), I’m just going to mention a particular issue with sockets.

An important role of Python will be to manage network connections using sockets.  We are using the Stacklesssocket module with Stackless Python to do this.  For this to work, we need not only the socketmodule but also the selectmodule.

The PS3 networking API is mostly BSD compliant, with a few quirks.  Porting these modules to use it is mostly a straightforward affair of providing #defines for some API calls that have different names, dealing with APIs that are missing (returning NotImplementedError and such) and mapping error values to something sensible.  For the most part, it behaves exactly as you would expect.

We ran into one error, though, prompting me to write special handling code for select() and poll().  The Sony implementation of these functions not only returns an error if they themselves fail (which would be serious), but they also indicate an error return if a socket error is pending on one of the sockets.  For example, if a socket receives an ECONNRESET while you are waiting for data to arrive, select() will return with an ECONNRESET error indicator.  Not what you would expect.

The workaround is to simply filter the error values from select() and poll() and ignore the unexpected socket errors.  Rather, such a return must be considered a successful select()/poll(); for the latter function, ‘n’, the number of valid file descriptors, which has been set to -1, must be recreated by walking the list of file descriptors.