After a long hiatus, the Cosmic Percolator is back in action. Now it is time to rant about all things Python, I think. Let’s start with this, which came out of work I did last year.
Stackless has had an “atomic” feature for a long time. In this post I am going to explain its purpose and how I recently extended it to make working with OS threads easier.
Scheduling
In Stackless Python, scheduling is cooperative. This means that a tasklet is normally uninterrupted until it explicitly does something that would cause another one to run, like sending a message over a channel. This allows one to write logic in Stackless without worrying too much about synchronization.
However, there is an important exception to this: it is possible to run Stackless tasklets through the watchdog, and this will interrupt a running tasklet if it exceeds a pre-determined number of executed opcodes:
    while True:
        interrupted = stackless.run(100)
        if interrupted:
            print interrupted, "has been running quite a bit!"
            interrupted.insert()
        else:
            break  # Ok, nothing runnable anymore
This code may cause a tasklet to be interrupted at an arbitrary point (actually during a tick interval, the same point that yields the GIL) and cause a switch to the main tasklet.
Of course, not all code uses this execution mode, but nevertheless, it has always been considered a good idea to be aware of this. For this reason, an atomic mode has been supported which inhibits this involuntary switching in sensitive areas:
    oldvalue = stackless.getcurrent().set_atomic(1)
    try:
        myglobalvariable1 += 1
        myglobalvariable2 += 2
    finally:
        stackless.getcurrent().set_atomic(oldvalue)
The above is then optionally wrapped in a context manager for readability:
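    import contextlib
    import stackless

    @contextlib.contextmanager
    def atomic():
        # Set the flag and remember the previous value so that nesting works.
        oldv = stackless.getcurrent().set_atomic(1)
        try:
            yield None
        finally:
            # Restore the previous value rather than blindly clearing the flag.
            stackless.getcurrent().set_atomic(oldv)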
The atomic state is a property of each tasklet, so even when voluntary switching is performed while a non-zero atomic state is in effect, it has no effect on other tasklets. Its only effect is to inhibit involuntary switching of the tasklet on which it is set.
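The save-and-restore pattern is what makes the flag safe to nest. Here is a minimal sketch of that behaviour that runs without Stackless: `FakeTasklet` is a hypothetical stand-in that models nothing but the flag, and this `atomic(tasklet)` takes the tasklet as an argument purely for illustration.

```python
import contextlib


class FakeTasklet(object):
    """Hypothetical stand-in for a tasklet; it only models the atomic flag."""

    def __init__(self):
        self.atomic = 0

    def set_atomic(self, value):
        # Return the previous value so the caller can restore it later.
        old, self.atomic = self.atomic, value
        return old


@contextlib.contextmanager
def atomic(tasklet):
    old = tasklet.set_atomic(1)
    try:
        yield
    finally:
        tasklet.set_atomic(old)


a, b = FakeTasklet(), FakeTasklet()
states = []
with atomic(a):
    with atomic(a):
        states.append((a.atomic, b.atomic))  # inside the inner block
    states.append((a.atomic, b.atomic))      # inner block left, outer still active
states.append((a.atomic, b.atomic))          # everything restored
```

Leaving the inner block does not clear the flag, because the value it restores is the 1 set by the outer block, and tasklet `b` is never affected.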
A Concrete Example
To better illustrate its use, let’s take a look at the implementation of the Semaphore from stacklesslib (stacklesslib.locks.Semaphore):
    class Semaphore(LockMixin):
        def __init__(self, value=1):
            if value < 0:
                raise ValueError
            self._value = value
            self._chan = stackless.channel()
            set_channel_pref(self._chan)

        def acquire(self, blocking=True, timeout=None):
            with atomic():
                # Low contention logic: There is no explicit handoff to a target,
                # rather, each tasklet gets its own chance at acquiring the semaphore.
                got_it = self._try_acquire()
                if got_it or not blocking:
                    return got_it

                wait_until = None
                while True:
                    if timeout is not None:
                        # Adjust time. We may have multiple wakeups since we are a
                        # low-contention lock.
                        if wait_until is None:
                            wait_until = elapsed_time() + timeout
                        else:
                            timeout = wait_until - elapsed_time()
                            if timeout < 0:
                                return False
                    try:
                        lock_channel_wait(self._chan, timeout)
                    except:
                        self._safe_pump()
                        raise
                    if self._try_acquire():
                        return True

        def _try_acquire(self):
            if self._value > 0:
                self._value -= 1
                return True
            return False
This code illustrates how the atomic state is incremented (via a context manager) and kept non-zero while we are doing potentially sensitive things, in this case, doing logic based on self._value. Since this is code that is used for implementing a Semaphore, which itself forms the basis of other stacklesslib.locks objects such as CriticalSection and Condition objects, this is the only way we have to ensure atomicity.
Threads
It is worth noting that using the atomic property has largely been confined to such library code as the above. Most stackless programs indeed do not run the watchdog in interruptible mode, or they use the so-called soft-interrupt mode which breaks the scheduler only at the aforementioned voluntary switch points.
However, in the last two years or so, I have been increasingly using Stackless Python in conjunction with OS threads. All the Stackless constructs, such as channels and tasklets, work with threads, with the caveat that synchronized rendezvous isn’t possible between tasklets of different threads. A channel.send() where the recipient is a tasklet on a different thread than the sender will always cause the target to become runnable in that thread, rather than cause immediate switching.
Using threads has many benefits. For one, it simplifies certain IO operations. Handing a job to a tasklet on a different thread won’t block the main thread. And using the usual tasklet communication channels to talk uniformly to all tasklets, whether they belong to this thread or another, makes the architecture uniform and elegant.
The locking constructs in stacklesslib also all make use of non-immediate scheduling. While we use the stackless.channel object to wait, we make no assumptions about immediate execution when a target is woken up. This makes them usable for synchronization between tasklets of different threads.
Or, this is what I thought, until I started getting strange errors and realized that tasklet.atomic wasn’t inhibiting involuntary switching between threads!
The GIL
You see, Python internally can arbitrarily stop executing a particular thread and start running another. This is called yielding the GIL, and it happens at the same point in the evaluation loop where the involuntary breaking of a running tasklet would be performed. And Stackless’ atomic property didn’t affect this behaviour. If the Python evaluation loop detects that another thread is runnable and waiting to execute Python code, it may arbitrarily yield the GIL to that thread and wait to reacquire it.
When using the above lock to synchronize tasklets from two threads, we would suddenly have a race condition, because the atomic context manager would no longer prevent two tasklets from making simultaneous modifications to self._value, if those tasklets belonged to different threads.
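To make the race concrete, here is a small deterministic sketch. It does not use Stackless or real threads: each generator stands in for a tasklet on its own thread, and the `yield` marks a point where an involuntary GIL switch could land between the check and the decrement. The names `Sem`, `try_acquire`, `run_interleaved` and `run_serially` are made up for this illustration.

```python
class Sem(object):
    def __init__(self, value):
        self.value = value


def try_acquire(sem, results, name):
    """Check-then-decrement, with a yield marking a possible thread switch."""
    can_take = sem.value > 0
    yield  # an involuntary switch to the other thread could happen here
    if can_take:
        sem.value -= 1
    results[name] = can_take


def run_interleaved(steps):
    """Advance every generator one step at a time, round-robin."""
    pending = list(steps)
    while pending:
        for gen in list(pending):
            try:
                next(gen)
            except StopIteration:
                pending.remove(gen)


def run_serially(steps):
    """Run each generator to completion before starting the next one."""
    for gen in steps:
        for _ in gen:
            pass


racy_sem, racy_results = Sem(1), {}
run_interleaved([try_acquire(racy_sem, racy_results, "A"),
                 try_acquire(racy_sem, racy_results, "B")])

safe_sem, safe_results = Sem(1), {}
run_serially([try_acquire(safe_sem, safe_results, "A"),
              try_acquire(safe_sem, safe_results, "B")])
```

In the interleaved run both "tasklets" believe they acquired a semaphore of size one, and its value ends up at -1. In the serial run, which is what an effective atomic section would guarantee, only the first one succeeds.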
A Conundrum
So, how to fix this? An obvious first avenue to explore would be to use one of the threading locks in addition to the atomic flag. For the sake of argument, let’s illustrate with a much simplified lock:
    class SimpleLock(object):
        def __init__(self):
            self._chan = stackless.channel()
            self._chan.preference = 0  # no preference, receiver is made runnable
            self._state = 0

        def acquire(self):
            # opportunistic lock, without explicit handoff.
            with atomic():
                while True:
                    if self._state == 0:
                        self._state = 1
                        return
                    self._chan.receive()

        def release(self):
            with atomic():
                self._state = 0
                if self._chan.balance:  # non-zero when tasklets are waiting
                    self._chan.send(None)  # Wake up someone who is waiting

This lock will work nicely with tasklets on the same thread. But when we try to use it for locking between two threads, the atomicity of changing self._state and examining self._chan.balance won’t be maintained.
We can try to fix this with a proper thread lock:
    class SimpleLockedLock(object):
        def __init__(self):
            self._chan = stackless.channel()
            self._chan.preference = 0  # no preference, receiver is made runnable
            self._state = 0
            self._lock = threading.Lock()

        def acquire(self):
            # opportunistic lock, without explicit handoff.
            with atomic():
                while True:
                    with self._lock:
                        if self._state == 0:
                            self._state = 1
                            return
                    self._chan.receive()

        def release(self):
            with atomic():
                with self._lock:
                    self._state = 0
                    if self._chan.balance:  # non-zero when tasklets are waiting
                        self._chan.send(None)  # Wake up someone who is waiting

This version is more cumbersome, of course, but the real problem is that it doesn’t fix the issue. There is still a race condition in acquire(), between releasing self._lock and calling self._chan.receive().
Even if we were to modify self._chan.receive() to take a lock, atomically release it before blocking, and reacquire it before returning, that would be a very unsatisfying solution.
Thankfully, since we needed to go and modify Stackless Python anyway, there was a much simpler solution.
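The race in acquire() is a classic lost wakeup. The following sketch replays one bad interleaving step by step, without Stackless or real threads. `LockModel` and the two step functions are hypothetical; the `yield` in `acquire_steps` marks the gap between releasing the thread lock and blocking on the channel.

```python
class LockModel(object):
    """Hypothetical model of the shared state of the lock above."""

    def __init__(self):
        self.state = 1    # 1: the lock is currently held by someone
        self.waiting = 0  # tasklets blocked in receive()
        self.log = []


def acquire_steps(m):
    # Step 1: under the thread lock, observe that the lock is taken.
    taken = m.state != 0
    m.log.append("acquirer sees state=%d" % m.state)
    yield  # thread lock released here; the other thread may run now
    # Step 2: block on the channel, acting on a possibly stale observation.
    if taken:
        m.waiting += 1
        m.log.append("acquirer blocks in receive()")


def release_steps(m):
    # Runs entirely under the thread lock.
    m.state = 0
    if m.waiting:
        m.waiting -= 1
        m.log.append("releaser wakes a waiter")
    else:
        m.log.append("releaser finds nobody waiting")
    yield


m = LockModel()
acquirer, releaser = acquire_steps(m), release_steps(m)
next(acquirer)      # acquirer checks the state, then gets switched out
next(releaser)      # releaser runs in the gap
for _ in acquirer:  # acquirer resumes and blocks
    pass
```

At the end `m.state` is 0 while `m.waiting` is 1: the lock is free, yet a tasklet is blocked on it and nobody is left to wake it up.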
Fixing Atomic
You see, Python is GIL synchronized. In the same way that only one tasklet of a particular thread is executing at any one time, regular cPython has the GIL property that only one of the process’s threads is running Python code at a time. So, at any one time, only one tasklet of one thread is running Python code.
So, if atomic can inhibit involuntary switching between tasklets of the same thread, can’t we just extend it to inhibit involuntary switching between threads as well? Jessörry Bob, it turns out we can.
This is the fix (ceval.c:1166, python 2.7):
            /* Do periodic things.  Doing this every time through
               the loop would add too much overhead, so we do it
               only every Nth instruction.  We also do it if
               ``pendingcalls_to_do'' is set, i.e. when an asynchronous
               event needs attention (e.g. a signal handler or
               async I/O handler); see Py_AddPendingCall() and
               Py_MakePendingCalls() above. */
    #ifdef STACKLESS
            /* don't do periodic things when in atomic mode */
            if (--_Py_Ticker < 0 && !tstate->st.current->flags.atomic) {
    #else
            if (--_Py_Ticker < 0) {
    #endif
That’s it! Stackless’ atomic flag has been extended to also stop the involuntary yielding of the GIL from happening. Of course voluntary yielding, such as that which is done when performing blocking system calls, is still possible, much like voluntary switching between tasklets is also possible. But when the tasklet’s atomic value is non-zero, this guarantees that no unexpected switch to another tasklet, be it on this thread or another, happens.
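One detail of the condition is easy to miss: `--_Py_Ticker` is evaluated before the atomic test, so the ticker keeps counting down while the flag is set and the periodic work fires as soon as the flag clears. The toy model below illustrates this. It is a simplification, not the real evaluation loop: `CHECK_INTERVAL` and `run` are made-up names, and the reset of the ticker mirrors what the periodic block does in Python 2.7.

```python
CHECK_INTERVAL = 3  # hypothetical value, kept small so the effect is visible


def run(atomic_flags):
    """Return the instruction indices at which the periodic work would run.

    atomic_flags[i] says whether the current tasklet is atomic while
    instruction i executes.
    """
    ticker = CHECK_INTERVAL
    fired = []
    for i, is_atomic in enumerate(atomic_flags):
        ticker -= 1                  # --_Py_Ticker always happens
        if ticker < 0 and not is_atomic:
            ticker = CHECK_INTERVAL  # the periodic block resets the ticker
            fired.append(i)          # GIL yield / watchdog interrupt point
    return fired


plain = run([False] * 10)
with_atomic = run([False] * 2 + [True] * 4 + [False] * 4)
```

Without the flag, the periodic work runs at instructions 3 and 7. With instructions 2 to 5 marked atomic, the check due at instruction 3 is held back and runs at instruction 6, the first non-atomic instruction. So the switch is deferred, not dropped.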
This fix, dear reader, was sufficient to make sure that all the locking constructs in stacklesslib worked for all tasklets.
So, what about cPython?
It is worth noting that the locks in stacklesslib.locks can be used to replace the locks in the threading module: if your program is just a regular threaded Python program, then it will run correctly with the locks from stacklesslib.locks replacing the ones from threading. This includes Semaphore, Lock, RLock, Condition, Barrier, Event and so on, and all of them are now written in Python-land using regular Python constructs and made to work by the grace of the extended tasklet.atomic property.
Which brings me to ask the question: Why doesn’t cPython have the thread.atomic property?
I have seen countless questions on the python-dev mailing lists about whether this or that operation is atomic or not. Regularly one sees implementation changes to, for example, list and dict operations to add a new requirement that an operation be atomic with respect to thread switches.
Wouldn’t it be nice if the programmer himself could just say: “Ah, I’d like to make sure that my updating this container here will be atomic when seen from the other threads. Let’s just use the thread.atomic flag for that.”
For cPython, this would be a perfect light-weight atomic primitive. It would be very useful to synchronize access to small blocks of code like this. For other implementations of Python, those that are truly GIL free, a thread.atomic property could be implemented with a single system-global threading.RLock. Provided that we add the caveat to thread.atomic that it should be used by all agents accessing that data, we would now have a system for mutual exclusion that would work very cheaply on cPython and also work (via a global lock) on other implementations.
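As a rough sketch of the global-lock fallback, assuming nothing beyond the standard library: the `atomic()` context manager and the `_atomic_lock` name below are hypothetical, since no such API exists in cPython today. Note that, unlike the flag, this lock stays held if the thread blocks inside the section, which is the limitation discussed in the update at the end of this post.

```python
import contextlib
import threading

# Hypothetical emulation: one process-wide re-entrant lock stands in for
# the "atomic" flag on an interpreter that has no GIL.
_atomic_lock = threading.RLock()


@contextlib.contextmanager
def atomic():
    with _atomic_lock:
        yield


counter = {"value": 0}


def worker(n):
    for _ in range(n):
        with atomic():
            with atomic():  # nesting works because the lock is re-entrant
                counter["value"] += 1


threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every piece of code touching `counter` has to go through `atomic()` for this to hold, which is the caveat mentioned above.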
Let’s add thread.atomic to cPython
The reasons I am enthusiastic about seeing an “atomic” flag as part of cPython are twofold:
- It would fill the role of a lightweight synchronization primitive that people are requesting where a true Lock is considered too expensive, and where it makes no sense to have a per-instance lock object.
- More importantly, it will allow Stackless functionality to be added to cPython as a pure extension module, and it will allow such inter-thread operations to be added to Greenlet-based programs in the same way as we have solved the problem for Stackless Python.
- And thirdly? Because Debbie Harry says so:
Update, 23.03.2013:
Emulating an “atomic” flag in a truly multithreaded environment with a lock is not as simple as I first thought. The cool thing about “atomic” is that it still allows the thread to block, e.g. on an IO operation, without affecting other threads. For an atomic-like lock to work, such a lock would need to be automatically yielded and re-acquired when blocking, bringing us back to a condition-variable-like model. Since the whole purpose of “atomic” is to be lightweight in a GIL-like environment, forcing it to be backwards compatible with a truly multi-threaded solution is counter-productive. So, “atomic” as a GIL-only feature is the only thing that makes sense, for now. Unless I manage to dream up an alternative.
Armin Rigo suggested much the same thing as a hook for C extensions to play with when he first started working on his STM experiment, so it sounds good to me. (Armin ultimately rejected his own patch, http://bugs.python.org/issue12850, as not sufficiently interesting, but I think a cheap “threading.atomic()” context manager is interesting enough to be worthwhile, particularly since it allows certain data structure manipulation code to be made safe across multiple interpreter implementations)
Yes, I should have mentioned that I had a conversation with Armin about this when he originally mentioned this on the PyPy blog. His patch was more ambitious, and involved not releasing the GIL even when explicitly doing so around blocking IO calls or other long duration calls.
The point of my proposal is to limit ourselves to merely stopping the involuntary release of the GIL, while leaving explicit GIL release alone. This does not introduce any new deadlock cases.