Zombieframes. A gratuitous optimization?

Examing a recent crash case, I stumbled across this code in frameobject.c:

PyFrameObject *
PyFrame_New(PyThreadState *tstate, PyCodeObject *code, PyObject *globals,
PyObject *locals)
...
if (code->co_zombieframe != NULL) {
f = code->co_zombieframe;
code->co_zombieframe = NULL;
_Py_NewReference((PyObject *)f);
assert(f->f_code == code);
}

Intrigued by the name, I examined the header where it is defined, code.h:

...
void *co_zombieframe; /* for optimization only (see frameobject.c) */
...
} PyCodeObject;

It turns out that for every PyCodeObject object that has been executed, a PyFrameObject of a suitable size is cached and kept with the code object. Now, caching is fine and good, but this cache is unbounded. Every code object has the potential to hang on to a frame, which may then never be released.
Further, there is a separate freelist cache for PyFrameObjects already, in case a frame is not found on the code object:

if (free_list == NULL) {
f = PyObject_GC_NewVar(PyFrameObject, &PyFrame_Type,
extras);
if (f == NULL) {
Py_DECREF(builtins);
return NULL;
}
}
else {
assert(numfree > 0);
--numfree;
f = free_list;
free_list = free_list->f_back;
...

Always concious about memory these days, I tried disabling this in version 3.3 and running the pybench test. I was not able to see any conclusive difference in execution speed.

Update:

Disabling the zombieframe on the PS3 shaved off some 50k on startup.  Not the jackpot, but still, small things add up.

——————————————————————————-
PYBENCH 2.1
——————————————————————————-
* using CPython 3.3.0a3+ (default, May 23 2012, 20:02:34) [MSC v.1600 64 bit (AMD64)]
* disabled garbage collection
* system check interval set to maximum: 2147483647
* using timer: time.perf_counter
* timer: resolution=2.9680909446810176e-07, implementation=QueryPerformanceCounter()

——————————————————————————-
Benchmark: nozombie
——————————————————————————-

Rounds: 10
Warp: 10
Timer: time.perf_counter

Machine Details:
Platform ID: Windows-7-6.1.7601-SP1
Processor: Intel64 Family 6 Model 26 Stepping 5, GenuineIntel

Python:
Implementation: CPython
Executable: D:pydevhgcpython2pcbuildamd64python.exe
Version: 3.3.0a3+
Compiler: MSC v.1600 64 bit (AMD64)
Bits: 64bit
Build: May 23 2012 20:02:34 (#default)
Unicode: UCS4

——————————————————————————-
Comparing with: zombie
——————————————————————————-

Rounds: 10
Warp: 10
Timer: time.perf_counter

Machine Details:
Platform ID: Windows-7-6.1.7601-SP1
Processor: Intel64 Family 6 Model 26 Stepping 5, GenuineIntel

Python:
Implementation: CPython
Executable: D:pydevhgcpython2pcbuildamd64python.exe
Version: 3.3.0a3+
Compiler: MSC v.1600 64 bit (AMD64)
Bits: 64bit
Build: May 23 2012 20:00:42 (#default)
Unicode: UCS4

Test minimum run-time average run-time
this other diff this other diff
——————————————————————————-
BuiltinFunctionCalls: 51ms 52ms -3.3% 52ms 53ms -2.0%
BuiltinMethodLookup: 33ms 33ms +0.0% 34ms 34ms +0.8%
CompareFloats: 50ms 50ms +0.1% 50ms 50ms +0.4%
CompareFloatsIntegers: 99ms 98ms +0.8% 99ms 99ms +0.6%
CompareIntegers: 77ms 77ms -0.5% 77ms 77ms -0.3%
CompareInternedStrings: 60ms 60ms +0.0% 61ms 61ms -0.1%
CompareLongs: 46ms 45ms +1.5% 46ms 45ms +1.2%
CompareStrings: 61ms 59ms +3.6% 61ms 59ms +3.6%
ComplexPythonFunctionCalls: 60ms 58ms +3.3% 60ms 58ms +3.2%
ConcatStrings: 48ms 47ms +2.4% 48ms 47ms +2.1%
CreateInstances: 58ms 57ms +1.3% 59ms 58ms +1.3%
CreateNewInstances: 43ms 43ms +1.1% 44ms 44ms +1.1%
CreateStringsWithConcat: 79ms 79ms -0.3% 79ms 79ms -0.1%
DictCreation: 71ms 71ms +0.4% 72ms 72ms +1.0%
DictWithFloatKeys: 72ms 70ms +2.1% 72ms 71ms +1.8%
DictWithIntegerKeys: 46ms 46ms +0.7% 46ms 46ms +0.4%
DictWithStringKeys: 41ms 41ms +0.0% 41ms 41ms -0.1%
ForLoops: 35ms 37ms -4.0% 35ms 37ms -4.0%
IfThenElse: 64ms 64ms -0.1% 64ms 64ms -0.4%
ListSlicing: 49ms 50ms -1.0% 53ms 53ms -0.8%
NestedForLoops: 54ms 51ms +6.7% 55ms 51ms +6.7%
NestedListComprehensions: 54ms 54ms -0.7% 54ms 55ms -2.2%
NormalClassAttribute: 94ms 94ms +0.1% 94ms 94ms +0.1%
NormalInstanceAttribute: 54ms 54ms +0.3% 54ms 54ms +0.2%
PythonFunctionCalls: 58ms 57ms +0.8% 58ms 58ms +0.6%
PythonMethodCalls: 65ms 61ms +6.3% 66ms 62ms +5.9%
Recursion: 84ms 85ms -1.0% 85ms 85ms -0.9%
SecondImport: 74ms 76ms -2.5% 74ms 77ms -3.5%
SecondPackageImport: 75ms 78ms -3.8% 76ms 79ms -3.9%
SecondSubmoduleImport: 163ms 169ms -3.4% 164ms 170ms -3.3%
SimpleComplexArithmetic: 43ms 43ms +1.0% 43ms 43ms +1.0%
SimpleDictManipulation: 80ms 78ms +2.2% 81ms 79ms +2.4%
SimpleFloatArithmetic: 42ms 42ms +0.1% 42ms 42ms -0.0%
SimpleIntFloatArithmetic: 52ms 53ms -1.2% 52ms 53ms -1.1%
SimpleIntegerArithmetic: 52ms 52ms -0.7% 52ms 53ms -0.8%
SimpleListComprehensions: 45ms 45ms -0.2% 45ms 45ms +0.3%
SimpleListManipulation: 44ms 46ms -4.0% 44ms 46ms -3.9%
SimpleLongArithmetic: 32ms 32ms -0.9% 32ms 32ms -0.1%
SmallLists: 58ms 57ms +1.2% 58ms 67ms -12.8%
SmallTuples: 64ms 65ms -0.5% 65ms 65ms -0.2%
SpecialClassAttribute: 148ms 149ms -0.8% 149ms 150ms -1.0%
SpecialInstanceAttribute: 54ms 54ms +0.2% 54ms 54ms +0.0%
StringMappings: 120ms 117ms +2.5% 120ms 117ms +2.5%
StringPredicates: 62ms 62ms +0.9% 62ms 62ms +1.0%
StringSlicing: 69ms 68ms +1.6% 69ms 68ms +2.1%
TryExcept: 37ms 37ms +0.0% 37ms 37ms +0.5%
TryFinally: 40ms 37ms +6.7% 40ms 37ms +6.5%
TryRaiseExcept: 19ms 20ms -1.0% 20ms 20ms -0.4%
TupleSlicing: 65ms 65ms +0.5% 66ms 65ms +1.2%
WithFinally: 57ms 56ms +1.9% 57ms 56ms +2.1%
WithRaiseExcept: 53ms 53ms +0.3% 54ms 54ms -0.8%
——————————————————————————-
Totals: 3154ms 3145ms +0.3% 3176ms 3177ms -0.0%

(this=nozombie, other=zombie)

I’m going to remove this weird, unbounded cache from the python interpreter we use on the PS3.

Advertisements

Optimizing the dict

This is another of those memory conservation stories on the PS3.

Our engineers were worried about how much memory was being spent/wasted in dictionaries. Python dicts are these sparse datastructures, optimized for performance and trading off memory usage to achieve speed.

This code shows you the memory used by dicts of various sizes:

for i in range(20): print i, sys.getsizeof(dict((j,j) for j in range(i)))
...
0 148
1 148
2 148
3 148
4 148
5 148
6 532
7 532
8 532
9 532
10 532
11 532
12 532
13 532
14 532
15 532
16 532
17 532
18 532
19 532

It’s rather striking that a 10 element dict on a 32 bit is consuming more than 1/2k of memory. I’m pretty sure BBC Basic can’t have used dicts.

Now, I was interested in tuning the dict implementation for the PS3, sacrificing performance for memory. Looking at the code led me to a file called dictnotes.txt explaining much about dicts. The section on tuning only considers performance. Too sparse a dict, you see, looses performance because of memory cache effects. Otherwise, I’m sure, we would want dicts infinitely sparse.

It turns out that only one parameter is easily tunable, PyDict_MINSIZE. Python 2.7 sets this to 8 for reasons of cache line size, although I find that an odd generalization across a huge number of platforms. in dictobject.h, I came across this comment:

* PyDict_MINSIZE is the minimum size of a dictionary. This many slots are
* allocated directly in the dict object (in the ma_smalltable member).
* It must be a power of 2, and at least 4.

It turns out this is wrong. Python will happily run with it set to 1.

As for the other tunable parameters, I ended up macrofying things that were hard-coded in various places in the code:

/* CCP change: Tunable parameters for dict growth */
#if _PyCCP_TIGHT_DICT
/* Save memory for dust */
#define _PyDICT_MAX_LOAD_NUM 4 /* grow at 80% load */
#define _PyDICT_MAX_LOAD_DENOM 5

#define _PyDICT_GROWTHRATE_NUM 3 /* scale by 1.5 */
#define _PyDICT_GROWTHRATE_DENOM 2
#else
/* max load 2/3, default python: */
#define _PyDICT_MAX_LOAD_NUM 2
#define _PyDICT_MAX_LOAD_DENOM 3

#define _PyDICT_GROWTHRATE_NUM (mp->ma_used > 50000 ? 2 : 4)
#define _PyDICT_GROWTHRATE_DENOM 1
#endif

And then later:

#if 0
    if (!(mp->ma_used > n_used && mp->ma_fill*3 >= (mp->ma_mask+1)*2))
        return 0;
    return dictresize(mp, (mp->ma_used > 50000 ? 2 : 4) * mp->ma_used);
#else
    /* CCP change, tunable growth parameters */
    if (!(mp->ma_used > n_used && mp->ma_fill*_PyDICT_MAX_LOAD_DENOM >= (mp->ma_mask+1)*_PyDICT_MAX_LOAD_NUM))
        return 0;
    return dictresize(mp, _PyDICT_GROWTHRATE_NUM * mp->ma_used / _PyDICT_GROWTHRATE_DENOM);
#endif

By using a smalltable of size 1, setting the max fill rate to 4/5 and growth rate to 1.5, we get this:
[python]
for i in range(20): print i, sys.getsizeof(dict((j,j) for j in range(i)))

0 64
1 88
2 112
3 160
4 160
5 160
6 256
7 256
8 256
9 256
10 256
11 256
12 448
13 448
14 448
15 448
16 448
17 448
18 448
19 448
[/python]

The size of the dicts is still governed by the fact that the tables must have sizes that are powers of two, so we can only delay the onset of growth. But at least we get rid of the super optimistic quadruple growth factor and the large small table.

Perhaps a different, again less optimal, version of the dict wouldn’t have that power of two requirement. There is no inherent need for that when hashing except that it makes for nice bitwise arithmetic.

Update:

The effect these changes had:

I just do a quick test. It saved 2MB in login screen. That’s awesome.

Thanks,

Kevin Zhang