Benchmarking Sphinx
PyPy has an extensive benchmark suite. On most of the microbenchmarks we do quite well. One of the more complete benchmarks is one that builds a sphinx documentation website. On the benchmarking machine, the benchmark takes ~175 seconds on PyPy3 without the JIT and ~80 seconds with the JIT, while CPython finishes it in about 45 seconds. This is a good opportunity to explore how to analyze PyPy's performance and suggest paths forward to speed it up.
The problem is to find out what exactly is going on and where the slowdowns are. Usually you need to combine cProfile and JIT profiling. A typical workflow is to use cProfile with both CPython and PyPy to get a deterministic call tree, then use vmprof as a sampling profiler, and finally use the JIT log for the actual JITted code.
Running the benchmark
Here is how to run the benchmark in a POSIX shell:
hg clone https://foss.heptapod.net/pypy/benchmarks
cd benchmarks
PYTHONPATH=lib:lib/jinja2/ <python> \
    benchmarks/lib/cpython-doc/tools/sphinx-build.py -b html
Since we want the benchmark to run consistently over many versions of python, all the third-party packages are part of the benchmark repo: docutils, sphinx, and of course the source documentation itself. So we use PYTHONPATH to run the benchmark. Replace <python> with the version of python you wish to benchmark. To run PyPy without the JIT, use <pypy> --jit off.
Getting started: cProfile
OK, now we know how to run the benchmark, but we are looking for places PyPy could do better. One way to profile is to use the built-in cProfile module. We can write the output to a file. However, any deterministic profiler slows the profiled program down considerably, so this is best used for a rough measure of performance. We can run the command above, but instead of <python>, use <python> -m cProfile.
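For example, the profile can be written to a file with cProfile's -o option and inspected afterwards with the standard pstats module; the file name sphinx.prof below is just a placeholder:

import pstats

# load the saved profile and show the 20 most expensive entries by cumulative time
stats = pstats.Stats("sphinx.prof")
stats.sort_stats("cumulative").print_stats(20)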
Pickle
The output from a CPython run and a PyPy run can be diffed. That quickly pointed to write_tree as a place where something is off: CPython runs it in ~1 sec and PyPy takes ~44 secs. Sphinx uses pickle underneath to save intermediate results. CPython has the _pickle C extension module built in; PyPy uses a pure-Python implementation, so some of this slowdown may be due to cProfile overhead on the Python code. We know pickling dictionaries is 8x slower on PyPy, but 44x is a bit extreme. If PyPy had a faster pickle implementation, that would make many things faster. A nice task for someone who wants to learn about RPython might be to transpile the _pickle.c code (or the pure-Python parts of pickle.py that _pickle.c reimplements), add tests, and make it fast.
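As a rough sketch of such a comparison (the file names and the 2x cutoff are made up for illustration), both profiles can be loaded with pstats and compared per function via the Stats.stats mapping, whose values are (primitive calls, total calls, tottime, cumtime, callers) tuples:

import pstats

# load two saved profiles from a CPython run and a PyPy run (hypothetical file names)
cpython = pstats.Stats("sphinx-cpython.prof").stats
pypy = pstats.Stats("sphinx-pypy.prof").stats

for func, (cc, nc, tt, ct, callers) in cpython.items():
    if func in pypy:
        ct_pypy = pypy[func][3]      # cumulative time in the PyPy run
        if ct_pypy > 2 * ct:         # arbitrary "much slower on PyPy" cutoff
            filename, line, name = func
            print(f"{name} ({filename}:{line}): CPython {ct:.1f}s, PyPy {ct_pypy:.1f}s")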
docutils
Another easy comparison is the docutils (a pure-Python library) nodes._fast_traverse function. PyPy takes ~70s where CPython needs ~18s. _fast_traverse is a six-line function:
def _fast_traverse(self, cls):
    """Specialized traverse() that only supports instance checks."""
    result = []
    if isinstance(self, cls):
        result.append(self)
    for child in self.children:
        result.extend(child._fast_traverse(cls))
    return result
It does a check, appends to a list, then recurses, and returns the list. Recursion is slow on both CPython and PyPy, so rewriting this as a non-recursive function would be a clear win. But this benchmark is now fixed in time: we could add a new benchmark with the changed code, but that would not help with this one.
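For illustration, a non-recursive version could walk the tree with an explicit stack; this is only a sketch of the idea, not a patch that went into docutils:

def _fast_traverse(self, cls):
    """Specialized traverse() that only supports instance checks."""
    result = []
    stack = [self]
    while stack:
        node = stack.pop()
        if isinstance(node, cls):
            result.append(node)
        # push children in reverse so they are visited in document order
        stack.extend(reversed(node.children))
    return result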
Next step: the JIT log
Now that we know a bit more about the performance bottlenecks, let's dive into what is going on inside the JIT. We can run the command above a little differently in order to create a log of the JIT activity:
PYPYLOG=jit-log-opt,jit-summary,jit-backend-counts:out-sphinx \
    PYTHONPATH=lib:lib/jinja2/ ~/bin/pypy \
    lib/cpython-doc/tools/sphinx-build.py -b html \
    lib/cpython-doc/ lib/cpython-doc/build/ -E
This PYPYLOG value saves only some of the JIT information. Like cProfile, writing this information out slows down the JIT.
The last section of the log (backend-counts) summarizes counts of the bridges and loops, so you can see which are the "busiest" loops. You can then search for the target token (the address of the loop) elsewhere in the log to find the corresponding trace.
Just before that is the jit-summary section. It tells you the time spent tracing and in the backend, which is about 15% in our case.
One other hack is to sort the debug_merge_points by function name:
grep debug_merge_point out-sphinx | cut -d ' ' -f 5,6,7,9 | sort | uniq -c | sort --numeric-sort
The JIT made way too many versions of _fast_traverse, and it is recursive. This is because it makes bridges (a failed guard calling out to other code). So in the log there is a # bridge ... comment, and it has an address. We found one that points to the guard on the loop call, generating a new version for each combination of type(self) and type(cls). This is silly: most of the calls to _fast_traverse will create such a bridge.
Aside: why is there a jit code ll_listcontains... ? This always happens when looking at a trace: you find things that drag you down a rabbit hole.
One approach to solving this would be to "give up": do not create another guard/bridge for each specialization. Another would be to check the depth of the chain of bridges and do something if it is too deep. We need to be careful not to, for instance, make translation slower.
In order to work on this in the JIT, we would need to recreate the situation in a test: a simple class implementation with a function that looks like _fast_traverse, and a class-generating factory that creates classes in a way that approximately recreates the sphinx situation.
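A rough sketch of the shape such a test could take (all the names here are invented for illustration; an actual JIT test would live in the RPython test suite):

class Node:
    def __init__(self, children=None):
        self.children = children or []

    def _fast_traverse(self, cls):
        result = []
        if isinstance(self, cls):
            result.append(self)
        for child in self.children:
            result.extend(child._fast_traverse(cls))
        return result

def make_node_classes(n):
    # mimic docutils/sphinx defining many node subclasses
    return [type("Node%d" % i, (Node,), {}) for i in range(n)]

def build_tree(classes, depth):
    # build a small tree that mixes the generated classes, roughly like a document tree
    if depth == 0:
        return classes[0]()
    children = [build_tree(classes, depth - 1) for _ in range(3)]
    return classes[depth % len(classes)](children)

classes = make_node_classes(20)
tree = build_tree(classes, 8)
for cls in classes:
    tree._fast_traverse(cls)   # each call guards on a different cls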