Or Numerics like it’s 1959

Or Programming High and Low

Python is famously dynamic and flexible. This flexibility comes at a cost, often in runtime performance. That cost is usually more than offset by the boost in developer productivity. But what if python developers could have their cake and eat it too? Flexibility is not mandatory: you can opt out of the features with a large performance impact when your problem warrants it.

I am going to show how old programming techniques let you speed up numerical code without rewriting tight loops in C++ or rust, thereby avoiding the two-language problem. Some of these techniques form the basis of my mental model for rust (disclaimer: I’m more crustacean than rustacean, by which I mean a curious bystander with some, mostly flawed, understanding of how CPUs work).

One of the best features of python is that it has no new keyword like C++. Python manages your program’s memory so you don’t have to. Alan Perlis captured the difference between automatic and manual memory management in his famous “A LISP programmer knows the value of everything, but the cost of nothing” epigram. Now don’t get me wrong, I absolutely love not having to manage memory myself. That said, there must surely be an explanation for why C and C++ insist on forcing it on the unsuspecting developer. The best explanation is that inefficient memory access can annihilate your code’s performance. An enormous effort has gone into the design of memory management in cpython. Similarly, memory allocations feature prominently in the julia profiler. Quoting from that link, “One of the most common techniques to improve performance is to reduce memory allocation.”
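To make that advice concrete, here is a small, machine-dependent sketch (the array sizes and the out buffer are mine, not from any of the sources above) comparing a NumPy expression that allocates a fresh temporary on every call with one that reuses a preallocated output buffer:

import numpy as np
from timeit import timeit

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)
out = np.empty_like(a)  # preallocated buffer, reused on every call

# a + b allocates a brand new array each time; np.add(..., out=out) does not.
fresh = timeit(lambda: a + b, number=200)
reused = timeit(lambda: np.add(a, b, out=out), number=200)
print(f"fresh allocations: {fresh:.3f}s  reused buffer: {reused:.3f}s")

How large the gap is depends on your array sizes and allocator, but the second version never asks the memory manager for anything inside the loop.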

Contrast with FORTRAN

Early versions of FORTRAN had a big drawback: they did not allow for dynamic memory allocation, forcing re-compilation when array sizes changed. Work started on the first FORTRAN compiler in 1954. Lisp 1.5 already supported garbage collection around that time, although the first lisp compiler did not appear until 1962. According to Henry Spencer, “Backus and his team sweated quite hard on optimization in the FORTRAN I compiler […], and so their objective was to generate code that was better than the average hand-coded assembler. And astonishingly enough, in one of the first real compilers, they often succeeded.” Many agree that FORTRAN left memory management up to developers because of performance.

That made programming a rather tedious endeavor. Developers would carefully think through all the bits of memory that their program would need. Often they would allocate a bunch of static arrays in the global scope (<gasp>). The meaning of array elements would change in different parts of the program. This is clear when you see the number of linear algebra subroutines which expect WORK parameters (see dgels.f for example).
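To give a flavor of that style in modern terms, here is a small sketch in NumPy (the names, shapes, and functions are made up for illustration, not taken from any FORTRAN code): scratch space allocated once at the top, reused with out= arguments, its meaning changing from one phase of the program to the next.

import numpy as np

N = 1000
# FORTRAN-style: scratch space sized once, up front, for the whole program.
WORK = np.empty(N)

def center(samples):
    # Phase 1: WORK holds deviations from the mean.
    np.subtract(samples, samples.mean(), out=WORK[: samples.size])
    return WORK[: samples.size]

def rescale(samples):
    # Phase 2: the very same bytes now hold rescaled values.
    np.divide(samples, samples.std(), out=WORK[: samples.size])
    return WORK[: samples.size]

The price, of course, is that every caller has to know who else is scribbling over WORK — exactly the kind of bookkeeping those WORK parameters pushed onto the user.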

Best of both worlds?

Does it have to be an either/or situation, or can python programmers have their cake and eat it too? How can python’s awesome versatility let us leverage these carefully coded numerical routines? The answer lies in PEP 3118’s buffer protocol. Quoting from Jake VanderPlas’ excellent introduction, the “Python buffer protocol […] is a framework in which Python objects can expose raw byte arrays to other Python objects.” And when those “other Python objects” are thin wrappers around excellent low-level numerical libraries like BLAS or LAPACK, we can embrace the Two-Language Problem instead of avoiding it. And all of that without writing a single line in a language other than python. To be fair, the resulting code will not be particularly idiomatic. But it will still be python and offer the readability and portability we have come to expect.
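Here is a minimal sketch of the buffer protocol at work (the variable names are mine): a plain Python bytearray exposes its raw bytes, and NumPy wraps them without copying, so both objects see the same memory.

import numpy as np

raw = bytearray(4 * 8)                        # 32 bytes owned by a plain Python object
view = np.frombuffer(raw, dtype=np.float64)   # zero-copy: NumPy adopts the buffer
view[:] = [1.0, 2.0, 3.0, 4.0]

raw[0:8] = bytes(8)                           # poke the underlying bytes directly...
print(view[0])                                # ...and the ndarray sees 0.0: same memory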

What does it look like?

Let’s explore this idea through an example. A word of warning: this example is not for the faint of heart as it uses some rather esoteric tricks. And of course, no optimization story would be complete without the usual reminder that “premature optimization is the root of all evil.” We are going to consider the n-body benchmark. The python benchmark links to a Wikipedia article which frames the problem mathematically. I highly recommend you pick up Volume 1 of the Course of Theoretical Physics if you want to really understand the math. And if you enjoy reading how giants of physics write about celestial motion, you might also like Feynman’s Lost Lecture. The premise of that lecture is wonderfully contrived: Isaac Newton could not use calculus to convince the Royal Society that an inverse-square force leads to elliptical orbits because they would not have followed. So Newton used geometric arguments instead in his Principia. Feynman thought Newton’s arguments were too complicated and used his pedagogical prowess to come up with a simpler one relying only on results available at the time.

The n-body benchmark is a straightforward finite-difference integration of Newton’s equations of motion:

\[\mathbf{\ddot{q}}_i = G \sum_{j \neq i} \frac{\mathbf{q}_j - \mathbf{q}_i}{\left\lVert \mathbf{q}_j - \mathbf{q}_i \right\rVert^3} m_j\]
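As a reference point, here is a naive, allocation-heavy NumPy transcription of the right-hand side (the function and argument names are mine, not the benchmark’s); the rest of this post is about computing the same quantity without all the temporaries.

import numpy as np

def acceleration(q, m, G=1.0):
    """q: (n, 3) array of positions, m: (n,) array of masses."""
    d = q[None, :, :] - q[:, None, :]        # displacements q_j - q_i, shape (n, n, 3)
    r2 = np.einsum("ijk,ijk->ij", d, d)      # squared distances |q_j - q_i|^2
    np.fill_diagonal(r2, 1.0)                # dummy value to avoid 0/0 on the diagonal
    w = G * m[None, :] * r2 ** -1.5          # m_j / |q_j - q_i|^3
    np.fill_diagonal(w, 0.0)                 # no self-interaction
    return np.einsum("ij,ijk->ik", w, d)     # sum over j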

The first thing we notice is that this expression is antisymmetric: the force exerted by object \(i\) on object \(j\) is equal in magnitude and opposite in sign to the force exerted by \(j\) on \(i\). This is reminiscent of the imaginary part of a Hermitian matrix. Hermitian matrices are so common in linear algebra that library designers came up with a packed storage scheme specifically for them. We are going to use three BLAS subroutines that operate on matrices in packed form:

Hermitian rank-1 update zhpr: \(A := \alpha x x^H + A\)
Hermitian matrix-vector multiplication zhpmv: \(y := \alpha A x + \beta y\)
Symmetric rank-2 update dspr2: \(A := \alpha\left(x y^T + y x^T\right) + A\)

Those names are a bit cryptic because of “the very tight limits of standard Fortran 77 6-character names”. They make sense when you familiarize yourself with the reasonable naming scheme.
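If “packed form” sounds exotic, this little check shows what it means in practice: zhpr returns only the upper triangle of \(x x^H\), column by column, in a flat array. The vector and the unpacking code below are mine, and they assume scipy’s default upper-triangle packed layout.

import numpy as np
from scipy.linalg.blas import zhpr

n = 3
x = np.array([1 + 2j, 3 - 1j, 0.5 + 0j])
ap = zhpr(n, alpha=1.0, x=x, ap=np.zeros(n * (n + 1) // 2, dtype=complex))

# Unpack the column-major upper triangle and rebuild the full Hermitian matrix.
full = np.zeros((n, n), dtype=complex)
rows, cols = np.tril_indices(n)
full[cols, rows] = ap                         # entries (i, j) with i <= j, column by column
full += full.conj().T - np.diag(full.diagonal())
assert np.allclose(full, np.outer(x, x.conj()))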

We need one more trick before we can leverage these routines to compute \(\mathbf{\ddot{q}}_i\). Let \(x\) and \(y\) be real numbers and \(i^2=-1\):

\[(x-i)\overline{(y-i)} = xy + 1 + i(x-y)\]
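A quick numerical sanity check of that identity (the specific numbers are arbitrary):

import numpy as np

x, y = 2.0, 5.0
lhs = (x - 1j) * np.conj(y - 1j)
rhs = x * y + 1 + 1j * (x - y)
assert np.isclose(lhs, rhs)   # both equal 11 - 3j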

This trick suggests we should subtract \(i\) from \(\mathbf{q}_i\) before invoking zhpr with \(\alpha=-2\). This operation initializes \(A_{ij}\) to \(-2\left[q_iq_j + 1 + i\left(q_i - q_j\right)\right]\). The real part of \(A\) contains information to compute \(\left\lVert \mathbf{q}_j - \mathbf{q}_i \right\rVert^2\) when we remember that

\[-2 xy = (x - y)^2 - x^2 - y^2\]

Solving this equation for \((x-y)^2\) is precisely what dspr2 does when \(\alpha=-\tfrac{1}{2}\), \(x_i = -2\,\mathbf{q}_i \cdot \mathbf{q}_i\) (read off the diagonal of the real part of \(A\)), and \(y_i=1\). Finally, zhpmv performs the sum over \(j\), seen as the scaled displacement matrix multiplying the mass vector. Pretty clever, huh? All credit goes to Paul Panzer for this amazing Stack Overflow answer. Putting it all together, the inner loop that calculates the acceleration \(\mathbf{a}\) from the positions \(\mathbf{q}\) (with no significant memory allocations) looks like this:

from numpy import divide, greater, multiply, sqrt, sum, take
from scipy.linalg.blas import daxpy, dspr2, zhpmv, zhpr

# x, v, a, ap, xxT, x2, d, where, ones, jm, trid, n and dt are all allocated
# once, outside this loop (see the full program).
a.real = x
a.imag.fill(-1.0)
ap.fill(0.0)
for i in range(3):  # A += α z z*
    zhpr(n, alpha=-2.0, x=a[i], ap=ap[i], overwrite_ap=True)

sum(ap.real, axis=0, out=xxT)  # -2 qᵢ·qⱼ - 6, summed over the three coordinates
xxT += 6  # cancel the constant left over from the complex trick
take(xxT, trid, out=x2)  # diagonal entries: -2 |qᵢ|²
# A += α (x yᵀ + y xᵀ)
dspr2(n, alpha=-0.5, x=x2, y=ones, ap=xxT, overwrite_ap=True)

sqrt(xxT, out=d)  # xxT now holds |qⱼ - qᵢ|², so d = |qⱼ - qᵢ|
multiply(xxT, d, out=xxT)  # |qⱼ - qᵢ|³
greater(xxT, 0.0, out=where)  # mask out the zero diagonal
divide(ap.imag, xxT, where=where, out=ap.imag)

for i in range(3):  # y = α A x + β y
    zhpmv(n, alpha=0.5, ap=ap[i], x=jm, beta=0.0, y=a[i], overwrite_y=True)

daxpy(a=dt, x=a.real, y=v)  # v += a dt
daxpy(a=dt, x=v, y=x)  # x += v dt

The full program can be found here.
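For readers who want to experiment with the snippet above in isolation, here is one plausible set of one-time allocations, with shapes inferred from how each buffer is used in the loop. None of this is taken from the original program, and the details (jm and trid in particular) may differ; the diagonal-index formula assumes the default upper-triangle packing used above.

import numpy as np

n = 5                                   # number of bodies
m = n * (n + 1) // 2                    # entries in a packed triangular matrix

x = np.zeros((3, n))                    # positions, one row per coordinate
v = np.zeros((3, n))                    # velocities
jm = np.ones(n)                         # masses m_j (may be pre-scaled in the full program)
a = np.zeros((3, n), dtype=complex)     # scratch: q - i going in, accelerations coming out
ap = np.zeros((3, m), dtype=complex)    # packed Hermitian scratch, one per coordinate
xxT = np.zeros(m)                       # packed squared distances
d = np.zeros(m)                         # packed distances
where = np.zeros(m, dtype=bool)         # mask to skip the zero diagonal when dividing
x2 = np.zeros(n)                        # -2 |qᵢ|², read off the diagonal of xxT
ones = np.ones(n)
# Diagonal entry (j, j) sits at index j(j+3)/2 in the packed upper-triangle layout.
trid = np.arange(n) * (np.arange(n) + 3) // 2
dt = 0.01                               # integration time step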

Postscript

I came across this excellent FORTRAN vs python anecdote a couple of weeks after originally publishing this post. The authors make an important, albeit counterintuitive, point about the strength of high-level languages like python compared to low-level ones like FORTRAN: “By definition, a higher level language will allow more explorations.” Few programmers will completely redesign hundreds or thousands of lines of working FORTRAN or C++. Addressing compiler errors and warnings takes long enough that they are forced to move on to the next deadline. But a few dozen lines of python is a different story. And the REPL even lets you explore different algorithms interactively.

In conclusion, we should be more explicit when we say that a particular programming language is slow. Slow to do what? Let us not conflate iterating in tight inner loops with iterating in the problem domain.

Do it for yourself

Click here to copy the entire script to the clipboard, paste it in the dialog below, then hit Shift + Enter to run!