# What, python, slow? No

## Or Numerics like it's 1959

## Or Programming High and Low

Python is famously dynamic and flexible. This flexibility comes at a cost, usually in runtime performance. That cost
is often more than offset by the boost in developer productivity. But what if python developers could have their cake
and eat it too? Flexibility is not mandatory. You can *opt out* of the features that have a large performance
impact if your problem warrants it.

I am going to show how old programming techniques let you speed up numerical code without rewriting tight loops in C++ or rust, thereby avoiding the two-language problem. Some of these techniques form the basis of my mental model for rust (disclaimer: I'm more crustacean than rustacean, by which I mean a curious bystander with some, mostly flawed, understanding of how CPUs work).

One of the best features of python is that it has no `new` keyword like C++. Python manages your program's memory so
you don't have to. Alan Perlis captured the difference between automatic and manual memory management in his famous
epigram: "A LISP programmer knows the value of everything, but the cost of nothing." Now don't get me wrong, I
absolutely *love* not having to manage memory myself. That said, there must surely be an explanation for why C and C++
insist on forcing it on the unsuspecting developer. The best explanation is that inefficient memory access can
annihilate your code's performance. An enormous effort has gone into the design of memory management in cpython.
Similarly, memory allocations feature prominently in the julia profiler. Quoting from that link,
"One of the most common techniques to improve performance is to reduce memory allocation."
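To make that advice concrete, here is a small illustration (my own sketch, not taken from any of the links above) of how reusing a pre-allocated buffer avoids a fresh allocation inside a hot loop:

```
import numpy as np
from timeit import timeit

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)
out = np.empty_like(a)

# Every call allocates a brand-new result array...
t_alloc = timeit(lambda: np.add(a, b), number=200)
# ...while `out=` reuses the same buffer on every call.
t_reuse = timeit(lambda: np.add(a, b, out=out), number=200)
print(f"fresh allocations: {t_alloc:.3f}s, reused buffer: {t_reuse:.3f}s")
```

The exact numbers depend on your machine, but the second variant never asks the allocator for new memory inside the timed loop.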

### Contrast with FORTRAN

Early versions of FORTRAN had a big drawback: they did not allow for dynamic memory allocation, forcing re-compilation
when array sizes changed. Work started on the first FORTRAN compiler in
1954. Lisp 1.5 already supported garbage collection
around that time, although the first lisp compiler did not appear until 1962. According to
Henry Spencer, "Backus and his team sweated quite hard on
optimization in the FORTRAN I compiler […], and so their objective was to generate code that was *better* than the
average hand-coded assembler. And astonishingly enough, in one of the first real compilers, they often succeeded." Many
agree that FORTRAN left memory management up to developers because of performance.

That made programming a rather tedious endeavor. Developers would carefully think through all the bits of memory that
their program would need. Often they would allocate a bunch of static arrays in the global scope (<gasp>). The
meaning of array elements would change in different parts of the program. This is clear from the number of
linear algebra subroutines which expect `WORK` parameters (see dgels.f for example).
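The same discipline translates to numerical python: allocate your scratch space once and pass it into the hot code, much like a `WORK` argument. A minimal sketch of the pattern (the function and its names are mine, purely illustrative):

```
import numpy as np

def normalize_rows(a, work):
    # `work` plays the role of a FORTRAN WORK array: the caller allocates it
    # once, and the routine reuses it instead of allocating temporaries.
    np.einsum("ij,ij->i", a, a, out=work)   # squared row norms
    np.sqrt(work, out=work)
    a /= work[:, np.newaxis]                # scale each row to unit length
    return a

points = np.random.rand(10_000, 3)
work = np.empty(10_000)                      # allocated once, FORTRAN style
for _ in range(100):                         # the hot loop allocates (almost) nothing
    normalize_rows(points, work)
```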

### Best of both worlds?

Does it have to be an either/or situation, or can python programmers have their cake and eat it too? How can python's
awesome versatility let us leverage these carefully coded numerical routines? The answer lies in
PEP 3118's buffer protocol. Quoting from
Jake VanderPlas' excellent introduction,
the "Python buffer protocol […] is a framework in which Python objects can expose raw byte arrays to other Python
objects." And when those "other Python objects" are thin wrappers around excellent low-level numerical libraries
like BLAS or
LAPACK, we can *embrace* the
Two-Language Problem instead of avoiding it.
And all of that without writing a single line in a language other than python. To be fair, the resulting code will
**not** be particularly idiomatic. But it will still be python and offer the
readability and portability we have come to expect.
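Here is a tiny demonstration of the protocol in action, with one block of memory shared by a `bytearray`, a NumPy array, and a `memoryview` without any copying (a sketch of my own, not code from the libraries mentioned above):

```
import numpy as np

raw = bytearray(8 * 4)                       # room for four 64-bit floats
arr = np.frombuffer(raw, dtype=np.float64)   # a NumPy view onto those bytes
view = memoryview(arr)                       # NumPy re-exports its buffer too

arr[:] = [1.0, 2.0, 3.0, 4.0]                # write through the array...
print(raw[:8])                               # ...and the raw bytes change
print(view[1])                               # 2.0, read through the memoryview
```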

### What does it look like?

Let's explore this idea through an example. A word of warning: this example is not for the faint of heart as it
uses some rather esoteric tricks. And of course, no optimization story would be complete without the usual
reminder that "premature optimization is the root of all evil."
We are going to consider the n-body benchmark. The python
benchmark links to a Wikipedia article which frames the problem
mathematically. I highly recommend you pick up Volume 1 of the
Course of Theoretical Physics if you want to really
understand the math. And if you enjoy reading how giants of physics wrote about celestial motion, you might also
like Feynman's Lost Lecture. The premise of that lecture is
wonderfully contrived: Isaac Newton could not use calculus to convince the Royal Society that an inverse-square force
leads to elliptical orbits because they would not have followed. So Newton used geometric arguments instead in his
Principia. Feynman thought Newton's
arguments were too complicated and used his pedagogical prowess to come up with a simpler one relying *only* on
results available at the time.

The n-body benchmark is a straightforward finite-difference integration of Newton's equations of motion:

\[\mathbf{\ddot{q}}_i = G \sum_{j \neq i} \frac{\mathbf{q}_j - \mathbf{q}_i}{\left\lVert \mathbf{q}_j - \mathbf{q}_i \right\rVert^3} m_j\]

The first thing we notice is that this expression is antisymmetric: the force exerted by object \(i\) on object \(j\) is equal in magnitude but opposite in sign to the force exerted by \(j\) on \(i\). This is reminiscent of the imaginary part of a Hermitian matrix. Hermitian matrices are so common in linear algebra that library designers came up with a packed storage scheme specifically for them.
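For reference, a direct NumPy transcription of this sum might look like the sketch below (names and structure are mine); it is short, but it allocates several temporary arrays on every call:

```
import numpy as np

def naive_accelerations(q, m, G=1.0):
    # q: (n, 3) positions, m: (n,) masses
    d = q[np.newaxis, :, :] - q[:, np.newaxis, :]   # d[i, j] = q_j - q_i
    r2 = np.sum(d * d, axis=-1)                     # squared distances
    np.fill_diagonal(r2, 1.0)                       # the j == i terms are zero anyway
    w = m[np.newaxis, :] / r2 ** 1.5                # m_j / |q_j - q_i|^3
    return G * np.sum(d * w[:, :, np.newaxis], axis=1)
```

The packed BLAS routines below compute the same quantity while reusing a fixed set of buffers.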

We are going to use 3 BLAS subroutines that operate on matrices in packed form:

| Operation | Routine | Computes |
| --- | --- | --- |
| Hermitian rank 1 operation | `zhpr` | \(A := \alpha x x^H + A\) |
| Hermitian matrix-vector multiplication | `zhpmv` | \(y := \alpha A x + \beta y\) |
| Symmetric rank 2 operation | `dspr2` | \(A := \alpha\left(x y^T + y x^T\right) + A\) |

Those names are a bit cryptic because of "the very tight limits of standard Fortran 77 6-character names". They make sense when you familiarize yourself with the reasonable naming scheme.
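scipy even lets you look routines up by that scheme: ask for the operation suffix and it picks the precision prefix from your arrays. A small sketch of my own using `scipy.linalg.get_blas_funcs`:

```
import numpy as np
from scipy.linalg import get_blas_funcs

# Name = <precision prefix><matrix kind><operation>:
#   z = complex double, d = real double
#   hp = Hermitian packed, sp = symmetric packed
#   r = rank-1 update, r2 = rank-2 update, mv = matrix-vector product
z = np.empty(3, dtype=np.complex128)
hpr, hpmv = get_blas_funcs(("hpr", "hpmv"), (z,))
print(hpr.typecode, hpmv.typecode)   # 'z' 'z': the complex double variants
```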

We need one more trick before we can leverage these routines to compute \(\mathbf{\ddot{q}}_i\). Let \(x\) and \(y\) be real numbers and \(i^2=-1\):

\[(x-i)\overline{(y-i)} = xy + 1 + i(x-y)\]

This trick suggests we should subtract \(i\) from \(\mathbf{q}_i\) before invoking `zhpr` with \(\alpha=-2\). This
operation initializes \(A_{ij}\) to \(-2\left[q_iq_j + 1 + i\left(q_i - q_j\right)\right]\). The real part of \(A\)
contains the information needed to compute \(\left\lVert \mathbf{q}_j - \mathbf{q}_i \right\rVert^2\) when we remember that

\[(x - y)^2 = x^2 - 2xy + y^2\]
Solving the expansion above for \((x-y)^2\) is precisely what `dspr2` does when \(\alpha=-\tfrac{1}{2}\),
\(x_i=\mathbf{q}_i\mathbf{q}^T_i\), and \(y_i=1\). Finally, `zhpmv` performs the sum over \(j\), seen as the scaled
displacement matrix multiplying the mass vector. Pretty clever, huh? All credit goes to
Paul Panzer for this amazing stackoverflow answer. Putting it all together, the inner loop to
calculate the accelerations \(\mathbf{a}\) from the positions \(q\) (stored in the array `x`, with no significant memory allocations) looks like

```
# assumes the numpy functions (sum, take, sqrt, multiply, greater, divide),
# the scipy.linalg.blas routines (zhpr, dspr2, zhpmv, daxpy), the scalars
# n and dt, and the pre-allocated arrays (a, ap, xxT, x2, d, where, ones,
# jm, trid, v, x) are all set up outside this loop
a.real = x
a.imag.fill(-1.0)
ap.fill(0.0)
for i in range(3):                   # A += α z z*
    zhpr(n, alpha=-2.0, x=a[i], ap=ap[i], overwrite_ap=True)
sum(ap.real, axis=0, out=xxT)
xxT += 6
take(xxT, trid, out=x2)
# A += α (x yᵀ + y xᵀ)
dspr2(n, alpha=-0.5, x=x2, y=ones, ap=xxT, overwrite_ap=True)
sqrt(xxT, out=d)
multiply(xxT, d, out=xxT)
greater(xxT, 0.0, out=where)
divide(ap.imag, xxT, where=where, out=ap.imag)
for i in range(3):                   # y = α A x + β y
    zhpmv(n, alpha=0.5, ap=ap[i], x=jm, beta=0.0, y=a[i], overwrite_y=True)
daxpy(a=dt, x=a.real, y=v)           # v += a dt
daxpy(a=dt, x=v, y=x)                # x += v dt
```

The full program can be found here.
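To give an idea of what "no significant memory allocations" requires, here is a hypothetical pre-allocation of the buffers the loop reuses; the names and shapes are inferred from the calls above, not copied from the original program:

```
import numpy as np

n = 5                                    # number of bodies
m = n * (n + 1) // 2                     # length of a packed triangle
x = np.zeros((3, n))                     # positions, one row per coordinate
v = np.zeros((3, n))                     # velocities
a = np.zeros((3, n), dtype=complex)      # positions minus i, later accelerations
ap = np.zeros((3, m), dtype=complex)     # packed displacement matrices
xxT = np.zeros(m)                        # packed squared distances
x2 = np.zeros(n)                         # diagonal extracted via `trid`
d = np.zeros(m)                          # scratch for the square roots
where = np.zeros(m, dtype=bool)          # guards the division by zero
ones = np.ones(n)                        # the y vector passed to dspr2
jm = np.ones(n)                          # (scaled) masses
```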

### Postscript

I came across this excellent FORTRAN vs python anecdote a couple of weeks
after originally publishing this post. The authors make an important, albeit counterintuitive, point about the
strength of high-level languages like python compared to low-level ones like FORTRAN: "By definition, a higher level
language will allow more explorations." Few programmers will completely redesign hundreds or thousands of lines of
working FORTRAN or C++. Addressing compiler errors and warnings takes long enough that
they are forced to move on to the next deadline. But a few dozen lines of python is a different story. And the
REPL even lets you explore different algorithms
*interactively*.

In conclusion, we should be more explicit when we say that a particular programming language is slow. Slow to do what? Let us not conflate iterating in tight inner loops with iterating in the problem domain.
