This is a general, “deepening my understanding”-type question and it isn’t about a particular piece of code.
If you read the performance section of this forum, a common pattern is for people to write code that does something like `xs[:] = ys + zs` inside of a loop. Replacing this with an elementwise `xs[i] = ys[i] + zs[i]` or similar often yields a dramatic improvement in speed because it obviates the intermediate allocation for `ys + zs`.
In Numpy, `xs[:] = ys + zs` also allocates an intermediate array (as I understand it), but rewriting this with a for loop is even worse for performance because the loop runs in native Python instead of optimized Numpy code. To avoid the intermediate allocation without writing a for loop, you can instead use ufuncs with the `out=` keyword, like `np.add(ys, zs, out=xs)`.
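Concretely, the two spellings look like this (the array size is arbitrary, just for illustration):

```python
import numpy as np

ys = np.random.rand(1_000_000)  # arbitrary size, just for illustration
zs = np.random.rand(1_000_000)
xs = np.empty_like(ys)

# Allocates a temporary array for ys + zs, then copies it into xs.
xs[:] = ys + zs

# Writes the sum directly into xs; no temporary array is created.
np.add(ys, zs, out=xs)
```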
The ufunc approach isn’t quite as flexible as Julia’s for loops. For a more complicated calculation like `xs[:] = xs**2 / (1 + np.exp(xs))`, if you aren’t allowed to write `a = xs[i]**2; b = np.exp(xs[i]); xs[i] = a / (1 + b)` (because you aren’t allowed to use a for loop), then you really do need intermediate arrays to hold `xs**2` and `np.exp(xs)`. Thus, an optimized Numpy implementation of the full function might look like:
```python
def foo(xs):
    """Set xs to xs**2 / (1 + np.exp(xs)), in place."""
    ys = np.empty_like(xs)          # the one intermediate allocation
    np.exp(xs, out=ys)
    np.add(1, ys, out=ys)           # now ys holds the denominator
    np.multiply(xs, xs, out=xs)     # now xs holds the numerator
    np.divide(xs, ys, out=xs)
```
This code gets away with just one allocation (`ys`), but it’s pretty illegible (IMO). You can push the allocation savings further by accepting `ys` as an argument to `foo()`, so that you can pre-allocate the memory when calling `foo()` repeatedly from a loop. And that’s exactly what I do in the large Numpy library I maintain at work: I have a bunch of function signatures like `foo(xs, buffer)` that modify `xs` in place and use `buffer` to store intermediate calculations. If you have Julia’s performance model in your head, but you also know that you aren’t supposed to use Python for loops with Numpy, then this design makes sense, right?
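Schematically, the buffered version of `foo()` looks something like this (the `data` list here is a made-up stand-in for whatever arrays the caller is actually looping over):

```python
import numpy as np

def foo(xs, buffer):
    """Set xs to xs**2 / (1 + np.exp(xs)) in place, using caller-provided scratch space."""
    np.exp(xs, out=buffer)         # buffer = exp(xs)
    np.add(1, buffer, out=buffer)  # buffer = 1 + exp(xs)  (the denominator)
    np.multiply(xs, xs, out=xs)    # xs = xs**2            (the numerator)
    np.divide(xs, buffer, out=xs)  # xs = xs**2 / (1 + exp(xs))

# Stand-in inputs; in the real library this would be whatever the caller iterates over.
data = [np.random.rand(1_000_000) for _ in range(10)]

# Allocate the scratch array once and reuse it on every call.
buffer = np.empty_like(data[0])
for xs in data:
    foo(xs, buffer)
```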
Anyway, this gets me to my question: On a whim, I decided to try removing all
the “buffer” arguments from my code and just let Numpy allocate new arrays.
After running a profile, I found no measurable performance impact. All of my
manual memory management was in vain. What gives?
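For concreteness, the kind of head-to-head timing I have in mind looks roughly like this (the array size and repeat count are arbitrary):

```python
import numpy as np
from timeit import timeit

def foo_buffered(xs, buffer):
    np.exp(xs, out=buffer)
    np.add(1, buffer, out=buffer)
    np.multiply(xs, xs, out=xs)
    np.divide(xs, buffer, out=xs)

def foo_naive(xs):
    xs[:] = xs**2 / (1 + np.exp(xs))  # allocates several temporaries per call

xs = np.random.rand(1_000_000)
buffer = np.empty_like(xs)

print(timeit(lambda: foo_buffered(xs, buffer), number=100))
print(timeit(lambda: foo_naive(xs), number=100))
```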
Indeed, if you search around for Numpy optimization tips, you’ll see a lot about avoiding native Python code in favor of ufuncs, and the occasional recommendation to write `xs[:] = ...` instead of `xs = ...`, but never the sort of fine-toothed “avoid every possible allocation” advice that is so often repeated on the Julia forum. It seems to me that Numpy just has a different performance model than Julia, one in which a stray allocation here and there simply isn’t that big of a deal. But why is this the case? Is it that the Python glue code already slows things down so much that the allocation benefits don’t matter? Or what?