Why is Julia's performance more sensitive to memory allocations than Numpy's?

This is a general, “deepening my understanding”-type question and it isn’t about a particular piece of code.

If you read the performance section of this forum, a common pattern
is for people to write code that does something like xs[:] = ys + zs inside of
a loop. Replacing this with an elementwise xs[i] = ys[i] + zs[i] or similar
often yields a dramatic improvement in speed because it obviates the intermediate
allocation ys + zs.

In Numpy, xs[:] = ys + zs also allocates an intermediate array (as I
understand it), but rewriting this with a for loop is even worse for performance
because the loop runs in native Python instead of optimized Numpy code. To avoid
the intermediate allocation without writing a for loop, you can use ufuncs
with the out= keyword instead, like np.add(ys, zs, out=xs).

The ufunc approach isn’t quite as flexible as Julia’s for loops: For complicated
calculations like xs[:] = xs**2 / (1 + np.exp(xs)), if you aren’t allowed to
do a = xs[i]**2; b = np.exp(xs[i]); xs[i] = a / (1 + b) (because you aren’t
allowed to use a for loop), then you really need to use intermediate arrays to
hold xs**2 and np.exp(xs). Thus, an optimized Numpy implementation of the
full function might look like:

def foo(xs):
    """Set xs to xs**2 / (1 + np.exp(xs)), in place.
    """
    np.exp(xs, out=ys)
    np.add(1, ys, out=ys)         # now ys holds the denominator
    np.multiply(xs, xs, out=xs)   # now xs holds the numerator
    np.divide(xs, ys, out=xs)

This code lets you get away with just one allocation (ys) but is pretty
illegible (IMO). You can push the allocation savings further by accepting ys
as an argument to foo() so that you can pre-allocate the memory when calling
foo() repeatedly from a loop. And that’s exactly what I do in the large Numpy
library I maintain at work. I have a bunch of function signatures like
foo(xs, buffer) that modify xs in-place, and use buffer to store
intermediate calculations. If you have Julia’s performance model in your head,
but then also know that you aren’t supposed to use Python for loops in Numpy,
then this design makes sense, right?

Anyway, this gets me to my question: On a whim, I decided to try removing all
the “buffer” arguments from my code and just let Numpy allocate new arrays.
After running a profile, I found no measurable performance impact. All of my
manual memory management was in vain. What gives?

Indeed, if you search around for Numpy optimization tips, you’ll see a lot about
avoiding native Python code in favor of ufuncs, and the occasional
recommendation to write xs[:] = ... instead of xs = ..., but never the sort
of fine-toothed “avoid every possible allocation” advice that is so often
repeated on the Julia forum. It seems to me that Numpy just has a different
performance model than Julia in which the stray allocation here and there simply
isn’t that big of a deal. But why is this the case? Is it that the Python glue
code already slows things down so much that the allocation benefits don’t
matter? Or what?

4 Likes

It’s a rule of thumb, is true, while allocations in loops can be fast as a workaround with:

Besides Libc.malloc and free, in Julia’s stdlib, there’s also:

1 Like

As I understand it this is largely a combination of several factors related to the differences in Python and Julia’s memory management:

  1. python uses Reference counting so it can tell exactly when an object is dead (at the cost of overhead)
  2. Numpy (IIRC) stores the most recently deallocated array so it can give it right back to you if your next operation needs an array of exactly the same size
  3. Python isn’t multi-threaded which makes some of this easier
  4. Numpy has a higher per-call overhead which makes the difference between saving the allocation less of a big difference.
16 Likes

But is the premise true?

What we have to avoid in Julia is the repeated allocation inside critical loops.

A single vectorized calculation with large arrays, with intermediates, i. e.

y = a * x + b

is slower in Julia than with numpy?

If the operations are much more expensive than the allocations, there is no reason for either to be faster.

Now if the arrays are small and that is being called thousands of times inside a loop, the performance is probably pretty bad in both cases, without avoiding the allocations.

Edit: Could resist testing, although the numpy benchmarks might be wrong:

Few allocations in large computation:

julia>  a=rand(10_000,10_000); x=rand(10_000); b=rand(10_000);

julia> import LinearAlgebra; LinearAlgebra.BLAS.set_num_threads(1)

julia> using Chairmarks

julia> @b $a * $x + $b
42.324 ms (6 allocs: 156.375 KiB)
In [38]: x = np.random.random(10_000)

In [39]: a = np.random.random((10_000,10_000))

In [40]: b = np.random.random(10_000)

In [41]: %timeit a @ x + b
52 ms ± 438 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Repeated allocations in loop:

julia> a=rand(10,10); x=rand(10); b=rand(10);

julia> using LinearAlgebra: norm

julia> function test(a,x,b)
           s = 0.0
           for i in 1:10_000
               y = a * x + b
               s += norm(y)
           end
           return s
       end
test (generic function with 1 method)

julia> @b test($a, $x, $b)
1.501 ms (40000 allocs: 2.747 MiB)
In [52]: x = np.random.random(10)

In [53]: a = np.random.random((10,10))

In [54]: b = np.random.random(10)

In [55]: def test(a,x,b) :
    ...:     s = np.float64(0)
    ...:     for i in range(0,10_000) :
    ...:         y = a @ x + b
    ...:         s += np.linalg.norm(y)
    ...:     return s
    ...: 

In [56]: %timeit test(a,x,b)
39.7 ms ± 382 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

As I mentioned, the python benchmark might be wrong, but I don’t see the premise being true. It is just that, avoiding allocations, one can get much faster, and for loops being fast can make that easy:

julia> a=rand(SMatrix{10,10,Float64}); x=rand(SVector{10,Float64}); b=rand(SVector{10,Float64});

julia> @b test($a, $x, $b)
10.047 μs

7 Likes

For the example calculation in the OP, the allocation overhead with Julia on my machine is 23 % or so for moderately sized arrays and up, and larger for small array sizes where the cost of each allocation hits a size-independent lower bound. That’s definitely measurable, but not overwhelming. If the difference in Python is truly not measurable I bet the main reason is some numpy reuse magic as mentioned by @Oscar_Smith.

Either way, an important takeaway is that the relative cost of allocations depends on the details of the algorithm and how the code is structured. It’s not always catastrophic, and can even vanish with growing problem size if the computational cost scales faster than linearly with the array size. Matrix-matrix [matrix-vector] multiplication is a typical example, where allocating the output is O(N^2) [O(N)] and the computation itself is O(N^3) [O(N^2)].

The most important heuristic is that if you allocate in each iteration of your innermost loop, that’s often an easy and worthwhile target for improving performance. That’s almost never the case when using numpy.

Benchmarks

Side note: look at the implementation of foo_allocating—if your calculation is naturally written as a fused broadcast/plain loop, it’s actually hard work to get numpy-style allocation behavior in Julia. The language pushes you towards a less allocating, more efficient implementation.

julia> function foo_allocating!(xs)
           # Need to put intermediate computations on separate lines
           # to avoid loop fusing
           xssq = xs.^2
           expxs = exp.(xs)
           expxsp1 = 1 .+ expxs
           newxs = xssq ./ expxsp1
           xs .= newxs
           return nothing
       end;

julia> function foo_preallocated!(xs, buffer)
           # Like the OP's numpy example
           buffer .= exp.(xs)
           buffer .+= 1
           xs .= xs.^2
           xs ./= buffer
           return nothing
       end;

julia> function foo_fusedloop!(xs)
           xs .= xs.^2 ./ (1 .+ exp.(xs))
           return nothing
       end;

julia> using BenchmarkTools

julia> N = 10; buffer = zeros(N);

julia> @btime foo_allocating!(xs) setup=(xs = rand(N));
  396.430 ns (4 allocations: 576 bytes)

julia> @btime foo_preallocated!(xs, $buffer) setup=(xs = rand(N));
  146.895 ns (0 allocations: 0 bytes)

julia> @btime foo_fusedloop!(xs) setup=(xs = rand(N));
  128.114 ns (0 allocations: 0 bytes)

julia> N = 1000; buffer = zeros(N);

julia> @btime foo_allocating!(xs) setup=(xs = rand(N));
  12.276 μs (4 allocations: 31.75 KiB)

julia> @btime foo_preallocated!(xs, $buffer) setup=(xs = rand(N));
  9.968 μs (0 allocations: 0 bytes)

julia> @btime foo_fusedloop!(xs) setup=(xs = rand(N));
  10.269 μs (0 allocations: 0 bytes)

julia> N = 100000; buffer = zeros(N);

julia> @btime foo_allocating!(xs) setup=(xs = rand(N));
  1.233 ms (8 allocations: 3.05 MiB)

julia> @btime foo_preallocated!(xs, $buffer) setup=(xs = rand(N));
  1.052 ms (0 allocations: 0 bytes)

julia> @btime foo_fusedloop!(xs) setup=(xs = rand(N));
  1.032 ms (0 allocations: 0 bytes)
3 Likes

As an aside for the NumPy optimization part, you can do fast loops over supported NumPy operations on scalars in Numba. I’m pretty sure Numba makes NumPy code on arrays just behave normally instead of doing any optimizations like loop fusion, but it’s been a few versions since I used it and measuring allocation overhead was never easy (not that it’s easy anywhere). EDIT: nvm there is a loop fusion optimization, it’s just triggered by variables exclusive to an operations pipeline rather than expression-wise dot syntax.

One difference I see in the benchmarks here is that python reports the mean time, whereas Julia reports probably the minimum? At least that’s what BenchmarkTools does. The mean time might be impacted by outliers.

3 Likes

Thats for numba not numpy

Numba is designed to optimize NumPy, which is part of the topic.

That was my first thought too.

Implicitly, the performance model of Julia assumes that you don’t allocate in tight loops, but allocations outside those are pretty innocuous. And the recent threaded GC improvements are fantastic.

3 Likes

Does this mean better GC performance in multi-threaded code, or general performance improvements by parallelizing the GC?

2 Likes

Yes, true, but that doesn’t change the overall profile. I was about to replace that in the post, but in the machine I’m now Julia is 2x faster (mean considered) than numpy in the first case and I have no idea why, so I prefer not to enter into that.

Thanks, all! Yes, if it wasn’t clear from the OP, I am interested in loop operations, not one-time allocations.

Couple of follow up Qs:

  1. As @danielwe’s benchmark shows, sometimes it is necessary in Julia as well to introduce buffer arguments to functions to maximize memory reuse (and minimize allocations). These can impede legibility and are prone to creating bugs if e.g. a buffer argument is used as a function’s return value (as sometimes is needed to save memory). How do good Julia developers handle this emergent “manual memory management”? Are there any conventions for naming buffer arguments/positioning them in a function signature? Would you still put an exclamation mark in foo!(x, buffer) if it only mutates the buffer and not x? I appreciate @Palli’s recommendations but I actually want to go the other way—to make my code as readable as possible while still providing the compiler/interpreter whatever affordances it needs to perform well.
  2. Has there been any discussion of doing some of the Numpy-style magic vector reuse stuff in Julia? If I’m understanding correctly, that Numpy sugar is doing a lot of helpful work in typical code monkey notebooks, e.g. repeated, non-inplace Pandas operations for trivial unit conversions etc. which would other be allocating MB per call. I understand that multi-threading is a challenge, although the Julia docs warn that thread safety is basically the user’s responsibility (Multi-Threading · The Julia Language).

Sometimes. In this example, all of the methods are undefined for vectors, so as soon as you add dots to fix the method errors in xs ^ 2 / (1 + exp(xs)), all you have to remember to do is use .= instead of = to prevent allocations. But if the expression were xs = A * ys + zs instead, it’s hard to argue that Julia’s design “pushes you” to, uh, muladd!()? (See https://discourse.julialang.org/t/muladd-useful/89517/.) You just have to know (from profiling, experience, or Discourse) that this introduces unnecessary allocations. And that’s fine.

But it is kinda nice that in Numpy, both xs = xs ** 2 / (1 + np.exp(xs)) and xs = A @ ys + zs are equally good or bad: Both yield correct math, both allocate, and both allow Numpy to optimize away some of the allocation cost with the memory reuse magic. I’m not trying to complain, but I am interested in the very Julian question of whether a language can be both highly legible (“reads like the math”) and free of performance gotchas.

1 Like

But is that worse in Julia than with numpy? I’m missing the benchmarks showing the situations in which that’s the case.

This isn’t a question about whether Numpy vs. Julia is faster, it’s about how they differ in terms of the coding style that yields the best performance in that language.

Array languages like APL use nonmutating operations so you don’t have to pass buffers around. They can be very fast. How does BQN perform?

Well, I think the question makes sense when there’s an example where it is clear that one allocation model is different than the other. So, for example, in this example where allocations play an important role, but are not determining for the runtime:

julia> using LinearAlgebra: norm

julia> using Chairmarks

julia> function test(a,x,b)
           s = 0.0
           for i in 1:10_000
               y = a * x + b
               s += norm(y)
           end
           return s
       end
test (generic function with 1 method)

julia> a=rand(1000,1000); x=rand(1000); b=rand(1000);

julia> @b test($a, $x, $b)
1.296 s (60000 allocs: 153.809 MiB, 1.01% gc time, without a warmup)

vs.

In [64]: import numpy as np

In [65]: def test(a,x,b) :
    ...:     s = np.float64(0)
    ...:     for i in range(0,10_000) :
    ...:         y = a @ x + b
    ...:         s += np.linalg.norm(y)
    ...:     return s
    ...: 

In [66]: a = np.random.random((1000,1000))

In [67]: x = np.random.random(1000)

In [68]: b = np.random.random(1000)

In [69]: %timeit test(a,x,b)
1.3 s ± 275 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Both perform equally.

The non-allocating Julia version is:

julia> function test(a,x,b)
           s = 0.0
           y = similar(x)
           for i in 1:10_000
               y .= mul!(y,a,x) .+ b
               s += norm(y)
           end
           return s
       end
test (generic function with 1 method)

julia> @b test($a, $x, $b)
1.207 s (3 allocs: 7.875 KiB, without a warmup)

So, here, where the computation time is much larger than allocation time, there is no such great sensitivity to memory allocation. In python that will probably be the same (the effort to write non-allocating code will not payoff).

I don´t see an example where the tradeoffs of sensitivities are different in both languages. What happens, I think, is that when using numpy one already writes the code in vectorized form, thus fusing all the “small” allocations into chunks of big allocations, which then will play a smaller role in the total computing time. In python + numpy one is forced to write code like that, and the effort to write that code in a non-allocating way is less rewarded, because part of the work of fusing allocations is imposed by the coding style.

8 Likes

The example of idiomatic Julia code in my benchmarks was the third version foo_fusedloop!, which shows that a preallocated buffer is not necessary to implement this particular computation without allocations. The point of the somewhat contorted foo_preallocated! was to obtain the same behavior as numpy with in-place operations so that we could see how the performance of the two numpy-equivalent implementations compare in Julia, while also comparing both to an idiomatic Julia version.

Agreed. Note the clause preceding the statement you quoted: “if your calculation is naturally written as a fused broadcast/plain loop, …”.

1 Like

Overall both, but specific PRs have different effects on these. You can look at benchmarks that include --gcthreads to see the improvements in the latter.

EDIT: These improvements are indeed so amazing that a Julia blog post (on the language page) summarizing them would make sense. I would ping Github user @d-netto, who made a lot of key contributions, here but could not find the corresponding Discourse account.

5 Likes