Question about performance when iterating

jokasimr · March 18, 2020, 7:23pm

Hi!
Really like julia and want to learn how to debug performance better.

I’m doing a project on iterated maps, and started writing in python and then decided to try julia. But when benchmarking, the python version ran faster, but I can’t figure out why.

The question about what to vectorize, when to use dot notation etc in julia is not entirely clear to me, if you have any general advice about this I’d be happy to receive it.

This is the vectorized python code

import numpy as np

def henon(x, a=1.4, b=0.3):
    x = np.array(x)
    x, y = x[:, 0], x[:, 1]
    return np.stack([1-a*x**2+y, b*x], -1)

# in IPython REPL

# 10 000 dummy starting points
y = 0.34 * np.ones((10000, 2))                                                                   

# %%time 
for _ in range(10000): 
    y = henon(y) 
                                                                                            
#CPU times: user 731 ms, sys: 0 ns, total: 731 ms
#Wall time: 731 ms

And here is the Julia version, one loop, and one vectorized

# in the REPL
henon(x;a=1.4,b=0.3) = [1 - a*x[1]^2 - x[2], b*x[1]]
vectorized_henon(x;a=1.4,b=0.3) = cat(1 .- a.*x[:,1].^2 .- x[:,2], b.*x[:,1], dims=2)

iter(f, x, k) = (for _=1:k x=f(x) end; x)

# 10 000 dummy points, an array of arrays, and as a 2d array
y_vec = [ones(2)*0.34 for _=1:10000]
y_mat = ones(10000,2)*0.34

# for compilation
iter.(henon, y_vec, 10000)
iter(vectorized_henon, y_mat, 10000)

@time iter.(henon, y_vec, 10000)
# 5.148718 seconds (100.00 M allocations: 8.941 GiB, 30.70% gc time)
@time iter(vectorized_henon, y_mat, 10000)
# 1.236298 seconds (410.00 k allocations: 5.231 GiB, 12.20% gc time)

I’m sure I’m missing something basic here, but please tell me how to make this code more efficient.

Best wishes
Johannes

Mason · March 18, 2020, 7:53pm

The biggest inefficiency that I see here right off the bat is that you’re creating a bunch of Vectors with only two arguments which wastes a lot of time allocating memory. If instead, you use Tuples, then henon will outperform vectorized_henon:

henon(x;a=1.4,b=0.3) = [1 - a*x[1]^2 - x[2], b*x[1]]
henon2(x;a=1.4,b=0.3) = (1 - a*x[1]^2 - x[2], b*x[1])
vectorized_henon(x;a=1.4,b=0.3) = cat(1 .- a.*x[:,1].^2 .- x[:,2], b.*x[:,1], dims=2)


iter(f, x, k) = (for _=1:k x=f(x) end; x)

y_vec = [ones(2)*0.34 for _=1:10000]
y_mat = ones(10000,2)*0.34

#compilation
iter.(henon, y_vec, 10000)
iter.(henon2, y_vec, 10000)
iter(vectorized_henon, y_mat, 10000);

@time iter.(henon, y_vec, 10000)
#   4.501728 seconds (100.00 M allocations: 8.941 GiB, 22.01% gc time)
@time iter.(henon2, y_vec, 10000)
#   0.446856 seconds (11 allocations: 156.539 KiB)
@time iter(vectorized_henon, y_mat, 10000);
#   3.293624 seconds (410.00 k allocations: 5.231 GiB, 18.90% gc time)

You can get fancier with this using something like HybridArrays.jl, but this is a good start.

rdeits · March 18, 2020, 7:59pm

The first thing I would recommend is to simplify your benchmarking example. There’s no need to iterate manually over a function 10000 times, because we have GitHub - JuliaCI/BenchmarkTools.jl: A benchmarking framework for the Julia language to do that for us. And I’ll assert that there’s rarely any need to manually vectorize code like vectorized_henon in Julia, so I’ll leave that aside for now. Let’s time your fundamental operation:

julia> henon(x;a=1.4,b=0.3) = [1 - a*x[1]^2 - x[2], b*x[1]]
henon (generic function with 1 method)

julia> using BenchmarkTools

julia> @btime henon(x) setup=(x = ones(2) * 0.34)
  22.259 ns (1 allocation: 96 bytes)
2-element Array{Float64,1}:
 0.49816            
 0.10200000000000001

Your iter loop just consists of doing that operation 10000 * 10000 times, but we’ve already gotten a pretty good performance estimate of 22ns per iteration without doing any of that:

julia> 22e-9 * 10000 * 10000
2.1999999999999997

which is close enough to the 1.23s you saw (and might be different because I’m on a different machine).

Anyway, let’s speed this up. As is so often the case, you’re repeatedly constructing small, fixed-size arrays. This is a perfect use case for https://github.com/JuliaArrays/StaticArrays.jl

julia> using StaticArrays

julia> henon_static(x;a=1.4,b=0.3) = SVector(1 - a*x[1]^2 - x[2], b*x[1])
henon_static (generic function with 1 method)

julia> @btime henon_static(x) setup=(x = SVector(0.34, 0.34))
  0.017 ns (0 allocations: 0 bytes)
2-element SArray{Tuple{2},Float64,1,2} with indices SOneTo(2):
 0.49816            
 0.10200000000000001

Whoops, that’s too fast to be believable. Looks like the compiler has been able to optimize this function down to a constant answer. We can add a Ref to force the compiler to actually do the computation:

julia> @btime henon_static(x[]) setup=(x = Ref(SVector(0.34, 0.34)))
  1.733 ns (0 allocations: 0 bytes)
2-element SArray{Tuple{2},Float64,1,2} with indices SOneTo(2):
 0.49816            
 0.10200000000000001

That’s more like it. 1-2ns is about what I would expect for such a simple operation. And we’re now more than 10x faster than your original code, and we’re allocating zero memory. Hooray for StaticArrays!

jokasimr · March 18, 2020, 9:53pm

Thank you both! This was very helpful. You both provided solutions to my problem.

Topic		Replies	Views
Performance difference between two code New to Julia	3	778	January 18, 2017
General questions from Python user Performance	59	4238	March 8, 2021
Julia Performance - Help Needed Performance question , python	40	2853	September 17, 2021
Question on simple performance comparison between Python and Julia General Usage question	23	977	June 13, 2023
Loop vs vectorization Performance	4	1239	October 10, 2023

Question about performance when iterating

Related topics