How to improve performance of sum()

Hello. I recently tried out Julia because I was hoping for a speed improvement over numpy.sum(). I’m trying to benchmark a sum for a 2D integral, and so far my Julia code is much slower. I’m wondering if there are any performance tips I missed. My Julia benchmark code is this:

function testfunction(my_arr, other_arr, sec_arr, third_arr)
    for i in 1:length(my_arr)
        new_array[i] = sum(@. 1 / (my_arr[i] * 1im + sec_arr + other_arr[i,1]) * third_arr)
    end
    return new_array
end

which I feed arrays that emulate my use case (the arithmetic operations used to define the arrays are arbitrary; it is just the lengths of the arrays and their approximate ranges that are important):

my_arr = collect(LinRange(0,10,2^15))
other_arr = Array{Float64}(undef, length(my_arr), 2)
for i in 1:2
    other_arr[:,i] = my_arr .* 2 .+ (5.3)
end
sec_arr = collect(LinRange(-2,2,600))
third_arr = exp.(sec_arr .+ 2) .* 3
new_array = Array{ComplexF64}(undef, length(my_arr))
@time testfunction(my_arr,other_arr,sec_arr,third_arr)

I get 0.628377 seconds (97.79 k allocations: 305.492 MiB, 3.70% gc time)

In Python, I tried both a looped and a vectorized method:

import numpy as np
import time
def loopcompute(my_arr, other_arr,sec_arr,third_arr):
    new_array = np.empty(len(my_arr), dtype = 'complex128')
    for i in range(len(my_arr)):
        new_array[i] = np.sum((my_arr[i]*1j + sec_arr + other_arr[i,0])**(-1) * third_arr)
    return new_array

def vectcompute(my_arr, other_arr,sec_arr,third_arr):
    new_array = np.sum((my_arr[:,None]*1j + sec_arr + other_arr[:,0][:,None])**(-1)*third_arr, axis = 1)
    return new_array

and here is my test:

my_arr = np.linspace(0,10,2**15)
other_arr = np.empty([len(my_arr),2])
for i in range(2):
    other_arr[:,i] = my_arr*2+5.3
sec_arr = np.linspace(-2,2,600)
third_arr = np.exp(sec_arr+2)*3
t11 = time.time()
new_arr = vectcompute(my_arr, other_arr,sec_arr,third_arr)
t12 = time.time()
vectortime = t12 - t11

t21 = time.time()
new_arr2 = loopcompute(my_arr, other_arr,sec_arr,third_arr)
t22 = time.time()
looptime = (t22 - t21)
print(vectortime, looptime, vectortime / looptime)

gives 0.3255 and 0.4755 seconds respectively, i.e. the vectorized numpy version is roughly 2× faster and the looped one roughly 1.3× faster than my Julia code. (The Python advantage does shrink if the second/third array is much longer, > 1000 elements.) I’m hoping to get this down to a tenth of a second, or else I’ll need to switch to C++. Is there any advice for optimizing the Julia script?


I’ll defer to the LoopVectorization/Tullio brigade to get this one down to 10ns or something, but one obvious thing is to pass new_array to avoid the function using a global, which gives me about a 20% speedup compared to your version of the function.


Yes, this was just sloppiness on my part, thanks. Sorry though, what is Tullio?

Using @fastmath seems to speed up the execution by ~5-6 times, producing the same numerical results in this case:

function testfunction!(new_array, my_arr,other_arr,sec_arr,third_arr)
    for i in 1:length(new_array)
        new_array[i] = @fastmath sum(@. 1.0 / (my_arr[i] * 1im + sec_arr + other_arr[i,1] ) * third_arr)
    end
end

my_arr = collect(LinRange(0,10,2^15))
other_arr = Array{Float64}(undef, length(my_arr), 2)
for i in 1:2
    other_arr[:,i] = my_arr .* 2 .+ 5.3
end
sec_arr = collect(LinRange(-2,2,600))
third_arr = exp.(sec_arr .+ 2) .* 3
new_array = Array{ComplexF64}(undef, length(my_arr))

@time testfunction!(new_array, my_arr,other_arr,sec_arr,third_arr)
  0.049 seconds (32.78 k allocations: 304 MiB, 37.69% gc time)

Tullio is a very flexible einsum macro. It understands many array operations written in index notation – not just matrix multiplication and permutations, but also convolutions, stencils, scatter/gather, and broadcasting.


Tullio example:

using Tullio, LoopVectorization

function testfunction_tullio!(new_array, my_arr, other_arr, sec_arr, third_arr)
    @tullio new_array[i] = 1 / (my_arr[i] * 1im + sec_arr[j] + other_arr[i,1]) * third_arr[j]
end
julia> @btime testfunction!($new_array, $my_arr, $other_arr, $sec_arr, $third_arr);
  970.667 ms (32768 allocations: 304.00 MiB)

julia> @btime testfunction_tullio!($new_array, $my_arr, $other_arr, $sec_arr, $third_arr);
  1.411 ms (243 allocations: 12.95 KiB)

Stillyslalom, thanks! Do you mind explaining to me what on earth is going on? Why is Tullio nearly 700 times faster?

I don’t think it’s a great idea to recommend @fastmath here. Just because the results are the same in this case doesn’t mean they are in general. Plus, numpy is not using fast math. This is a recipe for errors and a bad experience for new Julia users.


@fastmath can indeed be dangerous in that it’ll silently produce incorrect results if your code violates its requirements, so it’s essential to check against a non-@fastmath version to ensure that the output is accurate enough for your needs. In this case, though, it seems to be necessary for effective vectorization.
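
A quick way to run that check, as a sketch - testfunction_plain! and testfunction_fm! are placeholder names for the two variants being compared:

ref  = similar(new_array)
fast = similar(new_array)
testfunction_plain!(ref, my_arr, other_arr, sec_arr, third_arr)   # without @fastmath
testfunction_fm!(fast, my_arr, other_arr, sec_arr, third_arr)     # with @fastmath
@assert ref ≈ fast  # norm-based approximate equality; tighten the tolerance to suit your needs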

We can get close to Tullio’s performance using only Base Julia with a few modifications to your function. First, rearrange the sum to avoid allocating an intermediate array:

function testfunction!(new_array, my_arr, other_arr, sec_arr, third_arr)
    for i in 1:length(my_arr)
        new_array[i] = sum(1 / (my_arr[i] * 1im + sec_arr[j] + other_arr[i,1]) * third_arr[j] for j = 1:length(sec_arr))
    end
    return new_array
end

That brings us down to 816.438 ms (0 allocations: 0 bytes) - note the lack of memory allocations! Removing all allocations from the inner loop, whether by providing reusable scratch-space arrays or by switching from sum(@. calculate_an_array) to sum(f(i) for i in eachindex(array)), is often the first step in getting high-performance code.
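
To illustrate the scratch-space variant as well - a minimal sketch (the function name and the scratch buffer are my own) that broadcasts into a preallocated buffer in place, so the only allocation happens once, up front:

# Sketch: reuse one preallocated buffer instead of allocating per iteration.
function testfunction_scratch!(new_array, scratch, my_arr, other_arr, sec_arr, third_arr)
    for i in 1:length(my_arr)
        @. scratch = 1 / (my_arr[i] * 1im + sec_arr + other_arr[i,1]) * third_arr  # in-place broadcast
        new_array[i] = sum(scratch)
    end
    return new_array
end

scratch = similar(sec_arr, ComplexF64)  # allocated once, reused for every i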

We’ve verified that the above function gives correct results and doesn’t access out-of-bounds array indices, so we’ll add an @inbounds annotation to disable bounds-checking and Threads.@threads to allow multithreading over the loop.

function testfunction!(new_array, my_arr, other_arr, sec_arr, third_arr)
    @inbounds Threads.@threads for i in 1:length(my_arr)
        new_array[i] = sum(1 / (my_arr[i] * 1im + sec_arr[j] + other_arr[i,1]) * third_arr[j] for j = 1:length(sec_arr))
    end
    return new_array
end

We’re down to 52.825 ms (81 allocations: 8.23 KiB) - not too shabby! Threads.@threads introduces allocations, but the multithreading speedup is worth it. As a last effort, adding @fastmath:

function testfunction!(new_array, my_arr, other_arr, sec_arr, third_arr)
    @inbounds @fastmath Threads.@threads for i in 1:length(my_arr)
        new_array[i] = sum(1 / (my_arr[i] * 1im + sec_arr[j] + other_arr[i,1]) * third_arr[j] for j = 1:length(sec_arr))
    end
    return new_array
end

That gives us 5.283 ms (81 allocations: 8.23 KiB), which is maybe the biggest speedup I’ve ever seen from @fastmath.


Does most of the speedup come from multithreading in Tullio as well? Ideally I’m going to be running this as a subroutine of a function that’s already parallelized, so I need to be careful about how many threads I’m actually using, and about whether I can get any advantage on a single thread.

You can disable threading with @tullio threads=false - it’s still plenty fast on a single thread: 9.576 ms (0 allocations: 0 bytes) on my 8-core machine.
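
For reference, here’s where the option goes - a sketch reusing the same kernel, with a hypothetical single-threaded name:

function testfunction_tullio_st!(new_array, my_arr, other_arr, sec_arr, third_arr)
    # threads=false keeps Tullio on one thread; everything else is unchanged
    @tullio threads=false new_array[i] = 1 / (my_arr[i] * 1im + sec_arr[j] + other_arr[i,1]) * third_arr[j]
end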


There’s usually no need to disable threading in code that uses @threads. This method of multithreading is cooperative; as far as I know, only the outermost invocation gets threaded.

This 100× speedup on a single thread comes from LoopVectorization in Tullio, then? Why is a for loop so costly in general? Is it the intermediate data allocation you described here, or is it just the referencing?

Julia’s for loops are generally fast - the original loop is just unusually slow. I suspect it’s the mixed Float64/ComplexF64 inputs. Converting all the inputs to ComplexF64 brings the runtime for the simplest version in my earlier post from 816.438 ms (0 allocations: 0 bytes) to 239.854 ms (0 allocations: 0 bytes), even though more computations are performed. This seems like a case where Julia’s codegen is falling short - the @fastmath macro may have encouraged the compiler to use the complex reciprocal identity 1 / (x + \hat{\imath}y) = (x - \hat{\imath}y)/(x^2 + y^2). Rewriting the inner loop to take advantage of the identity, we get

function testfunction!(new_array, my_arr, other_arr, sec_arr, third_arr)
    @inbounds for i in 1:length(my_arr)
        v = zero(eltype(new_array))
        for j in 1:length(sec_arr)
            x = sec_arr[j] + other_arr[i,1]  # real part of the denominator
            y = my_arr[i]                    # imaginary part of the denominator
            # manual complex reciprocal: 1/(x + im*y) = (x - im*y)/(x^2 + y^2)
            v += third_arr[j] * (x - im * y) / (x^2 + y^2)
        end
        new_array[i] = v
    end
    return new_array
end

which gives us 24.651 ms (0 allocations: 0 bytes). Tullio/LV.jl’s better use of SIMD intrinsics probably accounts for the rest of the difference.

**Edit:** adding a @simd annotation on the inner loop brings this version down to 11.770 ms.
**Edit 2:** adding a @fastmath annotation brings this down even further to 6.296 ms on a single thread.


@fastmath is actually pretty safe since it is locally scoped. As long as you aren’t doing tricky stuff that relies on IEEE details (e.g. compensated summation), and don’t mind losing a few ulps of accuracy, it is perfectly safe and can give a good performance boost.
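
To illustrate the compensated-summation caveat - a minimal sketch of Kahan summation, which is exactly the kind of IEEE-dependent trick @fastmath can silently break:

# Kahan summation: `c` recovers the rounding error of each addition.
# Under @fastmath the compiler may treat (t - s) - y as algebraically zero
# and delete it, silently turning this back into a naive sum.
function kahan_sum(xs)
    s = zero(eltype(xs))
    c = zero(eltype(xs))
    for x in xs
        y = x - c
        t = s + y
        c = (t - s) - y
        s = t
    end
    return s
end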


This is about twice as fast as @fastmath @inbounds on the outer loop + a redundant @simd on the inner loop:

julia> function testfunction_lv!(new_array::AbstractVector{Complex{T}}, my_arr,other_arr,sec_arr,third_arr) where {T}
           new_mat = reinterpret(reshape, T, new_array)
           @turbo for i in 1:length(my_arr)
               v_re = zero(T)
               v_im = zero(T)
               for j in 1:length(sec_arr)
                   x = sec_arr[j] + other_arr[i,1]
                   y = my_arr[i]
                   s = third_arr[j] / (x^2 + y^2)
                   v_re += s*x
                   v_im -= s*y
               end
               new_mat[1,i] = v_re
               new_mat[2,i] = v_im
           end
           return new_array
       end
testfunction_lv! (generic function with 1 method)

julia> function testfunction!(new_array, my_arr,other_arr,sec_arr,third_arr)
           @fastmath @inbounds for i in 1:length(my_arr)
               v = zero(eltype(new_array))
               @simd for j in 1:length(sec_arr)
                   x = sec_arr[j] + other_arr[i,1]
                   y = my_arr[i]
                   v += third_arr[j] * (x - im * y) / (x^2 + y^2)
               end
               new_array[i] = v
           end
           return new_array
       end
testfunction! (generic function with 1 method)

julia> my_arr = collect(LinRange(0,10,2^15));

julia> other_arr = Array{Float64}(undef, length(my_arr), 2);

julia> for i in 1:2
           other_arr[:,i] = my_arr .* 2 .+ (5.3)
       end;

julia> sec_arr = collect(LinRange(-2,2,600));

julia> third_arr = exp.(sec_arr .+ 2) .* 3;

julia> new_array = Array{ComplexF64}(undef, length(my_arr));

Benchmark results:

julia> @btime testfunction!($new_array, $my_arr, $other_arr, $sec_arr, $third_arr);
  11.438 ms (0 allocations: 0 bytes)

julia> @btime testfunction_lv!($new_array, $my_arr, $other_arr, $sec_arr, $third_arr);
  5.090 ms (0 allocations: 0 bytes)

Note that LoopVectorization doesn’t support complex numbers at the moment, so we have to reinterpret the array for @turbo to work. (If we don’t reinterpret, it’ll silently fall back to @inbounds @fastmath instead.)
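
For reference, a small REPL sketch of what that reinterpret does - the real parts land in row 1 and the imaginary parts in row 2, which is why the loop writes new_mat[1,i] and new_mat[2,i]:

julia> z = ComplexF64[1+2im, 3+4im];

julia> reinterpret(reshape, Float64, z)
2×2 reinterpret(reshape, Float64, ::Vector{ComplexF64}) with eltype Float64:
 1.0  3.0
 2.0  4.0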


Without @fastmath, complex division is much more expensive because of overflow checks and similar corner cases. For example, the complex reciprocal “identity” you cited will overflow in the denominator if x = 1e200 (giving a division by Inf, resulting in 0, or even NaN if y is Inf).

So, with @fastmath, you aren’t just asserting that you are insensitive to slightly larger floating-point errors or to changes in the associativity of operations. You are also asserting that your numbers aren’t too big or too small, and/or that you aren’t concerned with correct handling of over/underflow situations and Inf/NaN values. That is often true in user code, though library authors need to be more careful, since they don’t know the context in which their library will be used.


x and y (and hence the denominator) are both Float64 here. Only the numerator is complex.
So the definition should simply be:

/(z::Complex, x::Real) = Complex(real(z)/x, imag(z)/x)

However, @fastmath allows transforming this:

a = Complex(real(a) + real(z)/x, imag(a) - imag(z)/x)

into

invx = inv(x)
# fma(a, b, c) = a*b + c; fnma(a, b, c) = -a*b + c. fnma isn’t a Base function -
# it stands for the fused negate-multiply-add instruction the compiler emits.
a = Complex(fma(real(z), invx, real(a)), fnma(imag(z), invx, imag(a)))

The division is by far the slowest operation on some computers (including mine), so cutting the number of divisions in half can lead to substantial speedups.
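
A quick way to see that gap on your own machine - a rough sketch using BenchmarkTools:

using BenchmarkTools

x, v = rand(), rand()
invx = inv(x)
@btime $v / $x     # one hardware divide per evaluation
@btime $v * $invx  # the divide replaced by a multiply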


I was talking about the transformation from the original code 1 / (my_arr[i] * 1im + sec_arr + other_arr[i,1]) * third_arr to the “simplified” version third_arr[j] * (x - im * y) / (x^2 + y^2). The new code gives the wrong answer (it loses all the significant digits) if x or y are so big that x^2 + y^2 overflows:

julia> x,y = 1e200, 1
(1.0e200, 1)

julia> 1/(x+im*y)
1.0e-200 - 0.0im

julia> (x - im * y) / (x^2 + y^2)
0.0 - 0.0im

julia> @fastmath 1/(x+im*y)
0.0 - 0.0im

This is perfectly fine if you are using it in a context where you know this will never happen because the magnitudes don’t get too big (\gtrsim 10^{150}) or too small (\lesssim 10^{-150}), or if you don’t care if it gets the wrong answer in such cases.


Ah, thanks for pointing that out, and for the explanation.
That transform was also implemented manually in some of the later versions.
