# Need help on improving performance of a small subfunction

There is a function in my main program, and I have realised that it consumes most of the time in my program. Is it possible to speed up the function given below?

```julia
function fun1(r::Matrix, od::Int64, γ::Float64, e::Float64)
    n, m = size(r)
    p = zero(r)
    if od == 3
        for j in 1:m
            for i in 1:n
                p[i, j] = r[i, j]^3 * γ + exp(-(r[i, j] * e)^2)
            end
        end
    elseif od == 5
        for j in 1:m
            for i in 1:n
                p[i, j] = r[i, j]^5 * γ + exp(-(r[i, j] * e)^2)
            end
        end
    elseif od == 7
        for j in 1:m
            for i in 1:n
                p[i, j] = r[i, j]^7 * γ + exp(-(r[i, j] * e)^2)
            end
        end
    else
        println("enough")
    end
    return p
end
```

Updated code, after tamasgal's answer:

```julia
using Printf
using BenchmarkTools

function fun2(r::Matrix, od::Int64, γ::Float64, e::Float64)
    n, m = size(r)
    p = zero(r)
    for j in 1:m
        for i in 1:n
            p[i, j] = r[i, j]^od * γ + exp(-(r[i, j] * e)^2)
        end
    end
    return p
end
```

I used `@simd` before the for loops and `@inbounds` before `p[i,j]`, but saw no significant difference in speed. On the other hand, using `@fastmath` gives some improvement; however, the results are not always correct. I also used the profiler and `@code_warntype`, but I could not achieve any improvement.
Thank you.
Thank you.

Elapsed time for me:

```
@benchmark fun1(rand(150,150), 7, 2.0, 1.0)
BenchmarkTools.Trial:
  memory estimate:  175.89 KiB
  allocs estimate:  2
  --------------
  minimum time:     1.251 ms (0.00% GC)
  median time:      1.269 ms (0.00% GC)
  mean time:        1.281 ms (0.16% GC)
  maximum time:     2.449 ms (35.99% GC)
  --------------
  samples:          3897
  evals/sample:     1
```

First, you already imported `BenchmarkTools`, so make use of it. With your code you measure the compilation time too when you first call `fun(...)`. The `@btime` macro from `BenchmarkTools` takes care of this and also runs the function multiple times to get more statistics:

```
julia> @btime fun(r, od, γ, e);
  552.992 μs (2 allocations: 175.89 KiB)
```

Second: I don’t understand why you do this weird `if od == ...` instead of simply using `od` as a variable. This function is the same as yours, except that it does not print `"enough"` if you pass in anything other than 3, 5, or 7 for `od`:

```julia
function fun(r::Matrix, od::Int64, γ::Float64, e::Float64)
    n, m = size(r)
    p = zero(r)
    for j in 1:m
        for i in 1:n
            p[i, j] = r[i, j]^od * γ + exp(-(r[i, j] * e)^2)
        end
    end
    return p
end
```

with the same performance:

```
julia> @btime fun(r, od, γ, e);
  551.100 μs (2 allocations: 175.89 KiB)
```

You can simplify it even more by getting rid of the nested loop and using `eachindex()`; however, it’s slightly slower (no idea why):

```julia
function fun(r::Matrix, od::Int64, γ::Float64, e::Float64)
    p = zero(r)
    @simd for i in eachindex(r)
        @inbounds p[i] = r[i]^od * γ + exp(-(r[i] * e)^2)
    end
    return p
end
```
```
julia> @btime fun(r, od, γ, e);
  610.206 μs (2 allocations: 175.89 KiB)
```

Other than that, I don’t see any obvious improvement possibilities, but I am sure others will.

Thank you for your quick answer. I have edited the function according to your suggestion. On my machine the new elapsed time is as follows:

```
BenchmarkTools.Trial:
  memory estimate:  175.89 KiB
  allocs estimate:  2
  --------------
  minimum time:     1.255 ms (0.00% GC)
  median time:      1.269 ms (0.00% GC)
  mean time:        1.281 ms (0.15% GC)
  maximum time:     1.856 ms (0.00% GC)
  --------------
  samples:          3899
  evals/sample:     1
```

The elapsed time for the first version was as follows:

```
BenchmarkTools.Trial:
  memory estimate:  175.89 KiB
  allocs estimate:  2
  --------------
  minimum time:     1.251 ms (0.00% GC)
  median time:      1.270 ms (0.00% GC)
  mean time:        1.281 ms (0.18% GC)
  maximum time:     2.133 ms (37.85% GC)
  --------------
  samples:          3899
  evals/sample:     1
```

If I call the functions with `od = 3`, then `fun1` is faster than `fun2`. Why?

```
@benchmark fun(rand(150,150), 3, 2.0, 1.0)
BenchmarkTools.Trial:
  memory estimate:  175.89 KiB
  allocs estimate:  2
  --------------
  minimum time:     229.494 μs (0.00% GC)
  median time:      235.477 μs (0.00% GC)
  mean time:        240.380 μs (0.87% GC)
  maximum time:     976.972 μs (74.30% GC)
  --------------
  samples:          10000
  evals/sample:     1
```
```
@benchmark fun2(rand(150,150), 3, 2.0, 1.0)
BenchmarkTools.Trial:
  memory estimate:  351.78 KiB
  allocs estimate:  4
  --------------
  minimum time:     1.271 ms (0.00% GC)
  median time:      1.291 ms (0.00% GC)
  mean time:        1.305 ms (0.37% GC)
  maximum time:     2.099 ms (35.34% GC)
  --------------
  samples:          3829
  evals/sample:     1
```

Now you are benchmarking the generation of the random numbers as well; I don’t think that is intentional.
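For example, you can generate the input once and pass it into the benchmark with BenchmarkTools’ `$`-interpolation, so neither the `rand()` call nor global-variable access is timed. A minimal self-contained sketch (`r150` is just a name I chose; the kernel repeats `fun2` from above):

```julia
using BenchmarkTools

# Same kernel as fun2 above, repeated so this snippet is self-contained.
function fun2(r::Matrix, od::Int64, γ::Float64, e::Float64)
    p = zero(r)
    for j in axes(r, 2), i in axes(r, 1)
        p[i, j] = r[i, j]^od * γ + exp(-(r[i, j] * e)^2)
    end
    return p
end

# Generate the input once, outside the benchmark.
r150 = rand(150, 150)

# $-interpolation makes r150 a benchmark-local constant,
# so only fun2 itself is measured.
@benchmark fun2($r150, 7, 2.0, 1.0)
```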


Could you try it with variable interpolation with `$`?

Yes, I did originally; it made no difference. I am really confused why `eachindex()` is slower… I always use it without further checks.

Great use case for LoopVectorization, if your hardware supports AVX:

```julia
using BenchmarkTools, LoopVectorization

function fun2(r::Matrix, od::Int64, γ::Float64, e::Float64)
    n, m = size(r)
    p = zero(r)
    for j in 1:m
        for i in 1:n
            p[i, j] = r[i, j]^od * γ + exp(-(r[i, j] * e)^2)
        end
    end
    return p
end

function fun3(r::Matrix, od::Int64, γ::Float64, e::Float64)
    p = zero(r)
    @avx for i in eachindex(r)
        p[i] = r[i]^od * γ + exp(-(r[i] * e)^2)
    end
    return p
end

r = rand(150,150)
od = 7
γ = 2.0
e = 1.0

@benchmark fun2($r, $od, $γ, $e)

@benchmark fun3($r, $od, $γ, $e)
```

Results:

```
julia> @benchmark fun2($r, $od, $γ, $e)
BenchmarkTools.Trial:
  memory estimate:  175.89 KiB
  allocs estimate:  2
  --------------
  minimum time:     1.572 ms (0.00% GC)
  median time:      1.643 ms (0.00% GC)
  mean time:        1.671 ms (0.34% GC)
  maximum time:     4.655 ms (61.94% GC)
  --------------
  samples:          2984
  evals/sample:     1

julia> @benchmark fun3($r, $od, $γ, $e)
BenchmarkTools.Trial:
  memory estimate:  175.89 KiB
  allocs estimate:  2
  --------------
  minimum time:     237.100 μs (0.00% GC)
  median time:      258.300 μs (0.00% GC)
  mean time:        274.366 μs (1.78% GC)
  maximum time:     3.606 ms (91.37% GC)
  --------------
  samples:          10000
  evals/sample:     1
```

As an aside, you don’t need to specify the exact types in the function signature. For example, you pass `γ = 2.0`, which you could just as well pass as the integer `2` instead. To allow both cases, relax the signature from:

```julia
fun(r::Matrix, od::Int64, γ::Float64, e::Float64)
```

to

```julia
fun(r::Matrix, od::Integer, γ::Real, e::Real)
```

This would allow you to pass e.g. `BigFloat` for increased precision (though for that to work you would also need to tweak how you allocate the `p` output matrix).
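A minimal sketch of such a generic version (the `promote_type`-based allocation and the name `fun_generic` are my own illustration, not from the thread):

```julia
function fun_generic(r::AbstractMatrix, od::Integer, γ::Real, e::Real)
    # Allocate the output in the element type the arithmetic produces,
    # so e.g. BigFloat inputs yield a BigFloat result.
    T = promote_type(eltype(r), typeof(γ), typeof(e))
    p = similar(r, T)
    for j in axes(r, 2), i in axes(r, 1)
        p[i, j] = r[i, j]^od * γ + exp(-(r[i, j] * e)^2)
    end
    return p
end

fun_generic(rand(3, 3), 3, 2, 1.0)          # integer γ works
fun_generic(big.(rand(3, 3)), 3, 2.0, 1.0)  # BigFloat precision works
```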


It’s possible to also use `@threads` with `@avx`, though I saw no speedup for these array sizes.

You can shave off a few percent, however, by writing `p = similar(r)` instead of `p = zero(r)`.
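In a minimal sketch, the difference between the two allocations is:

```julia
r = rand(150, 150)
p1 = zero(r)     # allocates and zero-fills; the fill is wasted work here,
                 # since the loop overwrites every element anyway
p2 = similar(r)  # allocates uninitialized memory of the same size and type
```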

Here is one compact way to add multi-threading which seems to help quite a bit:

```julia
r150 = rand(150,150); # same functions as above:
@btime fun1($r150, 7, 2.0, 1.0); # 488.889 μs (2 allocations: 175.89 KiB)
@btime fun2($r150, 7, 2.0, 1.0); # 491.793 μs (2 allocations: 175.89 KiB)

using Einsum
# the same loops as fun2, just different notation:
fun_ein(r, od, γ, e) = @einsum p[i,j] := γ * r[i,j]^od + exp(-(r[i,j]*e)^2)
fun_viel(r, od, γ, e) = @vielsum p[i,j] := γ * r[i,j]^od + exp(-(r[i,j]*e)^2)

fun_ein(r150, 7, 2.0, 1.0) ≈ fun1(r150, 7, 2.0, 1.0) # true

@btime fun_ein($r150, 7, 2.0, 1.0);  # 478.953 μs (2 allocations: 175.89 KiB)
@btime fun_viel($r150, 7, 2.0, 1.0); #  93.681 μs (65 allocations: 183.95 KiB)
```

This is on a 6-core machine, YMMV.

I don’t see as much speedup from LoopVectorization as @jebej did. But trying out an (un-registered, still-unreliable) package which uses that:

```julia
using LoopVectorization, Tullio
@btime fun3($r150, 7, 2.0, 1.0); # 261.102 μs (2 allocations: 175.89 KiB)

# this also uses @avx:
fun_tul(r, od, γ, e) = @tullio p[i,j] := $γ * r[i,j]^($od) + exp(-(r[i,j]*$e)^2)
# lower the threshold so that it @spawn's threads:
fun_tul2(r, od, γ, e) = @tullio p[i,j] := $γ * r[i,j]^($od) + exp(-(r[i,j]*$e)^2) threads=2^12

@btime fun_tul($r150, 7, 2.0, 1.0);  # 256.528 μs (8 allocations: 176.08 KiB)
@btime fun_tul2($r150, 7, 2.0, 1.0); #  62.490 μs (168 allocations: 188.83 KiB)
```

Another thought is that you can easily write this with broadcasting. The basic broadcast version is the same speed as `fun1`; with packages it gets faster:

```julia
using LoopVectorization, Strided  # Strided provides @strided

fun_avx(r, od, γ, e) = @avx @. γ * r^od + exp(-(r*e)^2)

fun_str(r, od, γ, e) = @strided γ .* r.^od .+ exp.(.-(r.*e).^2)

@btime fun_avx($r150, 7, 2.0, 1.0); # 258.900 μs (2 allocations: 175.89 KiB)
@btime fun_str($r150, 7, 2.0, 1.0); #  99.693 μs (91 allocations: 187.45 KiB)
```

And finally, here’s how to get SIMD to work without loading packages:

```julia
function fun4(r::Matrix, od::Int64, γ::Float64, e::Float64)
    p = similar(r)
    @fastmath @simd for i in eachindex(r)
        @inbounds p[i] = r[i]^od * γ + exp(-(r[i] * e)^2)
    end
    return p
end
@btime fun4($r150, 7, 2.0, 1.0); # 198.393 μs (2 allocations: 175.89 KiB)
```

My machine also has 6 cores. I do not understand why, but Einsum is not providing a speedup for me. Here are the results:

```
@benchmark fun_viel($r, 7, 2.0, 1.0)
BenchmarkTools.Trial:
  memory estimate:  175.95 KiB
  allocs estimate:  3
  --------------
  minimum time:     1.294 ms (0.00% GC)
  median time:      1.298 ms (0.00% GC)
  mean time:        1.325 ms (0.54% GC)
  maximum time:     2.559 ms (43.18% GC)
  --------------
  samples:          3769
  evals/sample:     1
```

LoopVectorization worked for me. Thank you.

What does `Threads.nthreads()` return? For multi-threading you have to start Julia with the `JULIA_NUM_THREADS` environment variable set; see the manual.
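For instance (a minimal sketch; the shell line assumes a Unix shell):

```julia
# Start Julia with e.g. `JULIA_NUM_THREADS=6 julia`, then check:
Threads.nthreads()  # number of threads this session was started with;
                    # 1 means the threaded variants cannot be faster
```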

This is probably because of `Base.literal_pow`, which turns the literal power `x^3` into `x*x*x` instead of calling the generic `x^p`. For example:

```julia
p3(x) = x^3
@code_lowered p3(2.0)
@code_llvm p3(2.0)

pn(x, n) = x^n
@code_lowered pn(2.0, 3)
@code_llvm pn(2.0, 3)
```
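As a side note, one way to keep a single function while recovering that specialization is dispatching on `Val`, so the compiler sees the exponent as a compile-time constant; this `fun_val` sketch is my own illustration, not from the thread:

```julia
# Passing Val(od) moves the exponent into the type domain, so the compiler
# specializes the whole loop body on the exponent value, similar in effect
# to the hard-coded branches in fun1.
function fun_val(r::Matrix, ::Val{od}, γ::Float64, e::Float64) where {od}
    p = similar(r)
    for j in axes(r, 2), i in axes(r, 1)
        p[i, j] = r[i, j]^od * γ + exp(-(r[i, j] * e)^2)
    end
    return p
end

fun_val(rand(150, 150), Val(3), 2.0, 1.0)  # specialized on exponent 3
```

The trade-off: each distinct `od` triggers a fresh compilation, which is fine when only a few exponents (3, 5, 7) are ever used.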

Thank you for pointing out `Base.literal_pow`.
I use the Atom IDE and I set the number of threads to 6 in the settings; it now returns 6. But the function still runs slower, in both Atom and the Julia terminal.

I don’t know, that’s weird. Do other threaded things work? Is the threaded one still slower for say `rand(2000,2000)` instead? Perhaps in other versions of Julia?

I tried with `rand(2000,2000)`. Here are the timings:

```
@benchmark fun_str($r, $od, $γ, $e)
BenchmarkTools.Trial:
  memory estimate:  30.52 MiB
  allocs estimate:  57
  --------------
  minimum time:     40.636 ms (0.00% GC)
  median time:      41.096 ms (0.00% GC)
  mean time:        41.350 ms (0.49% GC)
  maximum time:     47.168 ms (0.00% GC)
  --------------
  samples:          121
  evals/sample:     1
```
```
@benchmark fun3($r, $od, $γ, $e)
BenchmarkTools.Trial:
  memory estimate:  30.52 MiB
  allocs estimate:  2
  --------------
  minimum time:     46.655 ms (0.00% GC)
  median time:      47.504 ms (0.00% GC)
  mean time:        47.680 ms (0.50% GC)
  maximum time:     49.086 ms (0.00% GC)
  --------------
  samples:          105
  evals/sample:     1
```
```
@benchmark fun_viel($r, $od, $γ, $e)
BenchmarkTools.Trial:
  memory estimate:  30.52 MiB
  allocs estimate:  35
  --------------
  minimum time:     41.211 ms (0.00% GC)
  median time:      41.974 ms (0.00% GC)
  mean time:        42.253 ms (0.50% GC)
  maximum time:     47.335 ms (0.00% GC)
  --------------
  samples:          119
  evals/sample:     1
```

For large matrices, threaded looks better.

Infinitesimally better; you’d expect close to 6×. I have no idea why it’s not working; something is happening, though, since you are seeing a few more allocations. Does this print different numbers?

```julia
Threads.@threads for i in 1:10
    @show Threads.threadid()
end
```

Yes,

```
Threads.threadid() = 2
```

Does the fastmath version of `exp` use SIMD? If so, that’s great to know.