Need help on improving performance of a small subfunction

O_Julia · May 3, 2020, 2:32pm

There is a function in my main program and I have realised that this function consumes most of time in my program. Is it possible to speed up this mentioned function given below

function fun1(r::Matrix,od::Int64,γ::Float64,e::Float64)

    n = size(r)[1]; p = zero(r);
    if od == 3
         for j in 1:n
            for i in 1:n
                 p[i,j] =  r[i,j]^3*γ + exp(-(r[i,j]*e)^2)
            end
        end
    elseif od == 5
        for j in 1:n
           for i in 1:n
                 p[i,j] =  r[i,j]^5*γ + exp(-(r[i,j]*e)^2)
            end
        end
    elseif od == 7
        for j in 1:n
           for i in 1:n
                 p[i,j] =  r[i,j]^7*γ + exp(-(r[i,j]*e)^2)
            end
        end
    else
        println("enough")
    end
    return p
end

updated codes, after answer of tamasgal

using Printf
using BenchmarkTools

function fun2(r::Matrix, od::Int64, γ::Float64, e::Float64)
    n = size(r)[1]; p = zero(r);
    for j in 1:n
        for i in 1:n
             p[i,j] =  r[i,j]^od*γ + exp(-(r[i,j]*e)^2)
        end
    end
    return p
end

I used @simd before for loops and @inbounds before p[i,j], but no significant difference in speed. On the other using @fastmath gives some improvement however the results are not correct always. Also used Profiler and @code_warntype but I could not achive any improvement.
Thank you.

Elapsed time for me:

@benchmark  fun1(rand(150,150),7,2.0,1.0)
BenchmarkTools.Trial: 
  memory estimate:  175.89 KiB
  allocs estimate:  2
  --------------
  minimum time:     1.251 ms (0.00% GC)
  median time:      1.269 ms (0.00% GC)
  mean time:        1.281 ms (0.16% GC)
  maximum time:     2.449 ms (35.99% GC)
  --------------
  samples:          3897
  evals/sample:     1

tamasgal · May 3, 2020, 2:53pm

First, you already imported BenchmarkTools, so make use of it. With your code you measure the compilation time too when you first call fun(...). The @btime macro from BenchmarkTools takes care of this and also runs the function multiple times to get more statistics:

julia> @btime fun(r,od,γ,e);
552.992 μs (2 allocations: 175.89 KiB)

Second: I don’t understand why you do this weird if od == ... instead of simply passing od as a variable. This function is the same as yours, except that it does not print "enough" if you pass in anything else than 3, 5, 7 or 9 for od:

function fun(r::Matrix, od::Int64, γ::Float64, e::Float64)
    n = size(r)[1]; p = zero(r);
    for j in 1:n
        for i in 1:n
             p[i,j] =  r[i,j]^od*γ + exp(-(r[i,j]*e)^2)
        end
    end
    return p
end

with the same performance:

julia> @btime fun(r,od,γ,e);
551.100 μs (2 allocations: 175.89 KiB)

You can simplify it even more by getting rid of the nested loop and use eachindex(), however, it’s slightly slower (no idea why):

function fun(r::Matrix, od::Int64, γ::Float64, e::Float64)
    n = size(r)[1]; p = zero(r);
    @simd for i in eachindex(r)
         @inbounds p[i] = r[i]^od*γ + exp(-(r[i]*e)^2)
    end
    return p
end

julia> @btime fun(r, od, γ, e);
610.206 μs (2 allocations: 175.89 KiB)

Other than that, I don’t see any obvious improvement possibilities, but I am sure others will

O_Julia · May 3, 2020, 3:11pm

Thank you for your quick answer. I have edit the function according to your suggestion. In my machine new elapsed time is as follows:
BenchmarkTools.Trial:

  memory estimate:  175.89 KiB
  allocs estimate:  2
  --------------
  minimum time:     1.255 ms (0.00% GC)
  median time:      1.269 ms (0.00% GC)
  mean time:        1.281 ms (0.15% GC)
  maximum time:     1.856 ms (0.00% GC)
  --------------
  samples:          3899
  evals/sample:     1

The elapsed time for the first version was as follows:

BenchmarkTools.Trial: 
  memory estimate:  175.89 KiB
  allocs estimate:  2
  --------------
  minimum time:     1.251 ms (0.00% GC)
  median time:      1.270 ms (0.00% GC)
  mean time:        1.281 ms (0.18% GC)
  maximum time:     2.133 ms (37.85% GC)
  --------------
  samples:          3899
  evals/sample:     1

If I call the functions with od=3, then fun1 is faster than fun2. Why?

@benchmark  fun(rand(150,150),3,2.0,1.0)
BenchmarkTools.Trial: 
  memory estimate:  175.89 KiB
  allocs estimate:  2
  --------------
  minimum time:     229.494 μs (0.00% GC)
  median time:      235.477 μs (0.00% GC)
  mean time:        240.380 μs (0.87% GC)
  maximum time:     976.972 μs (74.30% GC)
  --------------
  samples:          10000
  evals/sample:     1

@benchmark  fun2(rand(150,150),3,2.0,1.0)
BenchmarkTools.Trial: 
  memory estimate:  351.78 KiB
  allocs estimate:  4
  --------------
  minimum time:     1.271 ms (0.00% GC)
  median time:      1.291 ms (0.00% GC)
  mean time:        1.305 ms (0.37% GC)
  maximum time:     2.099 ms (35.34% GC)
  --------------
  samples:          3829
  evals/sample:     1

DNF · May 3, 2020, 5:43pm

Now you are benchmarking the generation of random numbers as well, I don’t think that is intentional.

DNF · May 3, 2020, 5:44pm

Could you try it with variable interpolation with $?

tamasgal · May 3, 2020, 5:48pm

Yes I did originally, it made no difference. I am really confused why eachindex() is slower… I am always using it without further checks.

jebej · May 3, 2020, 5:52pm

Great use-case for LooVectorization, if your hardware supports AVX:

using BenchmarkTools, LoopVectorization

function fun2(r::Matrix, od::Int64, γ::Float64, e::Float64)
    n = size(r)[1]; p = zero(r);
    for j in 1:n
        for i in 1:n
             p[i,j] =  r[i,j]^od*γ + exp(-(r[i,j]*e)^2)
        end
    end
    return p
end

function fun3(r::Matrix, od::Int64, γ::Float64, e::Float64)
    p = zero(r)
    @avx for i in eachindex(r)
        p[i] =  r[i]^od*γ + exp(-(r[i]*e)^2)
    end
    return p
end


r = rand(150,150)
od = 7
γ = 2.0
e = 1.0

@benchmark fun2($r,$od,$γ,$e)

@benchmark fun3($r,$od,$γ,$e)

Results:

julia> @benchmark fun2($r,$od,$γ,$e)
BenchmarkTools.Trial: 
  memory estimate:  175.89 KiB
  allocs estimate:  2
  --------------
  minimum time:     1.572 ms (0.00% GC) 
  median time:      1.643 ms (0.00% GC) 
  mean time:        1.671 ms (0.34% GC) 
  maximum time:     4.655 ms (61.94% GC)
  --------------
  samples:          2984
  evals/sample:     1


julia> @benchmark fun3($r,$od,$γ,$e)
BenchmarkTools.Trial: 
  memory estimate:  175.89 KiB
  allocs estimate:  2
  --------------
  minimum time:     237.100 μs (0.00% GC)
  median time:      258.300 μs (0.00% GC)
  mean time:        274.366 μs (1.78% GC)
  maximum time:     3.606 ms (91.37% GC)
  --------------
  samples:          10000
  evals/sample:     1

jebej · May 3, 2020, 6:00pm

As an aside, you don’t need to specify the exact types in the function signature. For example, you pass γ = 2.0 which you could just as an integer 2 instead. To allow both cases, relax the signature from:

fun(r::Matrix, od::Int64, γ::Float64, e::Float64)

to

fun(r::Matrix, od::Integer, γ::Real, e::Real)

This would allow you to pass e.g. BigFloat for increased precision (though for that to work you would also need to tweak how you allocate the p output matrix).

DNF · May 3, 2020, 6:21pm

It’s possible to also use @threads with @avx, though I saw no speedup for these array sizes.

You can shave off a few percent, however, by writing p = similar(r) instead of p = zero(r).

mcabbott · May 3, 2020, 6:46pm

Here is one compact way to add multi-threading which seems to help quite a bit:

r150 = rand(150,150); # same functions as above:
@btime fun1($r150, 7, 2.0, 1.0); # 488.889 μs (2 allocations: 175.89 KiB)
@btime fun2($r150, 7, 2.0, 1.0); # 491.793 μs (2 allocations: 175.89 KiB)

using Einsum
# the same loops as fun2, just notation:
fun_ein(r, od, γ, e) = @einsum p[i,j] := γ * r[i,j]^od + exp(-(r[i,j]*e)^2)
# and a variant with Threads.@threads:
fun_viel(r, od, γ, e) = @vielsum p[i,j] := γ * r[i,j]^od + exp(-(r[i,j]*e)^2)

fun_ein(r150, 7, 2.0, 1.0) ≈ fun1(r150, 7, 2.0, 1.0) # true

@btime fun_ein($r150, 7, 2.0, 1.0);  # 478.953 μs (2 allocations: 175.89 KiB)
@btime fun_viel($r150, 7, 2.0, 1.0); #  93.681 μs (65 allocations: 183.95 KiB)

This is on a 6-core machine, YMMV.

I don’t see as much speedup from LoopVectorization as @jebej did. But trying out an (un-registered, still-unreliable) package which uses that:

using LoopVectorization, Tullio
@btime fun3($r150, 7, 2.0, 1.0); # 261.102 μs (2 allocations: 175.89 KiB)

# this also uses @avx:
fun_tul(r, od, γ, e) = @tullio p[i,j] := $γ * r[i,j]^($od) + exp(-(r[i,j]*$e)^2)
# lower threshold to make it @spawn threads
fun_tul2(r, od, γ, e) = @tullio p[i,j] := $γ * r[i,j]^($od) + exp(-(r[i,j]*$e)^2) threads=2^12

@btime fun_tul($r150, 7, 2.0, 1.0);  # 256.528 μs (8 allocations: 176.08 KiB
@btime fun_tul2($r150, 7, 2.0, 1.0); #  62.490 μs (168 allocations: 188.83 KiB)

mcabbott · May 3, 2020, 6:58pm

Another thought is that you can easily do this with broadcasting. The basic version is the same speed as fun1, and:

using LoopVectorization
fun_avx(r, od, γ, e) = @avx @. γ * r^od + exp(-(r*e)^2)

using Strided # another way to add multi-threading
fun_str(r, od, γ, e) = @strided γ .* r.^od .+ exp.(.-(r.*e).^2)

@btime fun_avx($r150, 7, 2.0, 1.0); # 258.900 μs (2 allocations: 175.89 KiB)
@btime fun_str($r150, 7, 2.0, 1.0); #  99.693 μs (91 allocations: 187.45 KiB)

And finally, here’s how to get SIMD to work without loading packages:

function fun4(r::Matrix, od::Int64, γ::Float64, e::Float64)
    p = similar(r);
    @fastmath @simd for i in eachindex(r)
         @inbounds p[i] = r[i]^od*γ + exp(-(r[i]*e)^2)
    end
    return p
end
@btime fun4($r150, 7, 2.0, 1.0); # 198.393 μs (2 allocations: 175.89 KiB)

O_Julia · May 3, 2020, 7:41pm

My machine is also 6 core. I do not understand but, Einsum is not providing speedup for me. Here are the results

@benchmark fun_viel($r, 7, 2.0, 1.0)
BenchmarkTools.Trial: 
  memory estimate:  175.95 KiB
  allocs estimate:  3
  --------------
  minimum time:     1.294 ms (0.00% GC)
  median time:      1.298 ms (0.00% GC)
  mean time:        1.325 ms (0.54% GC)
  maximum time:     2.559 ms (43.18% GC)
  --------------
  samples:          3769
  evals/sample:     1

O_Julia · May 3, 2020, 7:44pm

LoopVectorization worked for me. Thank you.

mcabbott · May 3, 2020, 7:49pm

What does Threads.nthreads() return? For multi-threading you have to start Julia with an environment variable, whose spelling I forget but it is in the manual.

This is probably because of Base.literal_pow, which for power 3 should become x*x*x instead of x^p. For example:

p3(x) = x^3
@code_lowered p3(2.0)
@code_llvm p3(2.0)
pn(x,n) = x^n
@code_lowered pn(2.0, 3)
@code_llvm pn(2.0, 3)

O_Julia · May 3, 2020, 8:01pm

Thank you for point out Base.literal_pow ,
I use Atom ide and I set number of threads to 6 in settings. And now it retruns 6. But still the function runs slower both in Atom and Julia terminal.

mcabbott · May 3, 2020, 9:44pm

I don’t know, that’s weird. Do other threaded things work? Is the threaded one still slower for say rand(2000,2000) instead? Perhaps in other versions of Julia?

O_Julia · May 3, 2020, 10:00pm

I tried for rand(2000,2000). Here are timings

@benchmark fun_str($r,$od,$γ,$e)
BenchmarkTools.Trial: 
  memory estimate:  30.52 MiB
  allocs estimate:  57
  --------------
  minimum time:     40.636 ms (0.00% GC)
  median time:      41.096 ms (0.00% GC)
  mean time:        41.350 ms (0.49% GC)
  maximum time:     47.168 ms (0.00% GC)
  --------------
  samples:          121
  evals/sample:     1

@benchmark fun3($r,$od,$γ,$e)
BenchmarkTools.Trial: 
  memory estimate:  30.52 MiB
  allocs estimate:  2
  --------------
  minimum time:     46.655 ms (0.00% GC)
  median time:      47.504 ms (0.00% GC)
  mean time:        47.680 ms (0.50% GC)
  maximum time:     49.086 ms (0.00% GC)
  --------------
  samples:          105
  evals/sample:     1

 @benchmark fun_viel($r,$od,$γ,$e)
BenchmarkTools.Trial: 
  memory estimate:  30.52 MiB
  allocs estimate:  35
  --------------
  minimum time:     41.211 ms (0.00% GC)
  median time:      41.974 ms (0.00% GC)
  mean time:        42.253 ms (0.50% GC)
  maximum time:     47.335 ms (0.00% GC)
  --------------
  samples:          119
  evals/sample:     1

For large matrices, threaded looks better.

mcabbott · May 3, 2020, 10:20pm

Infinitesimally better, you’d expect close to 6x. I have no idea why it’s not working; something is happening since you are seeing a few more allocations. Does this print different numbers?

Threads.@threads for i in 1:10
    @show Threads.threadid()                                      
end

O_Julia · May 3, 2020, 10:29pm

Yes,

Threads.threadid() = 2
Threads.threadid() = 1
Threads.threadid() = 4
Threads.threadid() = 5
Threads.threadid() = 1
Threads.threadid() = 2
Threads.threadid() = 4
Threads.threadid() = 6
Threads.threadid() = 3
Threads.threadid() = 3

baggepinnen · May 4, 2020, 11:00am

Does the fastmath version of exp SIMD? If so, that’s great to know

Topic		Replies	Views
Any suggestion for speeding up my function? Performance question	11	427	November 17, 2022
Improving Performance of a Loop Performance	9	416	November 3, 2021
Help Improving Performance of a Loop Performance performance , loops	15	1074	February 16, 2021
Improving speed with iterative sums and functions within functions Performance question	4	312	January 17, 2024
Matrix computation Performance	6	765	July 27, 2020

Need help on improving performance of a small subfunction

Related topics