# Need help on improving performance of a small subfunction

There is a function in my main program, and I have realised that it consumes most of the time in my program. Is it possible to speed up the function given below?

```julia
function fun1(r::Matrix, od::Int64, γ::Float64, e::Float64)
    n, m = size(r)
    p = zero(r)
    if od == 3
        for j in 1:m
            for i in 1:n
                p[i, j] = r[i, j]^3 * γ + exp(-(r[i, j] * e)^2)
            end
        end
    elseif od == 5
        for j in 1:m
            for i in 1:n
                p[i, j] = r[i, j]^5 * γ + exp(-(r[i, j] * e)^2)
            end
        end
    elseif od == 7
        for j in 1:m
            for i in 1:n
                p[i, j] = r[i, j]^7 * γ + exp(-(r[i, j] * e)^2)
            end
        end
    else
        println("enough")
    end
    return p
end
```

Updated code, after tamasgal's answer:

```julia
using Printf
using BenchmarkTools

function fun2(r::Matrix, od::Int64, γ::Float64, e::Float64)
    n, m = size(r)
    p = zero(r)
    for j in 1:m
        for i in 1:n
            p[i, j] = r[i, j]^od * γ + exp(-(r[i, j] * e)^2)
        end
    end
    return p
end
```

I used `@simd` before the for loops and `@inbounds` before `p[i,j]`, but saw no significant difference in speed. On the other hand, using `@fastmath` gives some improvement; however, the results are not always correct. I also used the profiler and `@code_warntype`, but I could not achieve any improvement.
Thank you.
Thank you.

Elapsed time for me:

```
@benchmark fun1(rand(150,150), 7, 2.0, 1.0)
BenchmarkTools.Trial:
  memory estimate:  175.89 KiB
  allocs estimate:  2
  --------------
  minimum time:     1.251 ms (0.00% GC)
  median time:      1.269 ms (0.00% GC)
  mean time:        1.281 ms (0.16% GC)
  maximum time:     2.449 ms (35.99% GC)
  --------------
  samples:          3897
  evals/sample:     1
```

First, you already imported `BenchmarkTools`, so make use of it. With your code you measure the compilation time too when you first call `fun(...)`. The `@btime` macro from `BenchmarkTools` takes care of this and also runs the function multiple times to get more statistics:

```
julia> @btime fun(r, od, γ, e);
  552.992 μs (2 allocations: 175.89 KiB)
```

Second: I don’t understand why you do this weird `if od == ...` instead of simply using `od` as a variable. This function is the same as yours, except that it does not print `"enough"` if you pass in anything other than 3, 5, or 7 for `od`:

```julia
function fun(r::Matrix, od::Int64, γ::Float64, e::Float64)
    n, m = size(r)
    p = zero(r)
    for j in 1:m
        for i in 1:n
            p[i, j] = r[i, j]^od * γ + exp(-(r[i, j] * e)^2)
        end
    end
    return p
end
```

with the same performance:

```
julia> @btime fun(r, od, γ, e);
  551.100 μs (2 allocations: 175.89 KiB)
```

You can simplify it even more by getting rid of the nested loop and using `eachindex()`; however, it’s slightly slower (no idea why):

```julia
function fun(r::Matrix, od::Int64, γ::Float64, e::Float64)
    p = zero(r)
    @simd for i in eachindex(r)
        @inbounds p[i] = r[i]^od * γ + exp(-(r[i] * e)^2)
    end
    return p
end
```
```
julia> @btime fun(r, od, γ, e);
  610.206 μs (2 allocations: 175.89 KiB)
```

Other than that, I don’t see any obvious improvement possibilities, but I am sure others will.

Thank you for your quick answer. I have edited the function according to your suggestion. On my machine the new elapsed time is as follows:

```
BenchmarkTools.Trial:
  memory estimate:  175.89 KiB
  allocs estimate:  2
  --------------
  minimum time:     1.255 ms (0.00% GC)
  median time:      1.269 ms (0.00% GC)
  mean time:        1.281 ms (0.15% GC)
  maximum time:     1.856 ms (0.00% GC)
  --------------
  samples:          3899
  evals/sample:     1
```

The elapsed time for the first version was as follows:

```
BenchmarkTools.Trial:
  memory estimate:  175.89 KiB
  allocs estimate:  2
  --------------
  minimum time:     1.251 ms (0.00% GC)
  median time:      1.270 ms (0.00% GC)
  mean time:        1.281 ms (0.18% GC)
  maximum time:     2.133 ms (37.85% GC)
  --------------
  samples:          3899
  evals/sample:     1
```

If I call the functions with `od = 3`, then `fun1` is faster than `fun2`. Why?

```
@benchmark fun(rand(150,150), 3, 2.0, 1.0)
BenchmarkTools.Trial:
  memory estimate:  175.89 KiB
  allocs estimate:  2
  --------------
  minimum time:     229.494 μs (0.00% GC)
  median time:      235.477 μs (0.00% GC)
  mean time:        240.380 μs (0.87% GC)
  maximum time:     976.972 μs (74.30% GC)
  --------------
  samples:          10000
  evals/sample:     1
```
```
@benchmark fun2(rand(150,150), 3, 2.0, 1.0)
BenchmarkTools.Trial:
  memory estimate:  351.78 KiB
  allocs estimate:  4
  --------------
  minimum time:     1.271 ms (0.00% GC)
  median time:      1.291 ms (0.00% GC)
  mean time:        1.305 ms (0.37% GC)
  maximum time:     2.099 ms (35.34% GC)
  --------------
  samples:          3829
  evals/sample:     1
```

Now you are benchmarking the generation of the random numbers as well; I don’t think that is intentional.
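For example, you can generate the input once and pass it into the benchmark with BenchmarkTools’ `$`-interpolation, so neither the `rand()` call nor global-variable access is timed. A minimal self-contained sketch (`r150` is just a name I chose; the kernel repeats `fun2` from above):

```julia
using BenchmarkTools

# Same kernel as fun2 above, repeated so this snippet is self-contained.
function fun2(r::Matrix, od::Int64, γ::Float64, e::Float64)
    p = zero(r)
    for j in axes(r, 2), i in axes(r, 1)
        p[i, j] = r[i, j]^od * γ + exp(-(r[i, j] * e)^2)
    end
    return p
end

# Generate the input once, outside the benchmark.
r150 = rand(150, 150)

# $-interpolation makes r150 a benchmark-local constant,
# so only fun2 itself is measured.
@benchmark fun2($r150, 7, 2.0, 1.0)
```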


Could you try it with variable interpolation with `$`?

Yes, I did originally; it made no difference. I am really confused why `eachindex()` is slower… I always use it without further checks.

Great use case for LoopVectorization, if your hardware supports AVX:

```julia
using BenchmarkTools, LoopVectorization

function fun2(r::Matrix, od::Int64, γ::Float64, e::Float64)
    n, m = size(r)
    p = zero(r)
    for j in 1:m
        for i in 1:n
            p[i, j] = r[i, j]^od * γ + exp(-(r[i, j] * e)^2)
        end
    end
    return p
end

function fun3(r::Matrix, od::Int64, γ::Float64, e::Float64)
    p = zero(r)
    @avx for i in eachindex(r)
        p[i] = r[i]^od * γ + exp(-(r[i] * e)^2)
    end
    return p
end

r = rand(150,150)
od = 7
γ = 2.0
e = 1.0

@benchmark fun2($r, $od, $γ, $e)

@benchmark fun3($r, $od, $γ, $e)
```

Results:

```
julia> @benchmark fun2($r, $od, $γ, $e)
BenchmarkTools.Trial:
  memory estimate:  175.89 KiB
  allocs estimate:  2
  --------------
  minimum time:     1.572 ms (0.00% GC)
  median time:      1.643 ms (0.00% GC)
  mean time:        1.671 ms (0.34% GC)
  maximum time:     4.655 ms (61.94% GC)
  --------------
  samples:          2984
  evals/sample:     1

julia> @benchmark fun3($r, $od, $γ, $e)
BenchmarkTools.Trial:
  memory estimate:  175.89 KiB
  allocs estimate:  2
  --------------
  minimum time:     237.100 μs (0.00% GC)
  median time:      258.300 μs (0.00% GC)
  mean time:        274.366 μs (1.78% GC)
  maximum time:     3.606 ms (91.37% GC)
  --------------
  samples:          10000
  evals/sample:     1
```

As an aside, you don’t need to specify the exact types in the function signature. For example, you pass `γ = 2.0`, which you could just as well pass as the integer `2` instead. To allow both cases, relax the signature from:

```julia
fun(r::Matrix, od::Int64, γ::Float64, e::Float64)
```

to

```julia
fun(r::Matrix, od::Integer, γ::Real, e::Real)
```

This would allow you to pass e.g. `BigFloat` for increased precision (though for that to work you would also need to tweak how you allocate the `p` output matrix).
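A minimal sketch of such a generic version (the `promote_type`-based allocation and the name `fun_generic` are my own illustration, not from the thread):

```julia
function fun_generic(r::AbstractMatrix, od::Integer, γ::Real, e::Real)
    # Allocate the output in the element type the arithmetic produces,
    # so e.g. BigFloat inputs yield a BigFloat result.
    T = promote_type(eltype(r), typeof(γ), typeof(e))
    p = similar(r, T)
    for j in axes(r, 2), i in axes(r, 1)
        p[i, j] = r[i, j]^od * γ + exp(-(r[i, j] * e)^2)
    end
    return p
end

fun_generic(rand(3, 3), 3, 2, 1.0)          # integer γ works
fun_generic(big.(rand(3, 3)), 3, 2.0, 1.0)  # BigFloat precision works
```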


It’s possible to also use `@threads` with `@avx`, though I saw no speedup for these array sizes.

You can shave off a few percent, however, by writing `p = similar(r)` instead of `p = zero(r)`.
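In a minimal sketch, the difference between the two allocations is:

```julia
r = rand(150, 150)
p1 = zero(r)     # allocates and zero-fills; the fill is wasted work here,
                 # since the loop overwrites every element anyway
p2 = similar(r)  # allocates uninitialized memory of the same size and type
```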

Here is one compact way to add multi-threading which seems to help quite a bit:

```julia
r150 = rand(150,150); # same functions as above:
@btime fun1($r150, 7, 2.0, 1.0); # 488.889 μs (2 allocations: 175.89 KiB)
@btime fun2($r150, 7, 2.0, 1.0); # 491.793 μs (2 allocations: 175.89 KiB)

using Einsum
# the same loops as fun2, just different notation:
fun_ein(r, od, γ, e) = @einsum p[i,j] := γ * r[i,j]^od + exp(-(r[i,j]*e)^2)
fun_viel(r, od, γ, e) = @vielsum p[i,j] := γ * r[i,j]^od + exp(-(r[i,j]*e)^2)

fun_ein(r150, 7, 2.0, 1.0) ≈ fun1(r150, 7, 2.0, 1.0) # true

@btime fun_ein($r150, 7, 2.0, 1.0);  # 478.953 μs (2 allocations: 175.89 KiB)
@btime fun_viel($r150, 7, 2.0, 1.0); #  93.681 μs (65 allocations: 183.95 KiB)
```

This is on a 6-core machine, YMMV.

I don’t see as much speedup from LoopVectorization as @jebej did. But trying out an (un-registered, still-unreliable) package which uses that:

```julia
using LoopVectorization, Tullio
@btime fun3($r150, 7, 2.0, 1.0); # 261.102 μs (2 allocations: 175.89 KiB)

# this also uses @avx:
fun_tul(r, od, γ, e) = @tullio p[i,j] := $γ * r[i,j]^($od) + exp(-(r[i,j]*$e)^2)
# lower the threshold so that it @spawn's threads:
fun_tul2(r, od, γ, e) = @tullio p[i,j] := $γ * r[i,j]^($od) + exp(-(r[i,j]*$e)^2) threads=2^12

@btime fun_tul($r150, 7, 2.0, 1.0);  # 256.528 μs (8 allocations: 176.08 KiB)
@btime fun_tul2($r150, 7, 2.0, 1.0); #  62.490 μs (168 allocations: 188.83 KiB)
```

Another thought is that you can easily write this with broadcasting. The basic broadcast version is the same speed as `fun1`; with packages it gets faster:

```julia
using LoopVectorization, Strided  # Strided provides @strided

fun_avx(r, od, γ, e) = @avx @. γ * r^od + exp(-(r*e)^2)

fun_str(r, od, γ, e) = @strided γ .* r.^od .+ exp.(.-(r.*e).^2)

@btime fun_avx($r150, 7, 2.0, 1.0); # 258.900 μs (2 allocations: 175.89 KiB)
@btime fun_str($r150, 7, 2.0, 1.0); #  99.693 μs (91 allocations: 187.45 KiB)
```

And finally, here’s how to get SIMD to work without loading packages:

```julia
function fun4(r::Matrix, od::Int64, γ::Float64, e::Float64)
    p = similar(r)
    @fastmath @simd for i in eachindex(r)
        @inbounds p[i] = r[i]^od * γ + exp(-(r[i] * e)^2)
    end
    return p
end
@btime fun4($r150, 7, 2.0, 1.0); # 198.393 μs (2 allocations: 175.89 KiB)
```

My machine also has 6 cores. I do not understand why, but Einsum is not providing a speedup for me. Here are the results:

```
@benchmark fun_viel($r, 7, 2.0, 1.0)
BenchmarkTools.Trial:
  memory estimate:  175.95 KiB
  allocs estimate:  3
  --------------
  minimum time:     1.294 ms (0.00% GC)
  median time:      1.298 ms (0.00% GC)
  mean time:        1.325 ms (0.54% GC)
  maximum time:     2.559 ms (43.18% GC)
  --------------
  samples:          3769
  evals/sample:     1
```

LoopVectorization worked for me. Thank you.

What does `Threads.nthreads()` return? For multi-threading you have to start Julia with the `JULIA_NUM_THREADS` environment variable set; see the manual.
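For instance (a minimal sketch; the shell line assumes a Unix shell):

```julia
# Start Julia with e.g. `JULIA_NUM_THREADS=6 julia`, then check:
Threads.nthreads()  # number of threads this session was started with;
                    # 1 means the threaded variants cannot be faster
```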

This is probably because of `Base.literal_pow`, which turns the literal power `x^3` into `x*x*x` instead of calling the generic `x^p`. For example:

```julia
p3(x) = x^3
@code_lowered p3(2.0)
@code_llvm p3(2.0)

pn(x, n) = x^n
@code_lowered pn(2.0, 3)
@code_llvm pn(2.0, 3)
```
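As a side note, one way to keep a single function while recovering that specialization is dispatching on `Val`, so the compiler sees the exponent as a compile-time constant; this `fun_val` sketch is my own illustration, not from the thread:

```julia
# Passing Val(od) moves the exponent into the type domain, so the compiler
# specializes the whole loop body on the exponent value, similar in effect
# to the hard-coded branches in fun1.
function fun_val(r::Matrix, ::Val{od}, γ::Float64, e::Float64) where {od}
    p = similar(r)
    for j in axes(r, 2), i in axes(r, 1)
        p[i, j] = r[i, j]^od * γ + exp(-(r[i, j] * e)^2)
    end
    return p
end

fun_val(rand(150, 150), Val(3), 2.0, 1.0)  # specialized on exponent 3
```

The trade-off: each distinct `od` triggers a fresh compilation, which is fine when only a few exponents (3, 5, 7) are ever used.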

Thank you for pointing out `Base.literal_pow`.
I use the Atom IDE and I set the number of threads to 6 in the settings; it now returns 6. But the function still runs slower, in both Atom and the Julia terminal.

I don’t know, that’s weird. Do other threaded things work? Is the threaded one still slower for say `rand(2000,2000)` instead? Perhaps in other versions of Julia?

I tried with `rand(2000,2000)`. Here are the timings:

```
@benchmark fun_str($r, $od, $γ, $e)
BenchmarkTools.Trial:
  memory estimate:  30.52 MiB
  allocs estimate:  57
  --------------
  minimum time:     40.636 ms (0.00% GC)
  median time:      41.096 ms (0.00% GC)
  mean time:        41.350 ms (0.49% GC)
  maximum time:     47.168 ms (0.00% GC)
  --------------
  samples:          121
  evals/sample:     1
```
```
@benchmark fun3($r, $od, $γ, $e)
BenchmarkTools.Trial:
  memory estimate:  30.52 MiB
  allocs estimate:  2
  --------------
  minimum time:     46.655 ms (0.00% GC)
  median time:      47.504 ms (0.00% GC)
  mean time:        47.680 ms (0.50% GC)
  maximum time:     49.086 ms (0.00% GC)
  --------------
  samples:          105
  evals/sample:     1
```
```
@benchmark fun_viel($r, $od, $γ, $e)
BenchmarkTools.Trial:
  memory estimate:  30.52 MiB
  allocs estimate:  35
  --------------
  minimum time:     41.211 ms (0.00% GC)
  median time:      41.974 ms (0.00% GC)
  mean time:        42.253 ms (0.50% GC)
  maximum time:     47.335 ms (0.00% GC)
  --------------
  samples:          119
  evals/sample:     1
```

For large matrices, threaded looks better.

Infinitesimally better; you’d expect close to 6×. I have no idea why it’s not working; something is happening, though, since you are seeing a few more allocations. Does this print different numbers?

```julia
Threads.@threads for i in 1:10
    @show Threads.threadid()
end
```

Yes,

```
Threads.threadid() = 2
```

Does the fastmath version of `exp` use SIMD? If so, that’s great to know.