Accelerate pairwise damage computation

Can someone help me accelerate this pairwise damage computation? The problem is very similar to the one in "Accelerate pairwise Lennard-Jones force computation", and I adopted the conclusion from that thread that the pair list should be sorted first for better performance. The following code is modified from the version given by @lmiq in that thread.

```julia
using BenchmarkTools
using StaticArrays

const λ = 1.0E-6

@inline @fastmath function damage_pair!(f, x, y, i, j, d)
    new_d = 0.0
    for idim = 1:2
        new_d += (x[idim] - y[idim])^2
    end
    new_d = sqrt(new_d)
    str = new_d - d
    strhis = 0.0
    pairdam = 0.0
    if str > strhis
        strhis = str
    end
    if strhis > λ
        pairdam = 1.0 - exp(-1000.0*(str - λ))
    end
    @inbounds f[i] += pairdam
    @inbounds f[j] += pairdam
    return f
end

function cal_damage!(f, x, pairlist, pairlength)
    for ipair in 1:length(pairlist)
        i = pairlist[ipair][1]
        j = pairlist[ipair][2]
        d = pairlength[ipair]
        f = damage_pair!(f, x[i], x[j], i, j, d)
    end
    return f
end

n = 10_000       # number of particles
npairs = 8026946 # number of pairs within cutoff
pairlist = [ SVector{2,Int}(rand(1:n), rand(1:n)) for _ in 1:npairs ]
sort!(pairlist)
pairlength = [ rand() for _ in 1:npairs ]
x = [ rand(SVector{2,Float64}) for _ in 1:n ]
f = zeros(Float64, n)

@benchmark cal_damage!(f, x, pairlist, pairlength)
```

The runtime is about 130 ms, much slower than the 31 ms reported in "Accelerate pairwise Lennard-Jones force computation":

```
BenchmarkTools.Trial: 39 samples with 1 evaluation.
 Range (min … max):  130.446 ms … 133.848 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     131.095 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   131.420 ms ± 875.379 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  130 ms           Histogram: frequency by time          134 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

Then I commented out the `exp` in the `pairdam` computation:

```julia
pairdam = 1.0 #- exp(-1000.0*str)
```

The result is 37 ms, which means evaluating the `exp` function takes roughly 100 ms!

```
BenchmarkTools.Trial: 132 samples with 1 evaluation.
 Range (min … max):  36.775 ms … 44.848 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     37.599 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   37.924 ms ±  1.224 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  36.8 ms         Histogram: frequency by time        44.4 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

Then I commented out the two `if` statements in `damage_pair!`:

```julia
@inline @fastmath function damage_pair!(f, x, y, i, j, d)
    new_d = 0.0
    for idim = 1:2
        new_d += (x[idim] - y[idim])^2
    end
    new_d = sqrt(new_d)
    str = new_d - d
    strhis = 0.0
    pairdam = 0.0
    # if str > strhis
        strhis = str
    # end
    # if strhis > λ
        pairdam = 1.0 #- exp(-1000.0*(str-λ))
    # end
    @inbounds f[i] += pairdam
    @inbounds f[j] += pairdam
    return f
end
```

The result is 22ms.

```
BenchmarkTools.Trial: 222 samples with 1 evaluation.
 Range (min … max):  22.185 ms …  25.645 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     22.393 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   22.545 ms ± 456.674 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  22.2 ms         Histogram: frequency by time         24.9 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

As shown above, evaluating the `exp` function takes roughly 100 ms, and the two `if` statements take about 15 ms. I have tried putting the `exp` into a separate function, but the performance is the same; and using `2.718281828459045^(-1000.0*(strhis-λ))` as shown in "Slow exp()?", the performance is worse, around 340 ms.
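As a rough sanity check (a sketch of my own, not code from the thread): ~100 ms spread over roughly 8×10⁶ pairs is on the order of 10 ns per call, which is a plausible cost for a scalar `exp`. The per-call cost can be isolated like this:

```julia
using BenchmarkTools

# Arguments of a magnitude similar to -1000*(str - λ) in the damage kernel.
v = -1000.0 .* rand(8_026_946)

# Summing exp over the vector isolates the cost of exp itself;
# dividing the reported time by length(v) gives nanoseconds per call.
@btime sum(exp, $v)
```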

Is there any possible way to accelerate the computation of `exp` and the evaluation of the `if` branches?

A late response. I've only found a small improvement, using conditional updates of `f`:

```julia
using BenchmarkTools, StaticArrays
using LinearAlgebra  # for norm

# λ is the threshold constant from the original post (const λ = 1.0E-6)
function cal_damage2!(f, x, pairs)
    @fastmath @inbounds for (i, j, d) in pairs
        str = norm(x[i] - x[j], 2) - d
        if str > λ
            pairdam = 1.0 - exp(-1000.0*(str - λ))
            f[i] += pairdam
            f[j] += pairdam
        end
    end
    return f
end

function cal_damage_wo_exp!(f, x, pairs)
    @fastmath @inbounds for (i, j, d) in pairs
        str = norm(x[i] - x[j], 2) - d
        if str > λ
            f[i] += 1.0
            f[j] += 1.0
        end
    end
    return f
end

function setup()
    n = 10_000       # number of particles
    npairs = 8026946 # number of pairs within cutoff
    pairs = [ (rand(1:n), rand(1:n), rand()) for _ in 1:npairs ]
    sort!(pairs)
    x = [ rand(SVector{2,Float64}) for _ in 1:n ]
    f = zeros(Float64, n)
    f, x, pairs
end

# cal_damage! here is the original version, adapted to the (i,j,d) tuple layout
(f, x, pairs) = setup()
cal_damage!(f, x, pairs)
f1 = copy(f)
fill!(f, 0.0)
cal_damage2!(f, x, pairs)
f2 = copy(f)
@assert f2 ≈ f1

(f, x, pairs) = setup()
@btime cal_damage!($f, $x, $pairs) setup=(fill!($f, 0.0))
@btime cal_damage2!($f, $x, $pairs) setup=(fill!($f, 0.0))
@btime cal_damage_wo_exp!($f, $x, $pairs) setup=(fill!($f, 0.0))
```

yielding

```
106.239 ms (0 allocations: 0 bytes) # cal_damage!
 96.990 ms (0 allocations: 0 bytes) # cal_damage2!
 77.479 ms (0 allocations: 0 bytes) # without exp
```

For me the performance of `exp` isn't that bad. Nevertheless, here is a Stack Overflow discussion about ways to implement a fast `exp`. I don't know if these are better than Julia's (`@fastmath`) implementation.

Thanks for your reply! Conditional updates of `f` do improve the performance. I tried your code and got performance similar to your post, though the improvement varies with the generated random numbers. In addition, when I put the body of the for-loop into a function, as I did before, evaluating `exp()` again takes a lot of time. This is strange to me, however.

Thanks for your reply! I tried the `ifelse` statement, which does improve the performance.
With the `exp()` commented out, the performance is close to simply looping over all the pairs and adding a fixed number.

LoopVectorization has a really fast vectorized `exp` implementation. (Julia's `@fastmath exp` is really fast for an accurate scalar version.)
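As an illustration (a minimal sketch of my own; the helper name `damage_exp!` is hypothetical): applying `exp` across a whole vector under `@turbo` lets LoopVectorization substitute its SIMD `exp`:

```julia
using LoopVectorization

# Hypothetical helper: fill `out` with 1 - exp(-1000*(s - λ)) for every
# element of `str_vec`; @turbo vectorizes the loop, including the exp calls.
function damage_exp!(out, str_vec, λ)
    @turbo for k in eachindex(out)
        out[k] = 1.0 - exp(-1000.0 * (str_vec[k] - λ))
    end
    return out
end

str_vec = rand(10^6)
out = similar(str_vec)
damage_exp!(out, str_vec, 1.0E-6)
```

This only pays off when the `exp` arguments are stored contiguously, which is exactly the "separate loops" idea discussed below.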


I deleted the post because there was an error, which results from the fact that `ifelse` evaluates both arguments; thus updating an array with it is wrong:

```julia
julia> x = [0, 0, 0];

julia> for i in 1:3
           ifelse(true, x[1] += 1, x[1] -= 1)
       end

julia> x
3-element Vector{Int64}:
 0
 0
 0
```

When I fixed that in the code, the performance I got was the same as the code with the standard conditional, because the fundamental difference is being able (or not) to access the elements of the array unconditionally, allowing SIMD.

This one is correct and apparently slightly faster:

```julia
function cal_damage_wo_exp_ifelse!(f, x, pairs)
    λ = 1.0E-6
    @fastmath @inbounds for (i, j, d) in pairs
        str = norm(x[i] - x[j], 2) - d
        pairdam = ifelse(str > λ, 1.0 - exp(-1000.0*(str - λ)), 0.0)
        f[i] += pairdam
        f[j] += pairdam
    end
    return f
end
```

This is about 40% faster than your version and a bit faster than the versions suggested in this thread. My idea is to separate the calculation of `exp` to enable vectorization, that is, to trade some memory for speed. Tullio and LoopVectorization play nicely together:

```
# 113.245 ms (0 allocations: 0 bytes)     Original version
# 104.034 ms (0 allocations: 0 bytes)     @goerch version
#  96.728 ms (0 allocations: 0 bytes)     @lmiq version
#  75.334 ms (119 allocations: 122.49 MiB) New version
```

Here is my version:

```julia
using StaticArrays, Tullio, LoopVectorization
using LinearAlgebra    # for norm
using BenchmarkTools   # for @btime

function cal_damage_sep_loops!(f, x, pairs)
    λ = 1.0E-6
    str_vec = [ norm(x[i] - x[j]) - d for (i, j, d) in pairs ]
    @tullio exp_vec[k] := 1 - exp(-1000*(str_vec[k] - λ))
    for k = 1:length(pairs)
        pairdam = ifelse(str_vec[k] > λ, exp_vec[k], 0.0)
        i, j = pairs[k][1], pairs[k][2]
        f[i] += pairdam
        f[j] += pairdam
    end
end

function setup()
    n = 10_000       # number of particles
    npairs = 8026946 # number of pairs within cutoff
    pairs = [ (rand(1:n), rand(1:n), rand()) for _ in 1:npairs ]
    sort!(pairs)
    x = [ rand(SVector{2,Float64}) for _ in 1:n ]
    f = zeros(n)
    f, x, pairs
end

f, x, pairs = setup()
@btime cal_damage_sep_loops!($f, $x, $pairs)
```

Thanks a lot for pointing this out and explaining the mechanism behind it. I have really learned something. On my PC, this version is about 10 ms faster than conditionally updating `f`.

Thanks for your reply! Your version does improve the performance a lot. I would like to know whether this improvement has something to do with what @Oscar_Smith mentioned.
However, the memory allocation is close to that of solving a system of linear equations with 20,000 degrees of freedom. In my simulation program, the damage is calculated in every iteration, so the GC time will increase accordingly. Is there any possible solution?

To avoid accumulated memory allocations, pre-allocate the vectors `str_vec` and `exp_vec` outside the function and reuse the allocated space inside it; then you can call the function as many times as you wish and no new memory will be allocated. Use the version of `@tullio` without the `:=`, that is, just write `@tullio exp_vec[k] = ..`. And yes, what @Oscar_Smith mentioned is correct: the vectorized `exp` from `LoopVectorization` is much faster.


Thanks a lot. I tried it, but there are still extra memory allocations in

```julia
str_vec .= [norm(x[i]-x[j]) - d for (i,j,d) in pairs]
```

I donβt quite understand, there should be no allocation with the `.` operator. Then I unroll it into explicit `for` loop, then the memory allocations go to zero. Here is the final code.

```julia
using StaticArrays, Tullio, LoopVectorization, BenchmarkTools, LinearAlgebra

function cal_damage_sep_loops!(f, x, pairs, str_vec, exp_vec)
    λ = 1.0E-6
    # @time str_vec .= [norm(x[i]-x[j]) - d for (i,j,d) in pairs]
    ithpair = 0
    for (i, j, d) in pairs
        ithpair += 1
        str_vec[ithpair] = norm(x[i] - x[j], 2) - d
    end
    # @tullio exp_vec[k] := 1 - exp(-1000*(str_vec[k]-λ))
    @tullio exp_vec[k] = 1.0 - exp(-1000.0*(str_vec[k] - λ))
    for k = 1:length(pairs)
        pairdam = ifelse(str_vec[k] > λ, exp_vec[k], 0.0)
        i, j = pairs[k][1], pairs[k][2]
        f[i] += pairdam
        f[j] += pairdam
    end
end

function setup()
    n = 10_000       # number of particles
    npairs = 8026946 # number of pairs within cutoff
    pairs = [ (rand(1:n), rand(1:n), rand()) for _ in 1:npairs ]
    sort!(pairs)
    x = [ rand(SVector{2,Float64}) for _ in 1:n ]
    f = zeros(n)
    str_vec = zeros(Float64, npairs)
    exp_vec = zeros(Float64, npairs)
    f, x, pairs, str_vec, exp_vec
end

f, x, pairs, str_vec, exp_vec = setup()
@btime cal_damage_sep_loops!($f, $x, $pairs, $str_vec, $exp_vec)
```

The performance is

```
67.343 ms (0 allocations: 0 bytes)
```

The comprehension on the right-hand side allocates a new array.
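To illustrate (a minimal sketch of my own, with hypothetical helper names `g!` and `f!`): even with `.=`, a comprehension on the right-hand side is collected into a temporary array before being copied into the destination, while a fused broadcast writes into the destination directly:

```julia
b = rand(10^5)
a = similar(b)

# The comprehension builds a temporary Vector first; `.=` then copies it:
g!(a, b) = (a .= [xi^2 for xi in b]; a)

# The fused broadcast computes each element and stores it in place:
f!(a, b) = (a .= b .^ 2; a)

g!(a, b)  # allocates a temporary of the same size as b
f!(a, b)  # 0 allocations (after compilation)
```

The explicit `for` loop in the final code above avoids the temporary for the same reason.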


Thanks a lot! I get it.