Accelerate pairwise damage computation

Can someone help me accelerate this pairwise damage computation? The problem is very similar to the one in "Accelerate pairwise Lennard-Jones force computation", and I adopted the conclusion from that thread that the pair list should be sorted first for better performance. The following code is modified from the version given by @lmiq in that thread.

```julia
using BenchmarkTools
using StaticArrays

const λ = 1.0E-6

@inline @fastmath function damage_pair!(f, x, y, i, j, d)
    new_d = 0.0
    for idim = 1:2
        new_d += (x[idim] - y[idim])^2
    end
    new_d = sqrt(new_d)
    str = new_d - d
    strhis = 0.0
    pairdam = 0.0
    if str > strhis
        strhis = str
    end
    if strhis > λ
        pairdam = 1.0 - exp(-1000.0*(str - λ))
    end
    @inbounds f[i] += pairdam
    @inbounds f[j] += pairdam
    return f
end

function cal_damage!(f, x, pairlist, pairlength)
    for ipair in 1:length(pairlist)
        i = pairlist[ipair][1]
        j = pairlist[ipair][2]
        d = pairlength[ipair]
        f = damage_pair!(f, x[i], x[j], i, j, d)
    end
    return f
end

n = 10_000       # number of particles
npairs = 8026946 # number of pairs within cutoff
pairlist = [ SVector{2,Int}(rand(1:n), rand(1:n)) for _ in 1:npairs ]
sort!(pairlist)
pairlength = [ rand() for _ in 1:npairs ]
x = [ rand(SVector{2,Float64}) for _ in 1:n ]
f = zeros(Float64, n)

@benchmark cal_damage!(f, x, pairlist, pairlength)
```

The runtime is about 130 ms, much slower than the 31 ms reported in "Accelerate pairwise Lennard-Jones force computation":

```
BenchmarkTools.Trial: 39 samples with 1 evaluation.
 Range (min … max):  130.446 ms … 133.848 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     131.095 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   131.420 ms ± 875.379 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  130 ms           Histogram: frequency by time          134 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

Then I commented out the `exp` in the `pairdam` computation:

```julia
pairdam = 1.0 #- exp(-1000.0*str)
```

The result is 37 ms, which means evaluating the `exp` function takes roughly 100 ms!

```
BenchmarkTools.Trial: 132 samples with 1 evaluation.
 Range (min … max):  36.775 ms … 44.848 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     37.599 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   37.924 ms ±  1.224 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  36.8 ms         Histogram: frequency by time        44.4 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

Then I commented out the two `if` statements in `damage_pair!`:

```julia
@inline @fastmath function damage_pair!(f, x, y, i, j, d)
    new_d = 0.0
    for idim = 1:2
        new_d += (x[idim] - y[idim])^2
    end
    new_d = sqrt(new_d)
    str = new_d - d
    strhis = 0.0
    pairdam = 0.0
    # if str > strhis
        strhis = str
    # end
    # if strhis > λ
        pairdam = 1.0 #- exp(-1000.0*(str-λ))
    # end
    @inbounds f[i] += pairdam
    @inbounds f[j] += pairdam
    return f
end
```

The result is 22ms.

```
BenchmarkTools.Trial: 222 samples with 1 evaluation.
 Range (min … max):  22.185 ms …  25.645 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     22.393 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   22.545 ms ± 456.674 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  22.2 ms         Histogram: frequency by time         24.9 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

As shown above, evaluating the `exp` function takes roughly 100 ms, and the two `if` statements take about 15 ms. I have tried putting the `exp` into a separate function, but the performance is the same; and using `2.718281828459045^(-1000.0*(strhis-λ))` as shown in "Slow exp()?", the performance is worse, around 340 ms.
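As a rough sanity check (a sketch of my own, not code from the thread): ~100 ms spread over roughly 8×10⁶ pairs is on the order of 10 ns per call, which is a plausible cost for a scalar `exp`. The per-call cost can be isolated like this:

```julia
using BenchmarkTools

# Arguments of a magnitude similar to -1000*(str - λ) in the damage kernel.
v = -1000.0 .* rand(8_026_946)

# Summing exp over the vector isolates the cost of exp itself;
# dividing the reported time by length(v) gives nanoseconds per call.
@btime sum(exp, $v)
```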

Is there any possible way to accelerate the computation of `exp` and the evaluation of the `if` branches?

A late response. I've only found a small improvement, using conditional updates of `f`:

```julia
using BenchmarkTools, StaticArrays
using LinearAlgebra  # for norm

# λ is the threshold constant from the original post (const λ = 1.0E-6)
function cal_damage2!(f, x, pairs)
    @fastmath @inbounds for (i, j, d) in pairs
        str = norm(x[i] - x[j], 2) - d
        if str > λ
            pairdam = 1.0 - exp(-1000.0*(str - λ))
            f[i] += pairdam
            f[j] += pairdam
        end
    end
    return f
end

function cal_damage_wo_exp!(f, x, pairs)
    @fastmath @inbounds for (i, j, d) in pairs
        str = norm(x[i] - x[j], 2) - d
        if str > λ
            f[i] += 1.0
            f[j] += 1.0
        end
    end
    return f
end

function setup()
    n = 10_000       # number of particles
    npairs = 8026946 # number of pairs within cutoff
    pairs = [ (rand(1:n), rand(1:n), rand()) for _ in 1:npairs ]
    sort!(pairs)
    x = [ rand(SVector{2,Float64}) for _ in 1:n ]
    f = zeros(Float64, n)
    f, x, pairs
end

# cal_damage! here is the original version, adapted to the (i,j,d) tuple layout
(f, x, pairs) = setup()
cal_damage!(f, x, pairs)
f1 = copy(f)
fill!(f, 0.0)
cal_damage2!(f, x, pairs)
f2 = copy(f)
@assert f2 ≈ f1

(f, x, pairs) = setup()
@btime cal_damage!($f, $x, $pairs) setup=(fill!($f, 0.0))
@btime cal_damage2!($f, $x, $pairs) setup=(fill!($f, 0.0))
@btime cal_damage_wo_exp!($f, $x, $pairs) setup=(fill!($f, 0.0))
```

yielding

```
106.239 ms (0 allocations: 0 bytes) # cal_damage!
 96.990 ms (0 allocations: 0 bytes) # cal_damage2!
 77.479 ms (0 allocations: 0 bytes) # without exp
```

For me the performance of `exp` isn't that bad. Nevertheless, here is a Stack Overflow discussion about ways to implement a fast `exp`. I don't know if these are better than Julia's (`@fastmath`) implementation.

Thanks for your reply! Conditional updates of `f` do improve the performance. I tried your code and got performance similar to your post, though the improvement varies with the generated random numbers. In addition, when I put the body of the for-loop into a function, as I did before, evaluating `exp()` again takes a lot of time. This is strange to me, however.

Thanks for your reply! I tried the `ifelse` statement, which does improve the performance.
With the `exp()` commented out, the performance is close to simply looping over all the pairs and adding a fixed number.

LoopVectorization has a really fast vectorized `exp` implementation. (Julia's `@fastmath exp` is really fast for an accurate scalar version.)
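As an illustration (a minimal sketch of my own; the helper name `damage_exp!` is hypothetical): applying `exp` across a whole vector under `@turbo` lets LoopVectorization substitute its SIMD `exp`:

```julia
using LoopVectorization

# Hypothetical helper: fill `out` with 1 - exp(-1000*(s - λ)) for every
# element of `str_vec`; @turbo vectorizes the loop, including the exp calls.
function damage_exp!(out, str_vec, λ)
    @turbo for k in eachindex(out)
        out[k] = 1.0 - exp(-1000.0 * (str_vec[k] - λ))
    end
    return out
end

str_vec = rand(10^6)
out = similar(str_vec)
damage_exp!(out, str_vec, 1.0E-6)
```

This only pays off when the `exp` arguments are stored contiguously, which is exactly the "separate loops" idea discussed below.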


I deleted the post because there was an error, which results from the fact that `ifelse` evaluates both arguments; thus updating an array with it is wrong:

```julia
julia> x = [0, 0, 0];

julia> for i in 1:3
           ifelse(true, x[1] += 1, x[1] -= 1)
       end

julia> x
3-element Vector{Int64}:
 0
 0
 0
```

When I fixed that in the code, the performance I got was the same as the code with the standard conditional, because the fundamental difference is being able (or not) to access the elements of the array unconditionally, allowing SIMD.

This one is correct and apparently slightly faster:

```julia
function cal_damage_wo_exp_ifelse!(f, x, pairs)
    λ = 1.0E-6
    @fastmath @inbounds for (i, j, d) in pairs
        str = norm(x[i] - x[j], 2) - d
        pairdam = ifelse(str > λ, 1.0 - exp(-1000.0*(str - λ)), 0.0)
        f[i] += pairdam
        f[j] += pairdam
    end
    return f
end
```

This is about 40% faster than your version and a bit faster than the versions suggested in this thread. My idea is to separate the calculation of `exp` to enable vectorization, that is, to trade some memory for speed. Tullio and LoopVectorization play nicely together:

```
# 113.245 ms (0 allocations: 0 bytes)     Original version
# 104.034 ms (0 allocations: 0 bytes)     @goerch version
#  96.728 ms (0 allocations: 0 bytes)     @lmiq version
#  75.334 ms (119 allocations: 122.49 MiB) New version
```

Here is my version:

```julia
using StaticArrays, Tullio, LoopVectorization
using LinearAlgebra    # for norm
using BenchmarkTools   # for @btime

function cal_damage_sep_loops!(f, x, pairs)
    λ = 1.0E-6
    str_vec = [ norm(x[i] - x[j]) - d for (i, j, d) in pairs ]
    @tullio exp_vec[k] := 1 - exp(-1000*(str_vec[k] - λ))
    for k = 1:length(pairs)
        pairdam = ifelse(str_vec[k] > λ, exp_vec[k], 0.0)
        i, j = pairs[k][1], pairs[k][2]
        f[i] += pairdam
        f[j] += pairdam
    end
end

function setup()
    n = 10_000       # number of particles
    npairs = 8026946 # number of pairs within cutoff
    pairs = [ (rand(1:n), rand(1:n), rand()) for _ in 1:npairs ]
    sort!(pairs)
    x = [ rand(SVector{2,Float64}) for _ in 1:n ]
    f = zeros(n)
    f, x, pairs
end

f, x, pairs = setup()
@btime cal_damage_sep_loops!($f, $x, $pairs)
```

Thanks a lot for pointing this out and explaining the mechanism behind it. I have really learned something. On my PC, this version is about 10 ms faster than conditionally updating `f`.

Thanks for your reply! Your version does improve the performance a lot. I would like to know whether this improvement has something to do with what @Oscar_Smith mentioned.
However, the memory allocation is close to that of solving a system of linear equations with 20,000 degrees of freedom. In my simulation program, the damage is calculated in every iteration, so the GC time will increase accordingly. Is there any possible solution?

To avoid accumulated memory allocations, pre-allocate the vectors `str_vec` and `exp_vec` outside the function and reuse the allocated space inside it; then you can call the function as many times as you wish and no new memory will be allocated. Use the version of `@tullio` without the `:=`, that is, just write `@tullio exp_vec[k] = ..`. And yes, what @Oscar_Smith mentioned is correct: the vectorized `exp` from `LoopVectorization` is much faster.


Thanks a lot. I tried it, but there are still extra memory allocations in

```julia
str_vec .= [norm(x[i]-x[j]) - d for (i,j,d) in pairs]
```

I donβt quite understand, there should be no allocation with the `.` operator. Then I unroll it into explicit `for` loop, then the memory allocations go to zero. Here is the final code.

```julia
using StaticArrays, Tullio, LoopVectorization, BenchmarkTools, LinearAlgebra

function cal_damage_sep_loops!(f, x, pairs, str_vec, exp_vec)
    λ = 1.0E-6
    # @time str_vec .= [norm(x[i]-x[j]) - d for (i,j,d) in pairs]
    ithpair = 0
    for (i, j, d) in pairs
        ithpair += 1
        str_vec[ithpair] = norm(x[i] - x[j], 2) - d
    end
    # @tullio exp_vec[k] := 1 - exp(-1000*(str_vec[k]-λ))
    @tullio exp_vec[k] = 1.0 - exp(-1000.0*(str_vec[k] - λ))
    for k = 1:length(pairs)
        pairdam = ifelse(str_vec[k] > λ, exp_vec[k], 0.0)
        i, j = pairs[k][1], pairs[k][2]
        f[i] += pairdam
        f[j] += pairdam
    end
end

function setup()
    n = 10_000       # number of particles
    npairs = 8026946 # number of pairs within cutoff
    pairs = [ (rand(1:n), rand(1:n), rand()) for _ in 1:npairs ]
    sort!(pairs)
    x = [ rand(SVector{2,Float64}) for _ in 1:n ]
    f = zeros(n)
    str_vec = zeros(Float64, npairs)
    exp_vec = zeros(Float64, npairs)
    f, x, pairs, str_vec, exp_vec
end

f, x, pairs, str_vec, exp_vec = setup()
@btime cal_damage_sep_loops!($f, $x, $pairs, $str_vec, $exp_vec)
```

The performance is

```
67.343 ms (0 allocations: 0 bytes)
```

The comprehension on the right-hand side allocates a new array.
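To illustrate (a minimal sketch of my own, with hypothetical helper names `g!` and `f!`): even with `.=`, a comprehension on the right-hand side is collected into a temporary array before being copied into the destination, while a fused broadcast writes into the destination directly:

```julia
b = rand(10^5)
a = similar(b)

# The comprehension builds a temporary Vector first; `.=` then copies it:
g!(a, b) = (a .= [xi^2 for xi in b]; a)

# The fused broadcast computes each element and stores it in place:
f!(a, b) = (a .= b .^ 2; a)

g!(a, b)  # allocates a temporary of the same size as b
f!(a, b)  # 0 allocations (after compilation)
```

The explicit `for` loop in the final code above avoids the temporary for the same reason.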


Thanks a lot! I get it.