# Could anyone make this run faster?

Hello!

Apologies in advance for the Unicode characters, but I really like them. I have a function given as:

```julia
function ∇ᵢWᵢⱼ(αD, q, xᵢⱼ, h)
    # Skip distances outside the support of the kernel:
    if q < 0.0 || q > 2.0
        return SVector(0.0, 0.0, 0.0)
    end

    gradWx = αD * 1/h * (5*(q-2)^3*q)/8 * (xᵢⱼ[1] / (q*h + 1e-6))
    gradWy = αD * 1/h * (5*(q-2)^3*q)/8 * (xᵢⱼ[2] / (q*h + 1e-6))
    gradWz = αD * 1/h * (5*(q-2)^3*q)/8 * (xᵢⱼ[3] / (q*h + 1e-6))

    return SVector(gradWx, gradWy, gradWz)
end
```

Benchmarking with BenchmarkTools gives the following performance:

```julia
αD   = 174
q    = 1
h    = 0.5659
xᵢⱼ  = rand(SVector{3,Float64})

@benchmark ∇ᵢWᵢⱼ($αD, $q, $xᵢⱼ, $h)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  6.100 ns … 115.700 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.300 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.382 ns ±   1.985 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █
  ▂▁▁▁▁▅▁▁▁▁▁█▁▁▁▁▁▃▁▁▁▁▁▂▁▁▁▁▁▂▁▁▁▁▁▂▁▁▁▁▁▂▁▁▁▁▁▂▁▁▁▁▁▂▁▁▁▁▂ ▂
  6.1 ns          Histogram: frequency by time         7.1 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

6.1 ns is already really fast, but any improvement helps, since this is a function I have to call extremely often. I improved it a bit by combining everything into a single factor:

```julia
function Optim∇ᵢWᵢⱼ(αD, q, xᵢⱼ, h)
    # Skip distances outside the support of the kernel:
    if q < 0.0 || q > 2.0
        return SVector(0.0, 0.0, 0.0)
    end

    Fac    = αD * 1/h * (5*(q-2)^3*q)/8 * (1/(q*h + 1e-6))

    gradWx = Fac * xᵢⱼ[1]
    gradWy = Fac * xᵢⱼ[2]
    gradWz = Fac * xᵢⱼ[3]

    return SVector(gradWx, gradWy, gradWz)
end
```

Which produces an improvement of ~8%:

```julia
@benchmark Optim∇ᵢWᵢⱼ($αD, $q, $xᵢⱼ, $h)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  5.600 ns … 149.000 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.700 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.967 ns ±   2.516 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▅                                                          ▁
  ███▅▇▅▅▄▃▄▁▁▁▁▄▁▁▁▁▁▁▃▁▄▄▅▄▄▅▁▁▁▁▁▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▅▆ █
  5.6 ns       Histogram: log(frequency) by time      20.1 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

I’ve seen some of you speed things up in ways that seem incredible to me, so it felt silly not to ask: does anyone know how to make this calculation faster?

Kind regards

3 Likes

You can add `@fastmath`. Also note that you can just multiply `Fac` by `x`, to simplify the code:

```julia
julia> function Optim∇ᵢWᵢⱼ(αD, q, xᵢⱼ, h)
           # Skip distances outside the support of the kernel:
           if q < 0.0 || q > 2.0
               return SVector(0.0, 0.0, 0.0)
           end
           Fac = @fastmath αD * 1/h * (5*(q-2)^3*q)/8 * (1/(q*h + 1e-6))
           return Fac * xᵢⱼ
       end
Optim∇ᵢWᵢⱼ (generic function with 1 method)
```

That’s slightly faster here.

2 Likes

Probably not faster, but more elegant and general:

```julia
function Optim∇ᵢWᵢⱼ(αD, q, xᵢⱼ, h)
    # Skip distances outside the support of the kernel:
    if 0 <= q <= 2
        Fac = αD*5*(q-2)^3*q / (8h*(q*h + 1e-6))
    else
        Fac = 0.0  # or return zero(xᵢⱼ)
    end
    return Fac .* xᵢⱼ
end
```
1 Like

I would guess you are approaching the theoretical maximum performance for this one function, especially after `@fastmath`. It’s only 2.5 ns on my machine.

So most of the remaining gains will be in how you can parallelize it in the calling function.

For example running it on a GPU can be easily 30x faster for a large array:

```julia
julia> using CUDA

julia> xᵢⱼ = [rand(SVector{3,Float64}) for i in 1:100_000_000];

julia> cu_xᵢⱼ = CuArray(xᵢⱼ);

julia> @benchmark Optim∇ᵢWᵢⱼ.($αD, $q, $xᵢⱼ, $h)
BenchmarkTools.Trial: 6 samples with 1 evaluation.
 Range (min … max):  808.936 ms …    1.161 s  ┊ GC (min … max):  0.17% … 17.53%
 Time  (median):     861.737 ms               ┊ GC (median):     9.89%
 Time  (mean ± σ):   910.588 ms ± 131.796 ms  ┊ GC (mean ± σ):  10.66% ±  8.91%

  █ █     ██             █                                    █
  █▁█▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  809 ms           Histogram: frequency by time          1.16 s <

 Memory estimate: 2.24 GiB, allocs estimate: 2.

julia> @benchmark CUDA.@sync Optim∇ᵢWᵢⱼ.($αD, $q, $cu_xᵢⱼ, $h)
BenchmarkTools.Trial: 160 samples with 1 evaluation.
 Range (min … max):  21.589 ms … 33.286 ms  ┊ GC (min … max): 0.00% … 8.57%
 Time  (median):     31.562 ms              ┊ GC (median):    8.11%
 Time  (mean ± σ):   31.223 ms ±  1.467 ms  ┊ GC (mean ± σ):  8.03% ± 1.18%

  ▂▁    ▂▄█▆
  ▃▁▁▁▁▃▁▁▁▁▁▁▁▁▃▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▆███▅▄▅█████▄▅▃▃ ▃
  21.6 ms         Histogram: frequency by time        33.1 ms <

 Memory estimate: 4.80 KiB, allocs estimate: 85.
```
2 Likes

Thank you both, @lmiq and @DNF!

I tried the `@fastmath` macro, but it did not seem to have any effect on my system, so I removed it again.

I gave your version a try, DNF, and it is actually much faster!

```julia
function Optim∇ᵢWᵢⱼ(αD, q, xᵢⱼ, h)
    # Skip distances outside the support of the kernel:
    if 0 < q < 2
        Fac = αD*5*(q-2)^3*q / (8h*(q*h + 1e-6))
    else
        Fac = 0.0  # or return zero(xᵢⱼ)
    end
    return Fac .* xᵢⱼ
end

@benchmark Optim∇ᵢWᵢⱼ($αD, $q, $xᵢⱼ, $h)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  2.500 ns … 31.300 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.700 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.863 ns ±  1.398 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █
  ▂▁▁▁▁▁▃▁▁▁▁▁█▁▁▁▁▁▁▅▁▁▁▁▁▂▁▁▁▁▁▁▄▁▁▁▁▁▂▁▁▁▁▁▁▂▁▁▁▁▁▂▁▁▁▁▁▂ ▂
  2.5 ns         Histogram: frequency by time         3.4 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

This is about a 50% speed-up compared to the baseline. I am not sure why the difference is this big; perhaps the way you combine the constants makes it easier for the compiler?

Kind regards

2 Likes

Thank you for this!

One of my aims is also running my particle simulation code (of which the function above is a part) on the GPU, but part of that journey is making sure I fully understand how the CPU version works.

I have already tried a GPU version, but I wrote it so badly that it ended up 10x slower than the CPU. I will make sure to ask when I am ready to try that again.

Kind regards

1 Like

This version will still be much slower for small loads on a GPU - you need a large task to see the kind of gains I posted.

3 Likes

Probably that’s because fewer divisions are performed. Divisions are expensive.
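For instance, a division inside a hot loop can often be traded for a multiplication by a precomputed reciprocal. A minimal sketch (the name `invh` is mine, not from the code above):

```julia
h    = 0.5659
invh = 1 / h    # reciprocal computed once, outside any hot loop
q    = 1.3

# q / h and q * invh agree up to floating-point rounding, but the
# multiplication is typically several times cheaper than the division
a = q / h
b = q * invh
```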

2 Likes

Perhaps it’s that mine has just one division instead of three. And the `@fastmath` macro perhaps gives the compiler permission to reorder the expression similarly.

2 Likes

Great, thank you all

As a final note, @DNF, using the form you provided, my test case went from 30 to ~21–22 minutes, so roughly a 30% improvement, which was really nice!

Kind regards

4 Likes

In an effort to make the calculation even faster, what can you tell us about the values of `h`? Are they different in every call, or do they represent a stable parameter or hyperparameter?

As for `q`, what is its meaning? When does it get very small i.e. < 1e-5 ?

1 Like

`h` is in this case a constant parameter called the “smoothing length”, which determines the “radius of influence” that material points have on each other in Smoothed Particle Hydrodynamics.

The normalized distance `q` is then defined as `r/h`, where `r` is the Euclidean distance between two points.
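As a small sketch of that definition (the positions are made up, and I use plain tuples instead of `SVector` to keep it self-contained):

```julia
h  = 0.5659                      # smoothing length (constant parameter)
xᵢ = (0.0, 0.0, 0.0)             # position of particle i (hypothetical)
xⱼ = (0.5, 0.3, 0.1)             # position of particle j (hypothetical)
r  = sqrt(sum((xᵢ .- xⱼ) .^ 2))  # Euclidean distance between the points
q  = r / h                       # normalized distance; kernel support is 0 ≤ q ≤ 2
```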

`q` goes into a kernel function to estimate the “amount of influence”; the kernel is defined in such a way that a material point only influences another if the normalized distance is between 0 and 2, i.e. `0 < q < 2`.

This condition is in a sense already fulfilled by the neighbour-searching algorithm, but as a safety precaution I like keeping it in.

`q` gets extremely small when the same material point is considered twice, or when two points are exceptionally close and thereby have a large influence on each other.

Kind regards

How common is the case when `q` is not between 0 and 2? If it is quite rare, you may be able to get further speedups by removing the branch and using LoopVectorization.jl:

```julia
function Optim∇ᵢWᵢⱼ(αD, q, xᵢⱼ, h)
    Fac = ifelse(0 < q < 2, αD*5*(q-2)^3*q / (8h*(q*h + 1e-6)), 0.0)
    return Fac .* xᵢⱼ
end
```

And then use something like (don’t know exactly what you want here)

```julia
@tturbo for k in ...
    xᵢⱼ[k] = Optim∇ᵢWᵢⱼ(αD, q[k], xᵢⱼ[k], h)
end
```
1 Like

Actually, this may include a branch. Not sure.

In theory it should “never” happen, since the neighbour list would throw out anything above 2, so only the condition where `q` is close to zero is realistically possible. Still, I like to keep the check in.

I tried to remove it and didn’t see any performance gain.

Kind regards

The gain would come when iterating over this in a hot loop. Once all branches are removed it can be SIMD-vectorized, with potentially huge performance improvements.
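A minimal sketch of such a hot loop (the `qs` array, `facs` output, and `kernel_fac` helper are my own illustration, assuming one precomputed `q` per particle pair):

```julia
# Branch-free kernel factor: `ifelse` evaluates both arms, but the loop
# body contains no branch, so the compiler is free to vectorize it.
kernel_fac(αD, q, h) = ifelse(0 < q < 2, αD*5*(q-2)^3*q / (8h*(q*h + 1e-6)), 0.0)

αD, h = 174.0, 0.5659
qs    = [0.5, 1.0, 1.5, 2.5]   # hypothetical precomputed q values
facs  = similar(qs)

@inbounds @simd for k in eachindex(qs)
    facs[k] = kernel_fac(αD, qs[k], h)
end
```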

6 Likes

Okay got it - will take note of that.

There is another gain to squeeze out, and that is removing the remaining division. The `q` factor would cancel if it weren’t for the `1e-6` epsilon that makes the interaction go to zero at `q = 0`. But breaking the function at another point close to zero allows an approximation with no division.

I didn’t really find a best approximation, but the final function should look something like this:

```julia
function Optim∇2(αD, q, invh)
    Fac = ifelse(q < 2,
                 ifelse(q < 2e-6, -αD*5*q*invh*1e6, αD*5*(q-2)^3*0.125*invh^2),
                 0.0)
    return Fac
end
```

This is for testing; the real `Optim` multiplies the scalar with `xᵢⱼ`. Also, I suppose `q` never really goes below zero (it is a Euclidean distance). In any case, this version uses `invh`, i.e. `1/h`, which can be calculated outside the hot loop.

```julia
julia> lineplot(q->Optim∇2(αD,q,ih), 0.0,0.0001)
┌────────────────────────────────────────┐
0 │⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ #87(x)
│⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
f(x)        │⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⢣⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠸⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤│
-3 000 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
└────────────────────────────────────────┘
⠀0⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀0.0001⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀

julia> lineplot(q->Optim∇1(αD,q,h), 0.0,0.0001)
┌────────────────────────────────────────┐
0 │⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│ #89(x)
│⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⢣⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⢸⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
f(x)        │⢸⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⢱⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⢣⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠑⢄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠈⠑⠒⠦⠤⠤⣀⣀⣀⣀⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
│⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠓⠒⠒⠒⠒⠒⠒⠒⠒⠒⠒⠒│
-3 000 │⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀│
└────────────────────────────────────────┘
⠀0⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀0.0001⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀x⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
``````

This is a comparison chart. Note the harsher knee with the division-free approximation. The softer knee can be recovered with a quadratic approximation for small `q`s (still no division). Anyway, this certainly will not give a 10x increase.

BTW if you are running on an M1 CPU Mac, the difference might be even smaller, as division seems to be quite fast on it.

5 Likes

In terms of the actual particle simulation, there are two options: update the pair list at every step, or not.

It is common to compute the list up to a distance greater than the actual cutoff, and not update it at every step. In that case you need to compute the distance again in the force loop and check it against the cutoff. This is the easiest strategy if you later want to port the computation of the forces to the GPU, for example. This is what Molly does.

Alternatively, you can update the list at every step. In that case you don’t need to check for the distance again.

The first is usually faster, but the particles do not always move at a pace that justifies skipping updates for a reasonable interval.
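One common criterion for when the enlarged list must be rebuilt (a sketch with my own names, not CellListMap’s actual API): build the list with the cutoff enlarged by a “skin” margin, and rebuild only once some particle has moved more than half the skin since the last build.

```julia
# Verlet-list-style criterion: the enlarged list stays valid as long as
# no particle has moved more than skin/2 since the list was last built.
function needs_update(x, x_at_build, skin)
    maxdisp² = 0.0
    for (xi, xi0) in zip(x, x_at_build)
        maxdisp² = max(maxdisp², sum((xi .- xi0) .^ 2))
    end
    return maxdisp² > (skin / 2)^2
end
```

Here `x` and `x_at_build` are collections of positions (tuples or `SVector`s); comparing squared distances avoids a square root per particle.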

3 Likes

Yep, that is also a very valid approach! I cannot implement it yet, though, since I do not know how to estimate when particles have “roughly” left a box.

I am not entirely sure how you handle it in the CellListMap package, but I have not noticed any significant delay from updating the neighbour list at each time step (even at 1e-5 increments).

If you know of some material to read about when to update the list, I would love to see it. It is interesting how much this has in common with molecular simulations.

Kind regards