# Performance improvements for simple molecular surface calculation

I have a fairly straightforward calculation I’ve been playing around with, and it seems like a good opportunity to learn a little more about typical optimizations people reach for, as well as coding style.

At the core, we are applying a function to a grid of points:

``````
using LinearAlgebra

gaussian_kernel = function (x, c; B=-0.5, r=1)
    κ = B / r^2
    val = 0.0
    for i in 1:length(c)
        val += exp(κ * ((norm(x - c[i])) - r^2))
    end
    return val
end

grid_size = 50;
x_range = range(-2., 2.; length=grid_size);
y_range = range(-2., 2.; length=grid_size);
z_range = range(-2., 2.; length=grid_size);

grid = [collect(x) for x in Iterators.product(x_range, y_range, z_range)];

atoms = [randn(3) for i in 1:100];

gaussian_kernel(grid[1], atoms)
@time map(x -> gaussian_kernel(x, atoms; r=0.6), grid);
``````

Any suggestions for a more Julian way of writing this? I feel like I’m missing some key concepts that are impacting performance.

Otherwise, using a threaded map (`tmap` from `ThreadTools.jl`) does improve things, but the speedup is not linear in the number of cores.
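For reference, a threaded version can also be written with just `Base.Threads`; this is a minimal sketch (the function name `threaded_kernel_map` is made up, and it assumes Julia was started with multiple threads, e.g. `julia -t 4`, with `gaussian_kernel`, `grid`, and `atoms` defined as above):

```julia
using Base.Threads

# Minimal threaded map over the grid using only Base.Threads
# (an alternative to ThreadTools.tmap); writes into a preallocated
# output array so each thread touches disjoint indices.
function threaded_kernel_map(grid, atoms; r=0.6)
    out = similar(grid, Float64)
    @threads for i in eachindex(grid)
        out[i] = gaussian_kernel(grid[i], atoms; r=r)
    end
    return out
end
```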


Your code seems pretty fine to me. You can, however, explore the use of StaticArrays. For example:

``````
# Your code above this

map(x -> gaussian_kernel(x, atoms; r=0.6), grid); # obs: first run compiles
@time map(x -> gaussian_kernel(x, atoms; r=0.6), grid);

using StaticArrays
atoms = [ @SVector randn(3) for i in 1:100];
map(x -> gaussian_kernel(x, atoms; r=0.6), grid); # obs: first run compiles
@time map(x -> gaussian_kernel(x, atoms; r=0.6), grid);
``````
``````
# your code:
0.786504 seconds (12.68 M allocations: 1.310 GiB, 1.23% gc time)
# Static arrays:
0.208004 seconds (191.20 k allocations: 6.268 MiB)

``````

`x - c[i]` allocates a temporary 3-component array in your inner loop, which is going to hurt performance a lot. As @lmiq suggested, 3-component coordinate vectors are a perfect case for `SVector` from the StaticArrays package.

Also, note that `gaussian_kernel` is a non-`const` global (as opposed to defining it in the usual way via `function gaussian_kernel(...)`), which makes calls to it type-unstable.

Finally, use `@btime` from the BenchmarkTools package for more accurate timing, and use `$` to interpolate global variables when benchmarking.

The huge number of allocations here still indicates a problem: possibly the type instability from the non-`const` definition of `gaussian_kernel`. Also, did you forget to use `SVector` for `grid`?

(If your innermost loops are allocating memory on the heap, you are leaving a lot of performance on the table. You wouldn’t call `malloc` in your innermost loop in a C program.)


This is about 6x faster (by using `SVector` and fixing type instabilities):

``````
using StaticArrays, LinearAlgebra, BenchmarkTools

function gaussian_kernel(x, c; B=-0.5, r=1.0)
    κ = B / r^2
    val = 0.0
    for i in 1:length(c)
        val += exp(κ * ((norm(x - c[i])) - r^2))
    end
    return val
end

grid_size = 50
x_range = range(-2., 2.; length=grid_size)
y_range = range(-2., 2.; length=grid_size)
z_range = range(-2., 2.; length=grid_size)

grid = [SVector(x[1],x[2],x[3]) for x in Iterators.product(x_range, y_range, z_range)]
atoms = [ @SVector randn(3) for i in 1:100 ]

@btime map(x -> gaussian_kernel(x, $atoms; r=0.6), $grid);
``````

gives

``````  191.396 ms (2 allocations: 976.64 KiB)
``````

versus `1.143 s (12500002 allocations: 1.30 GiB)` for `@btime map(x -> $gaussian_kernel(x, $atoms; r=0.6), $grid)` with your original code on my laptop.


I didn’t, but in my test it didn’t make a difference.

What did you do there to fix the type instability? I noticed there was one, but couldn't figure out where it was or how to fix it.

Edit: I see, you changed `gaussian_kernel = function...` to `function gaussian_kernel`. I like that. (Not that I like this option more, but `const gaussian_kernel = function ...` also solves that type instability, and helps me understand the problem.)
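A minimal illustration of the three options (the names `f`, `g`, `h` are made up for the example):

```julia
# Three ways to bind a function, and which ones are type-stable:
f = function (x); 2x; end        # non-const global binding: type-unstable at call sites
const g = function (x); 2x; end  # const binding: the type of g is fixed, type-stable
h(x) = 2x                        # the usual definition: also type-stable, and preferred
```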


Thank you both! I had the feeling there was some Julia form — and some magic — that I was missing. It isn’t entirely clear to me why `x - c[i]` allocates memory. This is the sort of performance issue I’d love to be able to spot and rectify.


The `x-c[i]` in principle allocates memory because it creates a new vector, which is then passed to the `norm` function. It is like doing

``````
v = x - c[i]
norm(v)
``````

You are computing the distance between `x` and `c[i]`, thus you could write a distance function, which does not allocate the vector difference first:

``````
dist(x,y) = sqrt((x[1]-y[1])^2+(x[2]-y[2])^2+(x[3]-y[3])^2)
``````

Note that this is faster and allocates less memory than applying the `norm` function to the difference:

``````
julia> using BenchmarkTools

julia> x = rand(3) ; y = rand(3) ;

julia> @btime dist($x,$y)
  3.092 ns (0 allocations: 0 bytes)
0.6698475658180884

julia> @btime norm($x-$y)
  40.606 ns (1 allocation: 112 bytes)
0.6698475658180884

``````

(The allocation occurs in the `x-y` part, not in the `norm` function itself.)

Thus, from a `Julia` perspective, two important things happened here:

1. You CAN write your own code and not rely on functions from packages. This will be fast, and if the pre-programmed function does not do exactly what you need, your own implementation may even be faster. Of course, there are highly optimized functions for a lot of things, which will sometimes beat your own implementations, but it is always a good idea to test.
2. Since Julia is a dynamically-typed language, be careful with how you define variables. By defining the function as `f = function ... end` you introduced a new global variable `f` that currently holds the function you want, but could later be assigned something else. This results in a type instability (the type of `f` is not fixed: it is a function now, but might be something else later in the code), which impairs performance. It is therefore better to guarantee that things have constant types. `function f() ... end` and `const f = function ... end` are both ways to do that, with the more natural `function f(x) ... end` syntax usually preferred.

The use of StaticArrays is more of a niche optimization: these arrays are laid out in memory differently and result in faster code, but they are not always useful. In particular, they shine when dealing with lots of small arrays. You will probably want to test whether they improve performance in your specific case.


If you use `SVector` it essentially does this for you.
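A quick check of that claim (a sketch; timings will vary by machine):

```julia
using StaticArrays, LinearAlgebra, BenchmarkTools

# With SVectors, x - y is itself a stack-allocated SVector, so
# norm(x - y) should report 0 allocations, matching a hand-written dist.
x = @SVector rand(3)
y = @SVector rand(3)
@btime norm($x - $y)   # 0 allocations expected
```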


Nunca a la cama te irás sin saber una cosa más. ("You will never go to bed without learning one more thing.")

Edit: as you can see, I am in the process of learning these things now. You will even find similar questions I asked just a few weeks ago if you search. And I cannot be more thankful to the community, with @stevengj being particularly helpful in this process.

Edit2:

Complete test of custom dist and LinearAlgebra.norm, with static or non-static arrays:

``````
using BenchmarkTools
using StaticArrays
using LinearAlgebra

dist(x,y) = sqrt((x[1]-y[1])^2+(x[2]-y[2])^2+(x[3]-y[3])^2)

function fdist(x,y)
    d = 0.
    for i in 1:length(x)
        d = d + dist(x[i],y[i])
    end
    return d
end

function fnorm(x,y)
    d = 0.
    for i in 1:length(x)
        d = d + norm(x[i]-y[i])
    end
    return d
end

println("'Normal' arrays:")
x = [ rand(3) for i in 1:100 ] ; y = [ rand(3) for i in 1:100 ] ;
println("dist:")
@btime fdist($x,$y)
println("norm:")
@btime fnorm($x,$y)

println("Static arrays:")
x = [ @SVector rand(3) for i in 1:100 ] ; y = [ @SVector rand(3) for i in 1:100 ] ;
println("dist:")
@btime fdist($x,$y)
println("norm:")
@btime fnorm($x,$y)

``````

Result:

``````
'Normal' arrays:
dist:
  309.860 ns (0 allocations: 0 bytes)
norm:
  3.820 μs (100 allocations: 10.94 KiB)
Static arrays:
dist:
  171.585 ns (0 allocations: 0 bytes)
norm:
  172.393 ns (0 allocations: 0 bytes)

``````

It’s such a great community. I’m still in the process of learning to “play the compiler”, and it’s very different from how I’m used to programming.

Again, thank you both for your help and insight here.


I was using your example to learn these things as well, so I tried all the possibilities. The results are:

``````
'Normal' arrays
(1) f = function(...norm
743.481 ms (12625004 allocations: 1.31 GiB)
(2) function gauss(... norm
739.420 ms (12500002 allocations: 1.30 GiB)
(3) f = function(...dist
178.855 ms (125004 allocations: 2.86 MiB)
(4) function gauss(...dist
174.363 ms (2 allocations: 976.64 KiB)
Static arrays
(5) f = function(...norm
164.523 ms (250004 allocations: 6.68 MiB)
(6) function gauss(... norm
160.820 ms (2 allocations: 976.64 KiB)
(7) f = function(...dist
165.860 ms (250004 allocations: 6.68 MiB)
(8) function gauss(...dist
157.895 ms (2 allocations: 976.64 KiB)

``````

Conclusions:

1. The use of `norm` without static arrays is a disaster (case 1): huge memory allocations and slow execution.
2. By replacing the `norm` function with a simple `dist` function (compare 1 and 3), the code is already quite fast, but still allocates too much.
3. The remaining allocations are eliminated by fixing the type instability (compare 3 and 4).
4. Using static arrays improves all cases, but in the end (in this case) provides only a marginal (perhaps not significant) improvement over the code without type instabilities and with the custom `dist` (compare 4 and 8).

Takeaway message (for me): if you at least take care not to have type instabilities and unnecessary allocations, it is likely that the code is already close to fast code (not considering algorithmic improvements, of course). For me in particular this is an important result, because I am used to programming in Fortran, so I usually do not fail to provide types or avoid unnecessary allocations (Fortran more or less forces you not to). Using Julia as a Fortran programmer can therefore only bring me benefits, along with everything else the language offers that Fortran does not.

The code is:

``````
using LinearAlgebra
using BenchmarkTools
using StaticArrays

dist(x,y) = sqrt((x[1]-y[1])^2+(x[2]-y[2])^2+(x[3]-y[3])^2)

old = function(x, c; B :: Float64 = -0.5, r :: Float64 = 1.0)
    κ = B / r^2
    val = 0.0
    for i in 1:length(c)
        val += exp(κ * ((norm(x - c[i])) - r^2))
    end
    return val
end

old_dist = function(x, c; B :: Float64 = -0.5, r :: Float64 = 1.0)
    κ = B / r^2
    val = 0.0
    for i in 1:length(c)
        val += exp(κ * (dist(x,c[i]) - r^2))
    end
    return val
end

function gaussian_kernel(x, c; B :: Float64 = -0.5, r :: Float64 = 1.0)
    κ = B / r^2
    val = 0.0
    for i in 1:length(c)
        val += exp(κ * ((norm(x - c[i])) - r^2))
    end
    return val
end

function gaussian_kernel_dist(x, c; B :: Float64 = -0.5, r :: Float64 = 1.0)
    κ = B / r^2
    val = 0.0
    for i in 1:length(c)
        val += exp(κ * (dist(x,c[i]) - r^2))
    end
    return val
end

grid_size = 50;
x_range = range(-2., 2.; length=grid_size);
y_range = range(-2., 2.; length=grid_size);
z_range = range(-2., 2.; length=grid_size);

println("'Normal' arrays")

grid = [collect(x) for x in Iterators.product(x_range, y_range, z_range)];
atoms = [randn(3) for i in 1:100];

println(" f = function(...norm ")
@btime map(x -> old(x, $atoms; r=0.6), $grid);
println(" function gauss(... norm ")
@btime map(x -> gaussian_kernel(x, $atoms; r=0.6), $grid);
println(" f = function(...dist ")
@btime map(x -> old_dist(x, $atoms; r=0.6), $grid);
println(" function gauss(...dist ")
@btime map(x -> gaussian_kernel_dist(x, $atoms; r=0.6), $grid);

println("Static arrays")
atoms = [ @SVector randn(3) for i in 1:100];
grid = [ SVector(collect(x)...) for x in Iterators.product(x_range, y_range, z_range)];

println(" f = function(...norm ")
@btime map(x -> old(x, $atoms; r=0.6), $grid);
println(" function gauss(... norm ")
@btime map(x -> gaussian_kernel(x, $atoms; r=0.6), $grid);
println(" f = function(...dist ")
@btime map(x -> old_dist(x, $atoms; r=0.6), $grid);
println(" function gauss(...dist ")
@btime map(x -> gaussian_kernel_dist(x, $atoms; r=0.6), $grid);

println("end")
``````

I know I’m a broken record on LoopVectorization.jl, but I added it to your script. Results:

``````
'Normal' arrays
f = function(...norm
646.984 ms (12625004 allocations: 1.31 GiB)
function gauss(... norm
641.349 ms (12500002 allocations: 1.30 GiB)
f = function(...dist
145.499 ms (125004 allocations: 2.86 MiB)
function gauss(...dist
141.914 ms (2 allocations: 976.64 KiB)
Loop Vectorization
18.552 ms (2 allocations: 976.64 KiB)
Static arrays
f = function(...norm
138.706 ms (250004 allocations: 6.68 MiB)
function gauss(... norm
135.619 ms (2 allocations: 976.64 KiB)
f = function(...dist
136.244 ms (250004 allocations: 6.68 MiB)
function gauss(...dist
134.148 ms (2 allocations: 976.64 KiB)
end
``````

Over 7x faster than the next fastest version, which is about in line with what you’d expect for SIMD vs non-SIMD code on a computer with AVX512.

New script:

``````
using LinearAlgebra
using BenchmarkTools
using StaticArrays
using LoopVectorization

dist(x,y) = sqrt((x[1]-y[1])^2+(x[2]-y[2])^2+(x[3]-y[3])^2)

old = function(x, c; B :: Float64 = -0.5, r :: Float64 = 1.0)
    κ = B / r^2
    val = 0.0
    for i in 1:length(c)
        val += exp(κ * ((norm(x - c[i])) - r^2))
    end
    return val
end

old_dist = function(x, c; B :: Float64 = -0.5, r :: Float64 = 1.0)
    κ = B / r^2
    val = 0.0
    for i in 1:length(c)
        val += exp(κ * (dist(x,c[i]) - r^2))
    end
    return val
end

function gaussian_kernel(x, c; B :: Float64 = -0.5, r :: Float64 = 1.0)
    κ = B / r^2
    val = 0.0
    for i in 1:length(c)
        val += exp(κ * ((norm(x - c[i])) - r^2))
    end
    return val
end

function gaussian_kernel_dist(x, c; B :: Float64 = -0.5, r :: Float64 = 1.0)
    κ = B / r^2
    val = 0.0
    for i in 1:length(c)
        val += exp(κ * (dist(x,c[i]) - r^2))
    end
    return val
end

function gaussian_kernel_loopvec(x, c; B :: Float64 = -0.5, r :: Float64 = 1.0)
    κ = B / r^2
    val = 0.0
    @avx for i in axes(c,1)
        d² = 0.0
        for j ∈ 1:3
            d² += abs2(x[j] - c[i,j])
        end
        val += exp(κ * (sqrt(d²) - r^2))
    end
    return val
end

grid_size = 50;
x_range = range(-2., 2.; length=grid_size);
y_range = range(-2., 2.; length=grid_size);
z_range = range(-2., 2.; length=grid_size);

println("'Normal' arrays")

grid = [collect(x) for x in Iterators.product(x_range, y_range, z_range)];
atoms = [randn(3) for i in 1:100];
atomsmat = reduce(vcat, adjoint.(atoms)); # atoms as a 100×3 matrix for the @avx loop

@assert map(x -> gaussian_kernel_dist(x, atoms; r=0.6), grid) ≈ map(x -> gaussian_kernel_loopvec(x, atomsmat; r=0.6), grid)

println(" f = function(...norm ")
@btime map(x -> old(x, $atoms; r=0.6), $grid);
println(" function gauss(... norm ")
@btime map(x -> gaussian_kernel(x, $atoms; r=0.6), $grid);
println(" f = function(...dist ")
@btime map(x -> old_dist(x, $atoms; r=0.6), $grid);
println(" function gauss(...dist ")
@btime map(x -> gaussian_kernel_dist(x, $atoms; r=0.6), $grid);

println("Loop Vectorization")
@btime map(x -> gaussian_kernel_loopvec(x, $atomsmat; r=0.6), $grid)

println("Static arrays")
atoms = [ @SVector randn(3) for i in 1:100];
grid = [ SVector(collect(x)...) for x in Iterators.product(x_range, y_range, z_range)];

println(" f = function(...norm ")
@btime map(x -> old(x, $atoms; r=0.6), $grid);
println(" function gauss(... norm ")
@btime map(x -> gaussian_kernel(x, $atoms; r=0.6), $grid);
println(" f = function(...dist ")
@btime map(x -> old_dist(x, $atoms; r=0.6), $grid);
println(" function gauss(...dist ")
@btime map(x -> gaussian_kernel_dist(x, $atoms; r=0.6), $grid);

println("end")
``````

This is fantastic. I tried LoopVectorization there, but clearly I have to learn how to rewrite the loop to take advantage of it. As the loop was written, I was unable to get any benefit from the `@avx` macro.

At some point I should expand on the README / some introductory material on how to write loops friendly to SIMD in general (and hopefully by extension LoopVectorization as well).

A key for SIMD in general is that you want to be able to do multiple of the same operations in parallel. That makes the loops across the grid and across the atoms good candidates.

The loop for distances is not: we have 3 dependent iterations.

For simplicity’s sake, I chose the atom loop inside `gaussian_kernel`. The next step is to make sure your data is organized in a suitable fashion.
Contiguous loads are fastest, so I made an `n_atom x n_features` (i.e. 100 x 3) matrix:

``````
atomsmat = reduce(vcat, adjoint.(atoms));
``````

(Certainly not the fastest way to build that matrix in terms of runtime, so if these transformations happen often, I’d suggest re-implementing it.)

Now the loop can efficiently load multiple atoms at a time.
So now I just made the corresponding adjustments to the code, verified the answer was correct, and ran the benchmarks.

7 Likes