Nerd-sniping: can you make this faster?

lmiq · September 30, 2025, 8:10pm

This is the core function of a module to compute solvent accessible surface areas of molecules.
It works fine, but it would benefit of being faster. Without a full reworking of the data structures, etc, performance boils down to how fast this low-level function can run.

Rules: do not multi-thread. This already happens at a higher level.

Note that exposed_i is a BitArray, which has to be updated.

using BenchmarkTools
using StaticArrays
#
# Input data
#
function data()
    x = rand(SVector{3,Float32}) 
    y = rand(SVector{3,Float32})
    dot_cache_i = [rand(SVector{3,Float32}) for _ in 1:200]
    exposed_i=trues(200)
    rj_sq=0.1
    return x, y, dot_cache_i, exposed_i, rj_sq
end

#
# function to accelerate
#
function update_dot_exposure!(x, y, dot_cache_i, exposed_i, rj_sq)
    for idot in eachindex(exposed_i)
        if exposed_i[idot]
            dot_on_surface = x + dot_cache_i[idot]
            # Position the dot on the atom's surface in the molecule's coordinate system
            # Check if the dot is inside the neighboring atom j
            if sum(abs2, dot_on_surface - y) < rj_sq
                exposed_i[idot] = false
            end
        end
    end
    return nothing
end

I get:

julia> @benchmark update_dot_exposure!($(data())...)
BenchmarkTools.Trial: 10000 samples with 737 evaluations per sample.
 Range (min … max):  171.301 ns … 206.906 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     177.859 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   177.764 ns ±   1.354 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                ▂  ▁▁▁▂▂▂▂▂▂▂▃▂▁ ▅█  ▂▂▂▃▃                      ▁
  ▅▁▁▁▁▃▁▃▃▄▁▃▁▇███████████████████████████▇▆▆▆▆▇▇▆▅▅▆▅▅▅▇▅▆▅▅▅ █
  171 ns        Histogram: log(frequency) by time        184 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

abraemer · September 30, 2025, 8:40pm

A couple of observations:

I am on mobile so can’t run tests
we can pull out the computation of x-y

function update_dot_exposure!(x, y, dot_cache_i, exposed_i, rj_sq)
    delta = x - y
    for idot in eachindex(exposed_i)
        if exposed_i[idot]
            # Position the dot on the atom's surface in the molecule's coordinate system
            # Check if the dot is inside the neighboring atom j
            if sum(abs2, delta + dot_cache[idot]) < rj_sq
                exposed_i[idot] = false
            end
        end
    end
    return nothing
end

in your test data, all exposed_i are true so the branch is always taken. How realisitic is that, i.e. how large is the fraction of values that actually needs updating?
if you usually update most of the array then I’d expect this to perform better because it is branchless:

function update_dot_exposure!(x, y, dot_cache_i, exposed_i, rj_sq)
    delta = x - y
    for idot in eachindex(exposed_i)
        exposed_i[idot] = exposed_i[idot] & sum(abs2, delta + dot_cache[idot]) > rj_sq
    end
    return nothing
end

how constant is dot_cache? If it changes rarely then you can rephrase your loop as nearest neighbour search around the point x-y and there exist fast data structures for this kind of lookup (usually some form of tree). Precomputing this in this function however will make it slower
this brings me to last point: It probably not easy/worth it to keyhole-optimize this function. Large gains are unlikely. For a more serious attempt at optimization we need more context to maybe find better data structures (see idea with nearest neighbor search in the previous point)

lmiq · September 30, 2025, 9:37pm

Thanks, the pulling out of x - y is a clear gain.

The branch is necessary, the fraction of falses can be large, and I already tried unbranched versions, but they are much slower.

Al that is already within the context of a nearest neighbor search, and that function is called for the pairs of particles that are within the cutoff of that search. Thus, this is not a “premature” optimization, I’m trying to get the most of the details now.

matthias314 · September 30, 2025, 9:39pm

EDIT: exposed_i should be a BitVector, but that’s not true below. Changing it to BitVector leads to much worse performance, at least without further changes.

With

exposed_i= rand(Bool, 200)  # more realistic?
rj_sq= Float32(0.1)  # avoids conversion later on

in data() and writing @abraemer’s function as

function update_dot_exposure!_abraemer(x, y, dot_cache_i, exposed_i, rj_sq)
    exposed_i .&= sum.(abs2, Ref(x-y) .+ dot_cache_i) .> rj_sq
    return nothing
end

I get

julia> @b update_dot_exposure!($(data())...)
1.304 μs

julia> @b update_dot_exposure!_abraemer($(data())...)
87.460 ns

A question: Is 200 a typical length for the data vectors or can they be longer?

Zentrik · September 30, 2025, 9:42pm

If you’re willing to change dot_cache_i to a SOA form (AoS and SoA - Wikipedia), I get a 6x speedup (I also changed exposed_i to a Vector{Bool} as I couldn’t be bothered to write the code for BitArray and added @fastmath but I don’t think that adds much here).
This doesn’t handle the case where the length is not a multiple of 8, but that should be fairly cheap to handle.
Also, I’m sure this can be sped up more.

using StaticArrays, SIMD
function data_soa()
    x = rand(SVector{3,Float32}) 
    y = rand(SVector{3,Float32})
    dot_cache_x = rand(Float32, 200)
    dot_cache_y = rand(Float32, 200)
    dot_cache_z = rand(Float32, 200)
    exposed_i=Vector{Bool}(trues(200))
    rj_sq=0.1
    return x, y, dot_cache_x, dot_cache_y, dot_cache_z, exposed_i, rj_sq
end

@fastmath function update_dot_exposure_v3!(x, y, dot_cache_x, dot_cache_y, dot_cache_z, exposed_i, rj_sq)
    delta = x - y

    N = 8
    # Assume length(exposed_i) is multiple of N
    lane = VecRange{N}(0)

    @inbounds @fastmath for i in 1:N:length(exposed_i)
        if any(exposed_i[lane + i])
            pos_x = dot_cache_x[lane + i] + delta[1]
            pos_y = dot_cache_y[lane + i] + delta[2]
            pos_z = dot_cache_z[lane + i] + delta[3]

            dist_sq = pos_x*pos_x + pos_y*pos_y
            dist_sq += pos_z*pos_z

            outside = dist_sq >= rj_sq
            exposed_i[lane + i] = outside
        end
    end
    return exposed_i
end

julia> @benchmark update_dot_exposure!($(data())...)
BenchmarkTools.Trial: 10000 samples with 234 evaluations per sample.
 Range (min … max):  318.765 ns …  1.008 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     322.624 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   325.867 ns ± 19.457 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂▇██▇▃▂▁▂▂▂▁▁▁▁                                              ▂
  █████████████████▇▇▇▆▆▇▇▆▇▇▇▆▆▆▆▆▇▆▆▇▆▅▅▅▄▆▅▅▆▅▅▄▆▄▄▅▁▄▅▄▃▄▅ █
  319 ns        Histogram: log(frequency) by time       391 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark update_dot_exposure_v3!($(data_soa())...)
BenchmarkTools.Trial: 10000 samples with 984 evaluations per sample.
 Range (min … max):  55.512 ns … 220.020 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     56.571 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   57.810 ns ±   6.244 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █▃▁▁▁▁▁                                                     ▁
  █████████▇▇▇██▇▆▆▆▅▆▇▆▆▅▄▅▅▅▄▃▆▃▅▅▅▄▃▅▄▁▃▁▄▄▄▃▄▄▃▃▅▄▄▄▄▄▄▅▄▅ █
  55.5 ns       Histogram: log(frequency) by time        97 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Zentrik · September 30, 2025, 9:49pm

Have you tried LoopVectorization.jl?

stevengj · October 1, 2025, 1:02am

Right, do you really need a BitVector, as opposed to Vector{Bool}? The latter has much better performance for accessing individual elements one at a time.

If the fraction of falses is large enough, maybe it would be a win to use a BitVector and operate directly on the exposed_i.chunks array of UInt64 values, since then you can look at 64 bits at a time. If exposed_i.chunks[i] is zero, you can skip the whole chunk, and even if it’s nonzero you can maybe do some bit-twiddling to find the nonzero bits more quickly.

lmiq · October 1, 2025, 1:43am

Can be longer but 100-500 is quite typical.

I can do that, yes. Is it mandatory to get SIMD there?

Not really. I was using BitVector because those are 100-500 element Bool vectors per atom. With 10k atoms that may start to be a problem. With BitVector the memory is much more compact, and in my original implementation the performance was not very different. Skipping whole chunks seems useful, although with what I tried I get roughly the same performance I have.

This was my quick experiment, I have no idea for now how to improve the true-bit mutation. I guess if I could chunk that in smaller chunks and not pay a very high price on that mutation, it would be the ideal case.

function update_dot_exposure_fast!(deltaxy, dot_cache_i, exposed_i, R_j_sq)
    @inbounds for i in eachindex(exposed_i.chunks)
        if exposed_i.chunks[i] != 0
            first = (i-1)*64 + 1
            last = min(first + 63, length(exposed_i))
            for idot in first:last
                exposed_i[idot] = exposed_i[idot] & (sum(abs2, dot_cache_i[idot] + deltaxy) > R_j_sq)
            end
        end
    end
    return nothing
end

I’m focusing now on getting SIMD to work. But I need to reorganize the data structure. That seems more promising (also because there chunks are of 8 elements, thus I can skip more chunks).

I did, but it didn`t recognize the loop operations as they are/were, and honestly I’m not very enthusiastic about having it as a dependency.

lmiq · October 1, 2025, 2:08am

For now the fastest alternative I have is @matthias314 one’s:

using BenchmarkTools
using StaticArrays
#
# Input data
#
function data()
    x = rand(SVector{3,Float32}) 
    y = rand(SVector{3,Float32})
    dot_cache_i = [rand(SVector{3,Float32}) for _ in 1:200]
    exposed_i=[ rand() > 0.8 for _ in 1:200 ]
    rj_sq=0.1f0
    return x-y, dot_cache_i, exposed_i, rj_sq
end

function update_dot_exposure2!(deltaxy, dot_cache_i, exposed_i, rj_sq)
    exposed_i .&= sum.(abs2, Ref(deltaxy) .+ dot_cache_i) .> rj_sq
    return nothing
end

For which I get:

julia> @benchmark update_dot_exposure2!($(data())...)
BenchmarkTools.Trial: 10000 samples with 991 evaluations per sample.
 Range (min … max):  41.713 ns … 79.242 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     42.464 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   42.683 ns ±  1.007 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂      ▆█▄        ▁▂▁                                       ▁
  █▆▁▁▁▃▁███▆▄▄▅▅▆█▆████▇▆█▆▆▅▅▅▅▆▃▆▅▃▄▄▁▅▆▄▅▄▁▃▄▆▄▄▁▃▄▅▅██▇█ █
  41.7 ns      Histogram: log(frequency) by time      47.1 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

vs. ~170 ns before (good!). With the BitVector I get ~110 ns.

Here the fraction of falses of exposed_i does not affect the performance. I guess that if we can get it to skip the falses with some good chance, it can be even better, but with the naive things I tried loosing the SIMD or even just checking for true values is not worth it.

Zentrik · October 1, 2025, 7:59am

Given matthias314’s version is fast for you (it’s not for me interestingly), I guess not. SOA just makes SIMDing this easy.

lmiq · October 1, 2025, 8:53am

I’ll still check if your manual SIMD can be faster, because it allows the skipping of blocks.

(actually yours, in isolation, might be faster, I just didn’t mention it above because I didn’t implement it in the actual program because of then necessary changes in the data structure)

sgaure · October 1, 2025, 9:54am

I’ve been into this problem of skipping falses, and it can be done quite fast if the Bools are stored as bits, like in BitVector. Then you can use trailing_zeros (which translate to a single instruction). However, I don’t know if this will speed things up here since the loops are more involved:

blsr(x) = x & (x - one(x))  # clear lowest set bit, typically translates to single blsr inst

function sumbits(bits)
    a = 0
    while bits ≠ 0
        i = trailing_zeros(bits) % Int
        a += i
        bits = blsr(bits)
    end
    return a
end

bits = rand(UInt)
julia> @code_native debuginfo=:none dump_module=false sumbits(bits)
...
L32:
        tzcnt   rcx, rdi
        add     rax, rcx
        blsr    rdi, rdi
        jne     L32
L47:
        pop     rbp
        ret

sgaure · October 1, 2025, 10:54am

I tried this one, by using internals of BitVector:


function data2()
    x = rand(SVector{3,Float32})
    y = rand(SVector{3,Float32})
    dot_cache_i = [rand(SVector{3,Float32}) for _ in 1:200]
    exposed_i=BitVector([ rand() > 0.8 for _ in 1:200 ])
    rj_sq=0.1f0
    return x-y, dot_cache_i, exposed_i, rj_sq
end

blsr(x) = x & (x - one(x))
clearbit(bits, i)= bits & ~(oftype(bits,1)<<i)

function update_dot_exposure2!(deltaxy, dot_cache_i, exposed_i, rj_sq)
    offset = 0
    for ch in eachindex(exposed_i.chunks)
        chunk = exposed_i.chunks[ch]
        newchunk = chunk
        while chunk ≠ 0
            i = offset + trailing_zeros(chunk)
            r = sum(abs2, deltaxy + dot_cache_i[i+1]) > rj_sq
            r || (newchunk = clearbit(newchunk, i))
            chunk = blsr(chunk)
        end
        exposed_i.chunks[ch] = newchunk
        offset += 8sizeof(chunk)
    end
    return nothing
end

d2 = data2();
@benchmark update_dot_exposure2!(d...) setup=(d=d2)

lmiq · October 1, 2025, 11:18am

This one seems faster, but it is not giving the same result, so I’ll have to check that. Thanks for the hints!

sgaure · October 1, 2025, 11:20am

Ah, I haven’t checked that it does exactly the same! And the speed is a tradeoff between density of trues and the use of SIMD.

sgaure · October 1, 2025, 11:29am

Seems I messed with when to add the offset:

function update_dot_exposure2!(deltaxy, dot_cache_i, exposed_i, rj_sq)
    offset = 0
    for ch in eachindex(exposed_i.chunks)
        chunk = exposed_i.chunks[ch]
        newchunk = chunk
        while chunk ≠ 0
            i = trailing_zeros(chunk)
            r = sum(abs2, deltaxy + dot_cache_i[offset + i+1]) > rj_sq
            r || (newchunk = clearbit(newchunk, i))
            chunk = blsr(chunk)
        end
        exposed_i.chunks[ch] = newchunk
        offset += 8sizeof(chunk)
    end
    return nothing
end

lmiq · October 1, 2025, 12:05pm

Very cool! We are much faster now!

The faster method now is @Zentrik’s manual SIMD using the SoA data structure. With some small updates for the actual application, I get:

SoA - SIMD

using StaticArrays
using SIMD: VecRange

# passing N as a value type just felt right. 
function update_dot_exposure!(deltaxy, dot_cache, exposed_i, rj_sq, ::Val{N}) where {N}
    lastN = N * (length(exposed_i) ÷ N)
    lane = VecRange{N}(0)
    @inbounds for i in 1:N:lastN
        if any(exposed_i[lane + i])
            pos_x = dot_cache.x[lane + i] + deltaxy[1]
            pos_y = dot_cache.y[lane + i] + deltaxy[2]
            pos_z = dot_cache.z[lane + i] + deltaxy[3]
            dist_sq = pos_x*pos_x + pos_y*pos_y
            dist_sq += pos_z*pos_z
            outside = dist_sq >= rj_sq
            exposed_i[lane + i] &= outside
        end
    end
    # Remaining 
    @inbounds for i in lastN+1:length(exposed_i)
        pos_x = dot_cache.x[i] + deltaxy[1]
        pos_y = dot_cache.y[i] + deltaxy[2]
        pos_z = dot_cache.z[i] + deltaxy[3]
        dist_sq = pos_x*pos_x + pos_y*pos_y
        dist_sq += pos_z*pos_z
        outside = dist_sq >= rj_sq
        exposed_i[i] &= outside
    end
    return exposed_i
end

struct DotCache{T}
    x::Vector{T}
    y::Vector{T}
    z::Vector{T}
end

function data_soa()
    x = rand(SVector{3,Float32}) 
    y = rand(SVector{3,Float32})
    dot_cache = DotCache(rand(Float32,200), rand(Float32,200), rand(Float32,200))
    exposed_i=[ rand() > 0.8 for _ in 1:200 ]
    rj_sq=0.1f0
    N=Val(16)
    return x-y, dot_cache, exposed_i, rj_sq, N
end

With which I get:

julia> @benchmark update_dot_exposure!($(data_soa())...)
BenchmarkTools.Trial: 10000 samples with 996 evaluations per sample.
 Range (min … max):  22.755 ns … 35.858 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     23.578 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   23.562 ns ±  0.542 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂        ▅▇▁       ▇█▁▁      ▂▅                             ▂
  ██▃▁▁▁▁▃▁███▅▆▆▅▆▅▆████▆▆▅▆▅▅███▅▄▅▇▇▆▇██▇▆▆▇▇▇▇▇█▆▇▆▆▆▇▆▄▆ █
  22.8 ns      Histogram: log(frequency) by time      25.2 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

With @sgaure’s bitvector approach, I’m quite close (but still the results do not exactly match, something I need to investigate):

BitVector

using StaticArrays

blsr(x) = x & (x - one(x))
clearbit(bits, i)= bits & ~(oftype(bits,1)<<i)
function update_dot_exposure!(deltaxy, dot_cache_i, exposed_i, rj_sq)
    offset = 0
    for ch in eachindex(exposed_i.chunks)
        chunk = exposed_i.chunks[ch]
        newchunk = chunk
        while chunk ≠ 0
            i = trailing_zeros(chunk)
            r = sum(abs2, deltaxy + dot_cache_i[offset + i+1]) > rj_sq
            r || (newchunk = clearbit(newchunk, i))
            chunk = blsr(chunk)
        end
        exposed_i.chunks[ch] = newchunk
        offset += 8sizeof(chunk)
    end
    return nothing
end

function data_bitvector()
    x = rand(SVector{3,Float32})
    y = rand(SVector{3,Float32})
    dot_cache_i = [rand(SVector{3,Float32}) for _ in 1:200]
    exposed_i=BitVector([ rand() > 0.8 for _ in 1:200 ])
    rj_sq=0.1f0
    return x-y, dot_cache_i, exposed_i, rj_sq
end

julia> @benchmark update_dot_exposure!($(data_bitvector())...)
BenchmarkTools.Trial: 10000 samples with 991 evaluations per sample.
 Range (min … max):  40.841 ns … 66.689 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     42.267 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   42.046 ns ±  0.546 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                 ▅██▅▁           ▄██▆▂     ▁        ▁     ▁▁▁ ▂
  ▅▇▄▁▁▁▁▁▁▃▁▁▁▃██████▇▆▆▅▇▆▆▆▆▆██████▆▅▅█████▇█▇▇▆▇█▇▇▇▆████ █
  40.8 ns      Histogram: log(frequency) by time      43.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

With the fastest implementation so far, we very, very, close to the fastest SASA calculation I know of (and we can thread on top of this with additional gains).

matthias314 · October 1, 2025, 12:10pm

Since vectorized code can process 8 or 16 Float32 in parallel (depending on your processor), I would think that one needs at least 7/8 or 15/16 of all exposed_i values to be false for a non-vectorized version to be faster. However, in @lmiq’s example it’s 80%. Also, the chance of a whole 64-bit chunk to be 0 is then less than 10^{-6}, assuming that the values are independent. In @lmiq’s latest benchmarks the non-vectorized version is as fast as the vectorized one (without SoA). Is it realistic to assume that one can do better?

sgaure · October 1, 2025, 12:18pm

@Zentrik’s SoA (or “FORTRAN code in julia”) rules! The trailing_zeros code work reasonably well for low density of trues (break even with your code for 0.3 on my threadripper, and faster for the 0.2 in the examples).

matthias314 · October 1, 2025, 12:28pm

With the SoA version

sumabs2(x, y, z) = @fastmath abs2(x) + abs2(y) + abs2(z)

function update_dot_exposure_vec!(x, y, dot_cache_x, dot_cache_y, dot_cache_z, exposed_i, rj_sq)
    delta = x-y
    exposed_i .&= sumabs2.(delta[1] .+ dot_cache_x, delta[2] .+ dot_cache_y, delta[3] .+ dot_cache_z) .> rj_sq
    return nothing
end

one can get quite close to @Zentrik’s performance:

julia> @b update_dot_exposure_v3!($(data_soa())...)
50.411 ns

julia> @b update_dot_exposure_vec!($(data_soa())...)
56.582 ns

EDIT: even closer with @fastmath. Also, I changed rj_sq in data_soa() to be of type Float32. That makes a big difference for the function above, but not for @Zentrik’s.

Topic		Replies	Views
Can I make my code faster with parallelism, or just plain better coding? Performance	33	2889	February 19, 2019
Performance improvements for simple molecular surface caclulation Performance	15	686	September 14, 2020
Optimizing iteration over slices of multiple matrices Performance question , array	14	1946	December 30, 2018
Innefficient paralellization? Need some help optimizing a simple dot product Performance question , parallel	32	4889	March 28, 2018
ANN: LoopVectorization 0.12: multithreading and better handling of discontiguous memory accesses Performance	16	2247	March 17, 2021

Nerd-sniping: can you make this faster?

Related topics