Nerd-sniping: can you make this faster?

I’ve pulled out the deltaxy calculation, so it could still be unfair there. But in practice (in the actual application), I get, with this last version:

julia> @b atomic_sasa(prot; parallel=false)
13.640 ms (67044 allocs: 2.888 MiB)

while with Zentrik’s I get:

julia> @b atomic_sasa(prot; parallel=false)
9.760 ms (67044 allocs: 2.888 MiB)

FWIW, this is what we gained so far; the original version was:

julia> @b atomic_sasa(prot; parallel=false)
80.444 ms (65546 allocs: 2.248 MiB)

(not irrelevant at all, since we need to run this calculation over thousands of frames of MD trajectories)


Not sure how much help it would be (if any), but

\sum_{i=1}^n (\delta_i + c_i^j)^2 > R^2, \forall j

can be also written as

\sum_{i=1}^n (2\delta_i + c_i^j) c_i^j > R^2 - \sum_{i=1}^n \delta_i^2, \forall j.
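(To see this, expand the square: (\delta_i + c_i^j)^2 = \delta_i^2 + (2\delta_i + c_i^j)\,c_i^j, so the \delta_i^2 terms can be moved across the inequality.)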

The right-hand side does not depend on j, so it needs to be computed only once. I suppose the left-hand side would translate to

lhs = ((dot_cache.x[ind] + 2 * deltaxy[1]) * dot_cache.x[ind] + 
       (dot_cache.y[ind] + 2 * deltaxy[2]) * dot_cache.y[ind] + 
       (dot_cache.z[ind] + 2 * deltaxy[3]) * dot_cache.z[ind])

and the right-hand side would be fixed to

rhs = rj_sq - sum(abs2, deltaxy)

Edit: And obviously the condition of interest:

exposed_i[ind] &= (lhs > rhs)
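
Putting it together, the inner loop over the dot indices (ind, playing the role of j above) would hoist the right-hand side out, e.g. (a sketch, using the names from the snippets above):

rhs = rj_sq - sum(abs2, deltaxy)  # independent of the dot index, computed once
for ind in eachindex(exposed_i)
    lhs = ((dot_cache.x[ind] + 2 * deltaxy[1]) * dot_cache.x[ind] +
           (dot_cache.y[ind] + 2 * deltaxy[2]) * dot_cache.y[ind] +
           (dot_cache.z[ind] + 2 * deltaxy[3]) * dot_cache.z[ind])
    exposed_i[ind] &= (lhs > rhs)
end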

Nice catch, but there was no speedup:

julia> @benchmark atomic_sasa($prot; parallel=false)
BenchmarkTools.Trial: 507 samples with 1 evaluation per sample.
 Range (min … max):  9.351 ms …  17.197 ms  ┊ GC (min … max): 0.00% … 41.86%
 Time  (median):     9.491 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.859 ms ± 820.727 μs  ┊ GC (mean ± σ):  2.58% ±  6.11%

  ▃█▇▄▃▂▂▂▁▂                                       ▁
  ██████████▇█▇▇▅▆██▇▇▇▇▄▅▅▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▅▅▇██▇▇▆▆▄▁▅▅ ▇
  9.35 ms      Histogram: log(frequency) by time      12.2 ms <

 Memory estimate: 2.89 MiB, allocs estimate: 67044.

vs (previously):

julia> @benchmark atomic_sasa($prot; parallel=false)
BenchmarkTools.Trial: 510 samples with 1 evaluation per sample.
 Range (min … max):  9.306 ms …  16.500 ms  ┊ GC (min … max): 0.00% … 40.93%
 Time  (median):     9.469 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.824 ms ± 829.032 μs  ┊ GC (mean ± σ):  2.53% ±  6.20%

  ▁█▄
  ███▅▄▄▄▃▄▃▄▃▅▅▃▃▃▃▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▂▂▂▂▃▂▃▃▂▂▁▂▃ ▃
  9.31 ms         Histogram: frequency by time        12.2 ms <

 Memory estimate: 2.89 MiB, allocs estimate: 67044.

Btw, note that by using StructArrays.jl you can conveniently view the struct-of-arrays (SoA) layout also as an array-of-structs (AoS).

julia> using StructArrays, StaticArrays

julia> s = StructArray{SVector{3,Float32}}((dot_cache.x, dot_cache.y, dot_cache.z))
200-element StructArray(::Vector{Float32}, ::Vector{Float32}, ::Vector{Float32}) with eltype SVector{3, Float32}:
 [0.54684204, 0.0305565, 0.19615191]
 [0.12136942, 0.9855579, 0.84889585]
...
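
With that view, the per-dot test can then use plain SVector arithmetic, e.g. (a sketch; assuming deltaxy is the SVector{3,Float32} from the snippets above):

for ind in eachindex(exposed_i)
    exposed_i[ind] &= sum(abs2, s[ind] + deltaxy) >= rj_sq
end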

I’m settling down with this version, based on Zentrik’s suggestion:

using SIMD: VecRange
using StaticArrays
using BenchmarkTools

function update_dot_exposure!(deltaxy, dot_cache, exposed_i, rj_sq, ::Val{N}) where {N}
    lastN = N * (length(exposed_i) ÷ N)  # largest multiple of N that fits
    lane = VecRange{N}(0)                # N-wide SIMD lane
    @inbounds for i in 1:N:lastN
        # Skip the chunk entirely if all of its dots are already buried
        if any(exposed_i[lane + i])
            pos_x = dot_cache.x[lane + i] + deltaxy[1]
            pos_y = dot_cache.y[lane + i] + deltaxy[2]
            pos_z = dot_cache.z[lane + i] + deltaxy[3]
            exposed_i[lane + i] &= sum(abs2, (pos_x, pos_y, pos_z)) >= rj_sq
        end
    end
    # Remaining tail elements (when the length is not a multiple of N)
    @inbounds for i in lastN+1:length(exposed_i)
        pos_x = dot_cache.x[i] + deltaxy[1]
        pos_y = dot_cache.y[i] + deltaxy[2]
        pos_z = dot_cache.z[i] + deltaxy[3]
        exposed_i[i] &= sum(abs2, (pos_x, pos_y, pos_z)) >= rj_sq
    end
    return exposed_i
end

struct DotCache{T}
    x::Vector{T}
    y::Vector{T}
    z::Vector{T}
end

function data_soa()
    x = rand(SVector{3,Float32}) 
    y = rand(SVector{3,Float32})
    dot_cache = DotCache(rand(Float32,200), rand(Float32,200), rand(Float32,200))
    exposed_i = [rand() > 0.8 for _ in 1:200]
    rj_sq = 0.1f0
    N = Val(16)
    return x-y, dot_cache, exposed_i, rj_sq, N
end
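
For reference, a plain scalar version (my addition, not part of the benchmarked code) can serve as a sanity check that the SIMD kernel computes the same mask:

function update_dot_exposure_scalar!(deltaxy, dot_cache, exposed_i, rj_sq)
    for i in eachindex(exposed_i)
        pos = (dot_cache.x[i] + deltaxy[1], dot_cache.y[i] + deltaxy[2], dot_cache.z[i] + deltaxy[3])
        exposed_i[i] &= sum(abs2, pos) >= rj_sq
    end
    return exposed_i
end

deltaxy, dot_cache, exposed_i, rj_sq, N = data_soa()
@assert update_dot_exposure!(deltaxy, dot_cache, copy(exposed_i), rj_sq, N) ==
        update_dot_exposure_scalar!(deltaxy, dot_cache, copy(exposed_i), rj_sq)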

With the following performance:

julia> @benchmark update_dot_exposure!($(data_soa())...)
BenchmarkTools.Trial: 10000 samples with 996 evaluations per sample.
 Range (min … max):  22.367 ns … 47.081 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     23.172 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   23.140 ns ±  0.437 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ▅▆         ▃█▆          ▂                        ▂
  ▇▇▁▁▁▁▁▁▁▃▁███▅▆▄▄▅▆▅▄███▆▆▄▄▅▆▄▅▅▆██▄▆▆▆▅▄▅▅▅▅▇▆▅▇███▇▆▆▆▇ █
  22.4 ns      Histogram: log(frequency) by time      24.4 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Feel free to squeeze it further 🙂

Thanks all for the help!


If one consistently uses the fixed-length vector types from SmallCollections.jl (disclaimer: I’m the author), then the easy-peasy broadcasting version becomes noticeably faster:

julia> @b update_dot_exposure!($(data_soa())...)  # lmiq's latest version
28.198 ns

julia> @b update_dot_exposure_small!($(data_soa_small())...) 
18.962 ns
Code
using SmallCollections: FixedVector, MutableFixedVector

struct DotCacheSmall{V}
    x::V
    y::V
    z::V
end

function data_soa_small()
    x = rand(FixedVector{3,Float32})
    y = rand(FixedVector{3,Float32})
    M = 200
    dot_cache = DotCacheSmall((FixedVector{M}(rand(Float32, M)) for _ in 1:3)...)
    exposed_i = MutableFixedVector{M}([ rand() > 0.8 for _ in 1:M ])
    rj_sq = 0.1f0
    return x-y, dot_cache, exposed_i, rj_sq
end

sumabs2(x, y, z) = @fastmath abs2(x) + abs2(y) + abs2(z)

function update_dot_exposure_small!(deltaxy, dot_cache, exposed_i, rj_sq)
    @fastmath exposed_i .&= sumabs2.(deltaxy[1] .+ dot_cache.x, deltaxy[2] .+ dot_cache.y, deltaxy[3] .+ dot_cache.z) .>= rj_sq
    return nothing
end

Some comments:

  • This requires the current GitHub version SmallCollections#master; the published version is missing an @inline, which spoils performance.
  • For some reason this doesn’t work with StaticArrays.jl.
  • Now the length of the data vectors is fixed. This is of course not what one wants, but one might be able to chop the data into chunks of fixed size, say 128 or 256.

EDIT: Here is a proof of concept with a wrapper ChunkedVector (with chunk size 64). This now works for arbitrary data vectors.

New code
const M = 200  # length of data vectors
const C = 64   # chunk size

using SmallCollections: FixedVector, MutableFixedVector

# chunked vectors

struct ChunkedVector{N,T} <: AbstractVector{T}
    chunks::Vector{MutableFixedVector{N,T}}
    n::Int
end

function ChunkedVector{N}(w::AbstractVector{T}) where {N,T}
    m = (length(w)-1) ÷ N + 1  # number of chunks, rounded up
    v = ChunkedVector([zero(MutableFixedVector{N,T}) for _ in 1:m], length(w))
    copyto!(v, w)
end

Base.size(v::ChunkedVector) = (v.n,)

function Base.getindex(v::ChunkedVector{N}, i::Int) where N
    p, q = divrem(i-1, N)  # chunk index and offset within the chunk
    v.chunks[p+1][q+1]
end

function Base.setindex!(v::ChunkedVector{N}, x, i::Int) where N
    p, q = divrem(i-1, N)
    v.chunks[p+1][q+1] = x
end

#

struct DotCacheSmall{V}
    x::V
    y::V
    z::V
end

sumabs2(x, y, z) = @fastmath abs2(x) + abs2(y) + abs2(z)

function update_dot_exposure_chunked!(deltaxy, dot_cache, exposed_i, rj_sq)
    for (x, y, z, e) in zip(dot_cache.x.chunks, dot_cache.y.chunks, dot_cache.z.chunks, exposed_i.chunks)
        @fastmath e .&= sumabs2.(deltaxy[1] .+ x, deltaxy[2] .+ y, deltaxy[3] .+ z) .>= rj_sq
    end
end

function data_soa_chunked()
    x = rand(FixedVector{3,Float32})
    y = rand(FixedVector{3,Float32})
    dot_cache = DotCacheSmall((ChunkedVector{C}(rand(Float32, M)) for _ in 1:3)...)
    exposed_i = ChunkedVector{C}([ rand() > 0.8 for _ in 1:M ])
    rj_sq = 0.1f0
    return x-y, dot_cache, exposed_i, rj_sq
end
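
For illustration (my example, not from the original post), the wrapper indexes like an ordinary vector while the kernel above works chunk by chunk:

v = ChunkedVector{C}(rand(Float32, M))
@assert length(v) == M
v[5] = 1.0f0            # stored in chunk 1, slot 5
@assert v[5] == 1.0f0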

Performance varies depending on the length. For vectors of length 200 I get

julia> @b update_dot_exposure!($(data_soa())...)
28.421 ns

julia> @b update_dot_exposure_chunked!($(data_soa_chunked())...)
25.395 ns

and for length 300

julia> @b update_dot_exposure!($(data_soa())...) 
40.457 ns

julia> @b update_dot_exposure_chunked!($(data_soa_chunked())...)
31.273 ns

You mean the M of the first code? I don’t think I need more flexibility than that (except perhaps that it might stress the compiler if the length varies across executions?).

PS: Is 200 a small collection? Or might we start having compilation-time issues, as with StaticArrays?

There was that other package, I think, FixedSizeArrays.jl. Can that help?

Yes, that’s the length of the data vectors in the first code.

> Is 200 a small collection? Or might we start having compilation-time issues, as with StaticArrays?

It’s not small in the sense of SmallCollections.jl, so I’m surprised myself that it leads to fast code. But I did notice long compilation times, more than for the chunked version.

> FixedSizeArrays.jl. Can that help?

That would be worth trying out!

Turns out that it makes the broadcasting approach much slower.

1 Like

Can you open an issue with a minimal reproducer in the repo?

But also, much slower… than what? What did you swap for what? I’ve been trying to look at this myself, but there are so many different code examples in this thread that I have no idea what I’m looking at.

If you swapped StaticArrays for FixedSizeArrays: I tried hard to explain in the docs that FixedSizeArrays.jl is not a replacement for StaticArrays.jl. The only property they share is that the arrays cannot be resized; for the rest they’re completely different. FixedSizeArray is supposed to be very close to Array, just not resizable, which is also what the JuliaCon talk tried to convey already from the title.
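
Concretely, a minimal sketch of the intended usage (assuming the package's exported FixedSizeVector and its Array-style undef constructor):

using FixedSizeArrays

v = FixedSizeVector{Float32}(undef, 200)  # heap-allocated, behaves like Vector
v .= rand(Float32, 200)                   # elements are mutable
# push!(v, 0f0)  # unlike Vector, this errors: the length is fixed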

I’ve changed the data vectors (length 200) from Vector to FixedSizeArray. The SVector in the code has not been replaced. See the MWE in the GitHub issue I’ve just created.

I am observing a ~40% slowdown with Julia 1.12.0-rc3 compared to 1.11.7, across all approaches (except for the one using FixedSizeArrays.jl, where apparently some Julia bugs were fixed). For example, @lmiq’s current solution goes from 26.637 ns to 40.826 ns. Do other people also see this?

Yes, I confirm the regression:

julia> @benchmark update_dot_exposure!($(data_soa())...)
BenchmarkTools.Trial: 10000 samples with 994 evaluations per sample.
 Range (min … max):  31.229 ns … 72.302 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     31.284 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   31.424 ns ±  0.680 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃█▇▁                              ▁                      ▂▄ ▁
  ████▅▄▄▁▁▁▃▁▁▃▃▄▃▁▁▃▃▁▁▄▃▄▁▃▃▃▁▃▃▇████▇▆▆▅▅▅▄▄▄▅▄▄▃▃▅▄▄▃▁██ █
  31.2 ns      Histogram: log(frequency) by time      32.9 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

vs Nerd-sniping: can you make this faster? - #25 by lmiq

Edit: although in my actual application the regression is much smaller:

% time julia +1.12 --project -e "using PDBTools; @time(sasa(atomic_sasa(read_pdb(\"6co8.pdb\"))))"
  5.331173 seconds (55.80 M allocations: 1.993 GiB, 21.69% gc time, 3.27% compilation time)

real	0m5,899s
user	0m6,837s
sys	0m0,339s

#vs

% time julia +1.11 --project -e "using PDBTools; @time(sasa(atomic_sasa(read_pdb(\"6co8.pdb\"))))"
  4.853135 seconds (55.80 M allocations: 1.987 GiB, 17.68% gc time, 0.22% compilation time: 100% of which was recompilation)

real	0m5,400s
user	0m6,273s
sys	0m0,391s
1 Like
julia> @benchmark update_dot_exposure!($(data_soa())...)
BenchmarkTools.Trial: 10000 samples with 991 evaluations per sample.
 Range (min … max):  39.168 ns … 130.333 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     44.017 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   45.130 ns ±   6.272 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁   ▆█▅▂▃▃▁▂                                                 ▁
  █▆▇█████████▆▆▆▅▆▆▇▆▆▇▆▆▆▆▅▅▅▅▅▅▅▅▄▄▃▃▃▃▃▁▄▁▃▁▁▁▃▃▁▁▁▄▁▁▁▃▁█ █
  39.2 ns       Histogram: log(frequency) by time        91 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> versioninfo()
Julia Version 1.12.0-rc3
Commit 7522b240144 (2025-09-26 07:42 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 14 × Intel(R) Core(TM) Ultra 7 155U
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, alderlake)
  GC: Built with stock GC
Threads: 14 default, 1 interactive, 14 GC (on 14 virtual cores)

Out of curiosity: How does the broadcasting approach compare to SIMD in 1.12 for the whole application? For update_dot_exposure! I cannot see any difference anymore.

It is somewhat slower:

% time julia +1.12 --project -e "using PDBTools; @time(sasa(atomic_sasa(read_pdb(\"6co8.pdb\"))))"
  6.570915 seconds (57.20 M allocations: 2.143 GiB, 15.10% gc time, 5.72% compilation time)

real	0m8,321s
user	0m9,217s
sys	0m0,380s

% time julia +1.11 --project -e "using PDBTools; @time(sasa(atomic_sasa(read_pdb(\"6co8.pdb\"))))"
  6.444734 seconds (57.46 M allocations: 2.149 GiB, 12.27% gc time, 8.98% compilation time: 2% of which was recompilation)

real	0m7,656s
user	0m8,527s
sys	0m0,394s

Or, to be a bit clearer:

julia> @b sasa(atomic_sasa(read_pdb("6co8.pdb")))
6.005 s (55785174 allocs: 2.071 GiB, 10.89% gc time, without a warmup)

vs

julia> @b sasa(atomic_sasa(read_pdb("6co8.pdb")))
5.005 s (55785174 allocs: 1.992 GiB, 16.77% gc time, without a warmup)

For the record, it turned out that the slowdown of FixedSizeVector relative to Vector was real on Julia v1.11, but it’s gone in Julia v1.13 (and actually in v1.12 too, I just tested). A bunch of compiler issues that we discovered during the development of the package have been fixed between 1.12 and 1.13.
