Nerd-sniping: can you make this faster?

I’ve pulled out the deltaxy calculation, so it could still be unfair there. But in practice (in the actual application), I get, with this last version:

julia> @b atomic_sasa(prot; parallel=false)
13.640 ms (67044 allocs: 2.888 MiB)

while with Zentrik’s I get:

julia> @b atomic_sasa(prot; parallel=false)
9.760 ms (67044 allocs: 2.888 MiB)

FWIW, this is what we gained so far; the original version was:

julia> @b atomic_sasa(prot; parallel=false)
80.444 ms (65546 allocs: 2.248 MiB)

(not irrelevant at all, since we need to run this calculation over thousands of frames of MD trajectories)


Not sure how much help it would be (if any), but

\sum_{i=1}^n (\delta_i + c_i^j)^2 > R^2, \forall j

can be also written as

\sum_{i=1}^n (2\delta_i + c_i^j) c_i^j > R^2 - \sum_{i=1}^n \delta_i^2, \forall j.
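(To see this, expand the square: (\delta_i + c_i^j)^2 = \delta_i^2 + (2\delta_i + c_i^j)\,c_i^j, so the \delta_i^2 terms can be moved across the inequality.)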

The right-hand side does not depend on j, so it needs to be computed only once. I suppose the left-hand side would translate to

lhs = ((dot_cache.x[ind] + 2 * deltaxy[1]) * dot_cache.x[ind] + 
       (dot_cache.y[ind] + 2 * deltaxy[2]) * dot_cache.y[ind] + 
       (dot_cache.z[ind] + 2 * deltaxy[3]) * dot_cache.z[ind])

and the right-hand side would be fixed to

rhs = rj_sq - sum(abs2, deltaxy)

Edit: And obviously the condition of interest:

exposed_i[ind] &= (lhs > rhs)
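
Putting it together, the inner loop over the dot indices (ind, playing the role of j above) would hoist the right-hand side out, e.g. (a sketch, using the names from the snippets above):

rhs = rj_sq - sum(abs2, deltaxy)  # independent of the dot index, computed once
for ind in eachindex(exposed_i)
    lhs = ((dot_cache.x[ind] + 2 * deltaxy[1]) * dot_cache.x[ind] +
           (dot_cache.y[ind] + 2 * deltaxy[2]) * dot_cache.y[ind] +
           (dot_cache.z[ind] + 2 * deltaxy[3]) * dot_cache.z[ind])
    exposed_i[ind] &= (lhs > rhs)
end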

Nice catch, but there was no speedup:

julia> @benchmark atomic_sasa($prot; parallel=false)
BenchmarkTools.Trial: 507 samples with 1 evaluation per sample.
 Range (min … max):  9.351 ms …  17.197 ms  ┊ GC (min … max): 0.00% … 41.86%
 Time  (median):     9.491 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.859 ms ± 820.727 μs  ┊ GC (mean ± σ):  2.58% ±  6.11%

  ▃█▇▄▃▂▂▂▁▂                                       ▁
  ██████████▇█▇▇▅▆██▇▇▇▇▄▅▅▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▅▅▇██▇▇▆▆▄▁▅▅ ▇
  9.35 ms      Histogram: log(frequency) by time      12.2 ms <

 Memory estimate: 2.89 MiB, allocs estimate: 67044.

vs (previously):

julia> @benchmark atomic_sasa($prot; parallel=false)
BenchmarkTools.Trial: 510 samples with 1 evaluation per sample.
 Range (min … max):  9.306 ms …  16.500 ms  ┊ GC (min … max): 0.00% … 40.93%
 Time  (median):     9.469 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.824 ms ± 829.032 μs  ┊ GC (mean ± σ):  2.53% ±  6.20%

  ▁█▄
  ███▅▄▄▄▃▄▃▄▃▅▅▃▃▃▃▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▂▂▂▂▃▂▃▃▂▂▁▂▃ ▃
  9.31 ms         Histogram: frequency by time        12.2 ms <

 Memory estimate: 2.89 MiB, allocs estimate: 67044.

Btw, note that by using StructArrays.jl you can conveniently view the struct-of-arrays (SoA) layout also as an array-of-structs (AoS).

julia> using StructArrays, StaticArrays

julia> s = StructArray{SVector{3,Float32}}((dot_cache.x, dot_cache.y, dot_cache.z))
200-element StructArray(::Vector{Float32}, ::Vector{Float32}, ::Vector{Float32}) with eltype SVector{3, Float32}:
 [0.54684204, 0.0305565, 0.19615191]
 [0.12136942, 0.9855579, 0.84889585]
...
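
With that view, the per-dot test can then use plain SVector arithmetic, e.g. (a sketch; assuming deltaxy is the SVector{3,Float32} from the snippets above):

for ind in eachindex(exposed_i)
    exposed_i[ind] &= sum(abs2, s[ind] + deltaxy) >= rj_sq
end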

I’m settling down with this version, based on Zentrik’s suggestion:

using SIMD: VecRange
using StaticArrays
using BenchmarkTools

function update_dot_exposure!(deltaxy, dot_cache, exposed_i, rj_sq, ::Val{N}) where {N}
    lastN = N * (length(exposed_i) ÷ N)  # largest multiple of N that fits
    lane = VecRange{N}(0)                # N-wide SIMD lane
    @inbounds for i in 1:N:lastN
        # Skip the chunk entirely if all of its dots are already buried
        if any(exposed_i[lane + i])
            pos_x = dot_cache.x[lane + i] + deltaxy[1]
            pos_y = dot_cache.y[lane + i] + deltaxy[2]
            pos_z = dot_cache.z[lane + i] + deltaxy[3]
            exposed_i[lane + i] &= sum(abs2, (pos_x, pos_y, pos_z)) >= rj_sq
        end
    end
    # Remaining tail elements (when the length is not a multiple of N)
    @inbounds for i in lastN+1:length(exposed_i)
        pos_x = dot_cache.x[i] + deltaxy[1]
        pos_y = dot_cache.y[i] + deltaxy[2]
        pos_z = dot_cache.z[i] + deltaxy[3]
        exposed_i[i] &= sum(abs2, (pos_x, pos_y, pos_z)) >= rj_sq
    end
    return exposed_i
end

struct DotCache{T}
    x::Vector{T}
    y::Vector{T}
    z::Vector{T}
end

function data_soa()
    x = rand(SVector{3,Float32}) 
    y = rand(SVector{3,Float32})
    dot_cache = DotCache(rand(Float32,200), rand(Float32,200), rand(Float32,200))
    exposed_i = [rand() > 0.8 for _ in 1:200]
    rj_sq = 0.1f0
    N = Val(16)
    return x-y, dot_cache, exposed_i, rj_sq, N
end
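
For reference, a plain scalar version (my addition, not part of the benchmarked code) can serve as a sanity check that the SIMD kernel computes the same mask:

function update_dot_exposure_scalar!(deltaxy, dot_cache, exposed_i, rj_sq)
    for i in eachindex(exposed_i)
        pos = (dot_cache.x[i] + deltaxy[1], dot_cache.y[i] + deltaxy[2], dot_cache.z[i] + deltaxy[3])
        exposed_i[i] &= sum(abs2, pos) >= rj_sq
    end
    return exposed_i
end

deltaxy, dot_cache, exposed_i, rj_sq, N = data_soa()
@assert update_dot_exposure!(deltaxy, dot_cache, copy(exposed_i), rj_sq, N) ==
        update_dot_exposure_scalar!(deltaxy, dot_cache, copy(exposed_i), rj_sq)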

With the following performance:

julia> @benchmark update_dot_exposure!($(data_soa())...)
BenchmarkTools.Trial: 10000 samples with 996 evaluations per sample.
 Range (min … max):  22.367 ns … 47.081 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     23.172 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   23.140 ns ±  0.437 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ▅▆         ▃█▆          ▂                        ▂
  ▇▇▁▁▁▁▁▁▁▃▁███▅▆▄▄▅▆▅▄███▆▆▄▄▅▆▄▅▅▆██▄▆▆▆▅▄▅▅▅▅▇▆▅▇███▇▆▆▆▇ █
  22.4 ns      Histogram: log(frequency) by time      24.4 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Feel free to squeeze it further 🙂

Thanks all for the help!


If one consistently uses the fixed-length vector types from SmallCollections.jl (disclaimer: I’m the author), then the easy-peasy broadcasting version becomes noticeably faster:

julia> @b update_dot_exposure!($(data_soa())...)  # lmiq's latest version
28.198 ns

julia> @b update_dot_exposure_small!($(data_soa_small())...) 
18.962 ns
Code
using SmallCollections: FixedVector, MutableFixedVector

struct DotCacheSmall{V}
    x::V
    y::V
    z::V
end

function data_soa_small()
    x = rand(FixedVector{3,Float32})
    y = rand(FixedVector{3,Float32})
    M = 200
    dot_cache = DotCacheSmall((FixedVector{M}(rand(Float32, M)) for _ in 1:3)...)
    exposed_i = MutableFixedVector{M}([ rand() > 0.8 for _ in 1:M ])
    rj_sq = 0.1f0
    return x-y, dot_cache, exposed_i, rj_sq
end

sumabs2(x, y, z) = @fastmath abs2(x) + abs2(y) + abs2(z)

function update_dot_exposure_small!(deltaxy, dot_cache, exposed_i, rj_sq)
    @fastmath exposed_i .&= sumabs2.(deltaxy[1] .+ dot_cache.x, deltaxy[2] .+ dot_cache.y, deltaxy[3] .+ dot_cache.z) .>= rj_sq
    return nothing
end

Some comments:

  • This requires the current GitHub version SmallCollections#master; the published version is missing an @inline, which spoils performance.
  • For some reason this doesn’t work with StaticArrays.jl.
  • Now the length of the data vectors is fixed. This is of course not what one wants, but one might be able to chop the data into chunks of fixed size, say 128 or 256.

EDIT: Here is a proof of concept with a wrapper ChunkedVector (with chunk size 64). This now works for arbitrary data vectors.

New code
const M = 200  # length of data vectors
const C = 64   # chunk size

using SmallCollections: FixedVector, MutableFixedVector

# chunked vectors

struct ChunkedVector{N,T} <: AbstractVector{T}
    chunks::Vector{MutableFixedVector{N,T}}
    n::Int
end

function ChunkedVector{N}(w::AbstractVector{T}) where {N,T}
    m = (length(w)-1) ÷ N + 1  # number of chunks, rounded up
    v = ChunkedVector([zero(MutableFixedVector{N,T}) for _ in 1:m], length(w))
    copyto!(v, w)
end

Base.size(v::ChunkedVector) = (v.n,)

function Base.getindex(v::ChunkedVector{N}, i::Int) where N
    p, q = divrem(i-1, N)  # chunk index and offset within the chunk
    v.chunks[p+1][q+1]
end

function Base.setindex!(v::ChunkedVector{N}, x, i::Int) where N
    p, q = divrem(i-1, N)
    v.chunks[p+1][q+1] = x
end

#

struct DotCacheSmall{V}
    x::V
    y::V
    z::V
end

sumabs2(x, y, z) = @fastmath abs2(x) + abs2(y) + abs2(z)

function update_dot_exposure_chunked!(deltaxy, dot_cache, exposed_i, rj_sq)
    for (x, y, z, e) in zip(dot_cache.x.chunks, dot_cache.y.chunks, dot_cache.z.chunks, exposed_i.chunks)
        @fastmath e .&= sumabs2.(deltaxy[1] .+ x, deltaxy[2] .+ y, deltaxy[3] .+ z) .>= rj_sq
    end
end

function data_soa_chunked()
    x = rand(FixedVector{3,Float32})
    y = rand(FixedVector{3,Float32})
    dot_cache = DotCacheSmall((ChunkedVector{C}(rand(Float32, M)) for _ in 1:3)...)
    exposed_i = ChunkedVector{C}([ rand() > 0.8 for _ in 1:M ])
    rj_sq = 0.1f0
    return x-y, dot_cache, exposed_i, rj_sq
end
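
For illustration (my example, not from the original post), the wrapper indexes like an ordinary vector while the kernel above works chunk by chunk:

v = ChunkedVector{C}(rand(Float32, M))
@assert length(v) == M
v[5] = 1.0f0            # stored in chunk 1, slot 5
@assert v[5] == 1.0f0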

Performance varies depending on the length. For vectors of length 200 I get

julia> @b update_dot_exposure!($(data_soa())...)
28.421 ns

julia> @b update_dot_exposure_chunked!($(data_soa_chunked())...)
25.395 ns

and for length 300

julia> @b update_dot_exposure!($(data_soa())...) 
40.457 ns

julia> @b update_dot_exposure_chunked!($(data_soa_chunked())...)
31.273 ns

You mean the M of the first code? I don’t think I need more flexibility than that (except perhaps that it might stress the compiler if the length varies across executions?).

PS: Is 200 a small collection? Or might we start having compilation-time issues, as with StaticArrays?

There was that other package, I think, FixedSizeArrays.jl. Can that help?

Yes, that’s the length of the data vectors in the first code.

> Is 200 a small collection? Or might we start having compilation-time issues, as with StaticArrays?

It’s not small in the sense of SmallCollections.jl, so I’m surprised myself that it leads to fast code. But I did notice long compilation times, more than for the chunked version.

> FixedSizeArrays.jl. Can that help?

That would be worth trying out!

Turns out that it makes the broadcasting approach much slower.

1 Like

Can you open an issue with a minimal reproducer in the repo?

But also, much slower… than what? What did you swap for what? I’ve been trying to look at this myself, but there are so many different code examples in this thread that I have no idea what I’m looking at.

If you swapped StaticArrays for FixedSizeArrays: I tried hard to explain in the docs that FixedSizeArrays.jl is not a replacement for StaticArrays.jl. The only property they share is that the arrays cannot be resized; for the rest they’re completely different. FixedSizeArray is supposed to be very close to Array, just not resizable, which is also what the JuliaCon talk tried to convey already from the title.
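
Concretely, a minimal sketch of the intended usage (assuming the package's exported FixedSizeVector and its Array-style undef constructor):

using FixedSizeArrays

v = FixedSizeVector{Float32}(undef, 200)  # heap-allocated, behaves like Vector
v .= rand(Float32, 200)                   # elements are mutable
# push!(v, 0f0)  # unlike Vector, this errors: the length is fixed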

I’ve changed the data vectors (length 200) from Vector to FixedSizeArray. The SVector in the code has not been replaced. See the MWE in the GitHub issue I’ve just created.

I am observing a ~40% slowdown with Julia 1.12.0-rc3 compared to 1.11.7, across all approaches (except for the one using FixedSizeArrays.jl, where apparently some Julia bugs were fixed). For example, @lmiq’s current solution goes from 26.637 ns to 40.826 ns. Do other people also see this?

Yes, I confirm the regression:

julia> @benchmark update_dot_exposure!($(data_soa())...)
BenchmarkTools.Trial: 10000 samples with 994 evaluations per sample.
 Range (min … max):  31.229 ns … 72.302 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     31.284 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   31.424 ns ±  0.680 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃█▇▁                              ▁                      ▂▄ ▁
  ████▅▄▄▁▁▁▃▁▁▃▃▄▃▁▁▃▃▁▁▄▃▄▁▃▃▃▁▃▃▇████▇▆▆▅▅▅▄▄▄▅▄▄▃▃▅▄▄▃▁██ █
  31.2 ns      Histogram: log(frequency) by time      32.9 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

vs Nerd-sniping: can you make this faster? - #25 by lmiq

Edit: although in my actual application the regression is much smaller:

% time julia +1.12 --project -e "using PDBTools; @time(sasa(atomic_sasa(read_pdb(\"6co8.pdb\"))))"
  5.331173 seconds (55.80 M allocations: 1.993 GiB, 21.69% gc time, 3.27% compilation time)

real	0m5,899s
user	0m6,837s
sys	0m0,339s

#vs

% time julia +1.11 --project -e "using PDBTools; @time(sasa(atomic_sasa(read_pdb(\"6co8.pdb\"))))"
  4.853135 seconds (55.80 M allocations: 1.987 GiB, 17.68% gc time, 0.22% compilation time: 100% of which was recompilation)

real	0m5,400s
user	0m6,273s
sys	0m0,391s
1 Like
julia> @benchmark update_dot_exposure!($(data_soa())...)
BenchmarkTools.Trial: 10000 samples with 991 evaluations per sample.
 Range (min … max):  39.168 ns … 130.333 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     44.017 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   45.130 ns ±   6.272 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁   ▆█▅▂▃▃▁▂                                                 ▁
  █▆▇█████████▆▆▆▅▆▆▇▆▆▇▆▆▆▆▅▅▅▅▅▅▅▅▄▄▃▃▃▃▃▁▄▁▃▁▁▁▃▃▁▁▁▄▁▁▁▃▁█ █
  39.2 ns       Histogram: log(frequency) by time        91 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> versioninfo()
Julia Version 1.12.0-rc3
Commit 7522b240144 (2025-09-26 07:42 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 14 × Intel(R) Core(TM) Ultra 7 155U
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, alderlake)
  GC: Built with stock GC
Threads: 14 default, 1 interactive, 14 GC (on 14 virtual cores)

Out of curiosity: How does the broadcasting approach compare to SIMD in 1.12 for the whole application? For update_dot_exposure! I cannot see any difference anymore.

It is somewhat slower:

% time julia +1.12 --project -e "using PDBTools; @time(sasa(atomic_sasa(read_pdb(\"6co8.pdb\"))))"
  6.570915 seconds (57.20 M allocations: 2.143 GiB, 15.10% gc time, 5.72% compilation time)

real	0m8,321s
user	0m9,217s
sys	0m0,380s

% time julia +1.11 --project -e "using PDBTools; @time(sasa(atomic_sasa(read_pdb(\"6co8.pdb\"))))"
  6.444734 seconds (57.46 M allocations: 2.149 GiB, 12.27% gc time, 8.98% compilation time: 2% of which was recompilation)

real	0m7,656s
user	0m8,527s
sys	0m0,394s

Or, to be a bit clearer:

julia> @b sasa(atomic_sasa(read_pdb("6co8.pdb")))
6.005 s (55785174 allocs: 2.071 GiB, 10.89% gc time, without a warmup)

vs

julia> @b sasa(atomic_sasa(read_pdb("6co8.pdb")))
5.005 s (55785174 allocs: 1.992 GiB, 16.77% gc time, without a warmup)

For the record, it turned out that the slowdown of FixedSizeVector relative to Vector was real on Julia v1.11, but it’s gone in Julia v1.13 (and actually in v1.12 too, I just tested). A bunch of compiler issues that we discovered during the development of the package have been fixed between 1.12 and 1.13.
