Efficient operations on vectors of static things

ddcampayo · June 9, 2024, 9:19am

Hello.

I am (still) trying to make sense on how to efficiently traverse a vector containing static vectors and matrices and doing operations on them. This is in the context of molecular dynamics, where I have N particles, with vectors on them (positions, velocities), and matrices (e.g. inertia tensors).

So, to compute the norm of all vectors I have the naïve implementation (function r2) , which is greatly sped-up with the simple use of an inner variable r_i = r[i] (function r2_v2). Using a regular vector for this (function r2_v3) also speeds things up, but not as greatly. All this, due to fewer allocations, I guess.

  174.579 μs (4490 allocations: 140.58 KiB)
  589.067 ns (1 allocation: 7.94 KiB)
  8.175 μs (2 allocations: 8.02 KiB)

But then, when I compute the operation r·J·r (which I need, long story…), these tricks do not work. The number of allocations does not go down. Why?

  32.308 μs (1490 allocations: 31.20 KiB)
  46.945 μs (1979 allocations: 38.84 KiB)
  704.404 μs (6493 allocations: 328.33 KiB)

Here is the code. Thanks for any hints whatsoever.

using StaticArrays

using LinearAlgebra

using BenchmarkTools

N = 1000 ;

rr  =  Vector{SVector{2}}(undef,N);

for i in 1:N rr[i] = rand(2) end

function r2(  r  ) 
    N = length(r)

    rr = Vector{Float64}(undef,N)

    for i in 1:N
        rr[i] = dot( r[i] , r[i] )
    end

    return rr

end

@btime r2($rr) ;

function r2_v2(  r  ) 
    N = length(r)

    rr = Vector{Float64}(undef,N)

    for i in 1:N
        r_i = rr[i]
        rr[i] = dot( r_i , r_i )
    end

    return rr

end

@btime r2_v2($rr) ;

function r2_v3(  r  ) 
    N = length(r)

    rr = Vector{Float64}(undef,N)
    r_i = zeros(2)
    
    for i in 1:N
        r_i .= rr[i]
        rr[i] = dot( r_i , r_i )
    end

    return rr

end

@btime r2_v3($rr) ;

JJ  =  Vector{SMatrix{2, 2}}(undef,N);

for i in 1:N JJ[i] = rand(2,2) end

function rJr(  r , J ) 
    N = length(r)

    rjr = Vector{Float64}(undef,N)

    for i in 1:N
        rjr[i] = dot( r[i] , J[i], r[i] )
    end

    return rjr

end

@btime rJr($rr , $JJ) ;

function rJr_v2(  r , J ) 
    N = length(r)

    rjr = Vector{Float64}(undef,N)

    for i in 1:N
        r_i = rr[i]
        J_i = J[i]

        rjr[i] = dot( r_i , J_i, r_i )
    end

    return rjr

end

@btime rJr_v2($rr , $JJ) ;

function rJr_v3(  r , J ) 
    N = length(r)

    rjr = Vector{Float64}(undef,N)

    r_i = zeros(2)
    J_i = zeros(2,2)
    Jr_i = zeros(2)

    for i in 1:N
        r_i .= rr[i]
        J_i .= J[i]
        Jr_i.= J_i * r_i
        rjr[i] = dot( r_i , Jr_i )
    end

    return rjr

end

@btime rJr_v3($rr , $JJ) ;

lmiq · June 9, 2024, 9:44am

This is an abstract type. You need SMatrix{2,2,Float64,4}, the 4 being redundant but unfortunately necessary.

lmiq · June 9, 2024, 9:51am

Then this is

JJ = rand(SMatrix{2,2,Float64,4}, 1000)

ddcampayo · June 9, 2024, 5:13pm

Thank you. Yes, I had been fighting with an elegant initialization for JJ, and had to settle for a loop

Anyway, your suggestion only seems to help with version 3 of the r·J·r code:

  36.568 μs (2490 allocations: 78.08 KiB)
  50.403 μs (2979 allocations: 85.72 KiB)
  141.808 μs (2493 allocations: 125.20 KiB)

It seems also strange that SVector should work, but not SMatrix, but these things do happen.

lmiq · June 9, 2024, 5:34pm

This is also abstract and should be SVector{2,Float64}

lmiq · June 9, 2024, 5:39pm

I’m on the phone, but the only allocations you should see on those loops, types fixed to be concrete, are in this line of the last example, because the product on the right will allocate a new array for these non-static vectors and matrices (to eliminate that one you would need do use mul!).

ddcampayo · June 9, 2024, 6:02pm

Hi, and thanks again for the two suggestions. The suggestion about turning SVector{2} into SVector{2,Float64} has a great effect:

  971.286 ns (1 allocation: 7.94 KiB)
  558.457 ns (1 allocation: 7.94 KiB)
  8.301 μs (2 allocations: 8.02 KiB)
  4.077 μs (1 allocation: 7.94 KiB)
  54.752 μs (3979 allocations: 116.97 KiB)
  142.932 μs (3493 allocations: 156.45 KiB)

Even tough the numbers are exciting, it is still not clear to me that the function with the inner r_i, r2_v2(r) is slightly better, and does not allocate — while the
one with inner r_i and J_i, rJr_v2(r) is allocating like crazy.

You can see rJr_v3(r) is also not much better.

IlianPihlajamaa · June 9, 2024, 6:07pm

Be careful, all your v2 and v3 versions of the code dont do what you want to. Make sure when optimizing that you dont change the end result.

In particular, in your inner loops, you do r_i .= rr[i], where you should do r_i .= r[i]. I think that should solve the weird performance results (if you also fix the type of the static vector as pointed out above).

ddcampayo · June 9, 2024, 7:20pm

Oh, man, I can’t believe those globals have been messing with my functions again! The “r2” functions where ok, because the “rr” inside them overrode the global one. Not so for the rJr, where “rr” is a total bug.

Thank you very much, now results make complete sense

  668.019 ns (1 allocation: 7.94 KiB)
  594.266 ns (1 allocation: 7.94 KiB)
  8.133 μs (2 allocations: 8.02 KiB)
  3.779 μs (1 allocation: 7.94 KiB)
  3.703 μs (1 allocation: 7.94 KiB)
  76.301 μs (1004 allocations: 86.31 KiB)

And, seriously, is there a way to stop global variables from entering functions? I have been appending “gl_” to all my global variables in my code, but sometimes (as in this example) I forget.

IlianPihlajamaa · June 9, 2024, 8:03pm

Yes there are ways. Either you

create a main() function and run that so you dont have any globals
avoid using the same variable names in your functions as existing globals (avoid name shadowing). Perhaps you can use some naming convention like putting globals in all capitals or appending _global to the name (as you are doing)
Sometimes cleaning up your globals by putting them in structs or named tuples can also help to limit the surface for these types of issues.

But (as far as i know) there is no macro or syntax to prohibit a function from using any (untyped) global variables.

ddcampayo · June 9, 2024, 9:42pm

Thank you for your suggestions, and thanks to all for helping. Marking this one as solved. Sorry, I can only mark one answer as “solution”, but the global variable stuff is really my fault, and a bug.

Elrod · June 10, 2024, 12:18am

ddcampayo:

    rr = Vector{Float64}(undef,N)

    for i in 1:N
        r_i = rr[i]
        rr[i] = dot( r_i , r_i )
    end

    return rr

end

@btime r2_v2($rr) ;

function r2_v3(  r  ) 
    N = length(r)

    rr = Vector{Float64}(undef,N)
    r_i = zeros(2)
    
    for i in 1:N
        r_i .= rr[i]

These implementations are wrong.

Eben60 · June 10, 2024, 11:27am

Has been repeatedly bitten by globals forgotten in the body of a function. One of the ways of preventing it is (temporarily) putting your function into a let block:

foo = let 
    function foo(x)
        return x
    end
end

But actually, after inflicting myself more pain that was necessary, I came to conclusion that putting all my functions into a package from the very beginning, while keeping globals outside of it, solves this problem, too.

Topic		Replies	Views
High performance vector/matrix/tensor linear algebra operations Performance question , performance , linearalgebra	9	529	January 20, 2023
Correct way to benchmark StaticVectors General Usage benchmarktools , staticarrays	4	148	September 13, 2024
Performance optimization on lots of small linear algebra operations Performance question , staticarrays	13	1006	July 8, 2022
Optimizing iteration over slices of multiple matrices Performance question , array	14	1869	December 30, 2018
Multiplication of vector of matrices and vector of vectors Performance linearalgebra , tullio	3	411	November 7, 2022

Efficient operations on vectors of static things

Related topics