Struggling to get Julia as fast as Matlab mex. (Filling large matrix, doing pairwise calculations over particles)

I have a system with Ntot particles. I am creating a matrix Mh that is 6Ntot x 6Ntot and symmetric, and elements are given by pairwise calculations between particles.

I’ve tried my best to speed up my Julia code, but feel like I’m missing something. For Ntot = 500, Matlab mex time = 0.087 s and Julia = 0.18 s. The function is being called every timestep of a simulation so the difference is adding up.

The function is essentially: loop over all particles pairwise, calculate separation rij, calculate submatrices Mtt, Mtr, Mrt, Mrr and place into matrix Mh. I’ve put some code below. X3 is a 3Ntot length vector (particle positions), ‘a’ is an Ntot length vector (particle radii). Any help greatly appreciated.

using LinearAlgebra, BenchmarkTools

function skew!(s,a)
    s[1,2] = -a[3]
    s[1,3] = a[2]
    s[2,1] = a[3]
    s[2,3] = -a[1]
    s[3,1] = -a[2]
    s[3,2] = a[1]
end

function calc_Mh!(Mh,X3,Ntot,a)
    Mtt = zeros(3,3)
    Mtr = zeros(3,3)
    Mrt = zeros(3,3)
    Mrr = zeros(3,3)
    rhat_skew = zeros(3,3)
    rhat = zeros(3)
    rij = zeros(3)
    Ii = Matrix(I,3,3)
    rhat_op = zeros(3,3)
    for i = 1:Ntot
        xi = @views X3[3(i-1)+1:3*i];        
        for j = i+1:Ntot
            xj = @views X3[3*(j-1)+1:3*j];

            rij .= xi.-xj
            r = norm(rij)
            rhat .= rij./r
            skew!(rhat_skew,-rhat)

            sqrd = (a[i]^2 + a[j]^2)/r^2
            rhat_op .= rhat*rhat'

            Mtt  .= 1/(2*r) * ((1 + sqrd/3)*I + (1-sqrd)*rhat_op)
            Mtr  .= 1/(2*r^2)*rhat_skew
            Mrt  .= Mtr
            Mrr  .= 1/(4*r^3) * (3*rhat_op - I)

            Mh[3*(i-1)+1:3*i, 3*(j-1)+1:3*j] .= Mtt
            Mh[3*(i-1)+1:3*i, 3*(Ntot)+3*(j-1)+1:3*(Ntot)+3*j] .= Mtr
            Mh[3*(Ntot)+3*(i-1)+1:3*(Ntot)+3*i, 3*(j-1)+1:3*j] .= Mrt
            Mh[3*(Ntot)+3*(i-1)+1:3*(Ntot)+3*i, 3*(Ntot)+3*(j-1)+1:3*(Ntot)+3*j] .= Mrr
        
            Mh[3*(j-1)+1:3*j, 3*(i-1)+1:3*i] .= Mtt'
            Mh[3*(j-1)+1:3*j, 3*(Ntot)+3*(i-1)+1:3*(Ntot)+3*i] .= Mtr'
            Mh[3*(Ntot)+3*(j-1)+1:3*(Ntot)+3*j, 3*(i-1)+1:3*i] .= Mrt'
            Mh[3*(Ntot)+3*(j-1)+1:3*(Ntot)+3*j, 3*(Ntot)+3*(i-1)+1:3*(Ntot)+3*i] .= Mrr'
        end
        
        Mtt .= 2/(3*a[i])*Ii
        Mrr .= 1/(2*a[i]^3)*Ii        
        Mh[3*(i-1)+1:3*i, 3*(i-1)+1:3*i] .= Mtt
        Mh[3*(Ntot)+3*(i-1)+1:3*(Ntot)+3*i, 3*(Ntot)+3*(i-1)+1:3*(Ntot)+3*i] .= Mrr
    end    
    Mh .= Mh * 1/(4*pi);
    return Mh
end

Ntot = 500
a = 0.05*ones(Ntot)
X3 = 10*rand(3*Ntot)
Mh  = zeros(Ntot*6,Ntot*6)
@btime calc_Mh!(Mh,X3,Ntot,a)
1 Like

Before reinventing the wheel, check the Distances.jl package to see if all you need is pairwise distances. They have efficient implementations that are widely tested for different distance types.

3 Likes

I think this is allocating an intermediate array at every loop iteration.

It is possible that you can have a faster and cleaner code using StaticArrays for all these small matrices and vectors that you need.

2 Likes
julia> @btime calc_Mh!(Mh,X3,Ntot,a)`
  99.728 ms (1123763 allocations: 268.80 MiB)

You are allocating way too much for this to be really fast. The preallocated variables are a great start but e.g.

Mtt .= 2/(3*a[i]) * Ii

should be

Mtt .= 2/(3*a[i]) .* Ii

to get fully fused and avoid intermediate storage.

If you already have a mex implementation as C code it’s probably very straightforward to port that code to Julia and I expect you will see a similar performance.

2 Likes

Thanks for pointing out this package, this looks useful! Although I’m not sure if I can take advantage too much because I also need to calculate other pairwise quantities that depend on the separation vector rather than just Euclidean distance, so I’m not sure if the methods they use in that package can be leveraged fully. Although I may be able to pre-calculate as many quantities as possible using that package before entering my for loops. Cheers!

using StaticArrays

skew(a) = @SMatrix [0 -a[3] a[2]; a[3] 0 -a[1]; -a[2] a[1] 0]

function calc_Mh_StaticArrays!(Mh,X3,Ntot,a)
	I3 = Diagonal(SVector(true,true,true))
    for i = 1:Ntot
        xi = @views SVector{3}(X3[3(i-1)+1:3*i]);
        for j = i+1:Ntot
            xj = @views SVector{3}(X3[3*(j-1)+1:3*j]);

            rij = xi.-xj
            r = norm(rij)
            rhat = rij/r
            rhat_skew = skew(-rhat)

            sqrd = (a[i]^2 + a[j]^2)/r^2
            rhat_op = rhat*rhat'

            Mtt  = 1/(2*r) * ((1 + sqrd/3)*I + (1-sqrd)*rhat_op)
            Mtr  = 1/(2*r^2)*rhat_skew
            Mrt  = Mtr
            Mrr  = 1/(4*r^3) * (3*rhat_op - I)

            Mh[3*(i-1)+1:3*i, 3*(j-1)+1:3*j] = Mtt
            Mh[3*(i-1)+1:3*i, 3*(Ntot)+3*(j-1)+1:3*(Ntot)+3*j] = Mtr
            Mh[3*(Ntot)+3*(i-1)+1:3*(Ntot)+3*i, 3*(j-1)+1:3*j] = Mrt
            Mh[3*(Ntot)+3*(i-1)+1:3*(Ntot)+3*i, 3*(Ntot)+3*(j-1)+1:3*(Ntot)+3*j] = Mrr

            Mh[3*(j-1)+1:3*j, 3*(i-1)+1:3*i] = Mtt'
            Mh[3*(j-1)+1:3*j, 3*(Ntot)+3*(i-1)+1:3*(Ntot)+3*i] = Mtr'
            Mh[3*(Ntot)+3*(j-1)+1:3*(Ntot)+3*j, 3*(i-1)+1:3*i] = Mrt'
            Mh[3*(Ntot)+3*(j-1)+1:3*(Ntot)+3*j, 3*(Ntot)+3*(i-1)+1:3*(Ntot)+3*i] = Mrr'
        end

        Mtt = 2/(3*a[i])*I3
        Mrr = 1/(2*a[i]^3)*I3
        Mh[3*(i-1)+1:3*i, 3*(i-1)+1:3*i] = Mtt
        Mh[3*(Ntot)+3*(i-1)+1:3*(Ntot)+3*i, 3*(Ntot)+3*(i-1)+1:3*(Ntot)+3*i] = Mrr
    end
    Mh .= Mh .* (1/(4*pi));
    return Mh
end
julia> @btime calc_Mh!($Mh,$X3,$Ntot,$a);
  221.261 ms (1123763 allocations: 268.80 MiB)

julia> @btime calc_Mh_StaticArrays!($Mh,$X3,$Ntot,$a);
  28.496 ms (0 allocations: 0 bytes)

8x speedup with a pretty straightforward conversion to StaticArrays.jl. Some of that could have been saved by being better about broadcasting in your initial implementation, but for these 3x3 matrices you won’t beat StaticArrays.

4 Likes

Can you disregard distances greater than cutoff? If so check: GitHub - m3g/CellListMap.jl: Flexible implementation of cell lists to map the calculations of particle-pair dependent functions, such as forces, energies, neighbour lists, etc.

1 Like

@GunnarFarneback
Thanks, I didn’t realise I was missing the .* for Float * Matrix operations. Went through and added that throughout and helped a lot, cheers

@mikmoore
Thanks a lot, I hadn’t seen static arrays before but this looks great, I’ll read up on this and start implementing

@Imiq
Sadly I can’t disregard distances greater than a cutoff, although it’s good to be aware of that package for future use!

1 Like