Trouble Pre-allocating Outputs

Hello, I was hoping to get some advice regarding how to properly go about Pre-allocating outputs for a more complicated example and speeding a function. I’ve read through others’ posts and the performance guide but was still having some trouble eliminating a garbage problem/performance issues with respect to my do_rotation function.

In my do_rotation function, M_in, B1, M_out are all an 1x3 array so I believe I should be able to do something similar to the example in the performance guide but I am struggling seeing how to best go about it as the way I tried to implement hasn’t done much to improve performance. What would be a good way to go about making this calculation be performed in place? Would it be better to break up this function even further (dividing it at each .+ respectively), i.e. more gate-way functions?

Any suggestions would be greatly appreciate! Eventually I would like to evolve sim into something that can implement it on multi-threads or ideally GPUs so any recommendations at this stage would be greatly appreciated!

Below I’ve included a clip showing the @profiler results and a working copy of the code as well.

using LinearAlgebra

function do_rotation!(M_out::Array{Float64,1},M_in::Array{Float64,1},
                      B1::Array{Float64,1}, θ::Float64)
    M_out .=
        M_in .* cos(θ) .+ cross(B1./norm(B1), M_in) .* sin(θ) .+
        B1./norm(B1) .* dot(B1./norm(B1), M_in) .* (1 - cos(θ))
end
function B1_vec!(B1::Array{Float64,1}, b1::Complex{Float64}, X::Float64,
                 G::Float64,F::Float64,
)
    B1[1] = real(b1)
    B1[2] = imag(b1)
    B1[3] = X*G + F
end
function sim(X::Array{Float64,1},resol::Float64,b1::Array{Complex{Float64},1},G::Array{Float64,1},
             F::Array{Float64,1} ,M::Array{Float64,2},
)
    Output = Array{Float64}(undef, 3, length(X))
    B1 = Array{Float64,1}(undef, 3)
    M_out = Array{Float64,1}(undef,3)
    for x in eachindex(X)
        for tt = 1:length(b1)-1
            B1_vec!(B1, b1[tt], X[x], G[tt], F[tt])

            if abs(norm(B1)*resol) != 0.0
                do_rotation!(M_out,M[:, tt], B1, norm(B1) * resol)
                M[:, tt+1] = M_out
            else
                M[:, tt+1] = M[:, tt]
            end
        end
        Output[:, x] .= M[:, end]
    end
    return Output
end

##
time_pts = 1000;
x_pts = 100;
resol = 4e-5;

X = collect(range(-10/2,10/2,length = x_pts));
B = zeros(ComplexF64,time_pts);
G = ones(time_pts)
F = ones(time_pts)

M = Array{Float64}(undef,3,time_pts);
M[1,:] = zeros(Float64,time_pts);
M[2,:] = zeros(Float64,time_pts);
M[3,1] = 0.5;

Output = @profiler sim(X,resol,B,G,F,M)

First, using @views to avoid allocation at those slicing of arrays together with @inbounds, I could reduce the execution time from 24.354 ms to 20.632 ms. I did not look very carefully, maybe you can tweak a little more, but I am not sure if the performance will be much good.

However, since you are basically dealing with 3x1 vectors, then you should really using StaticArrays.jl. The SVector is a immutable vector allocated in the stack that is much faster if you are working with small arrays. I modified you code to use SVector (hope I did everything right):

using LinearAlgebra
using BenchmarkTools
using StaticArrays

@inline do_rotation(M_in, B1, θ) =
    M_in * cos(θ) + cross(B1/norm(B1), M_in) * sin(θ) + B1/norm(B1) * dot(B1/norm(B1), M_in) * (1 - cos(θ))

@inline B1_vec(b1, X, G, F) = SVector{3,Float64}(real(b1), imag(b1), X*G + F)

function sim(X, resol, b1, G, F, M)

    Output = Vector{SVector{3,Float64}}(undef, length(X))

    @inbounds @views for x in eachindex(X)
        for tt = 1:length(b1)-1
            B1 = B1_vec(b1[tt], X[x], G[tt], F[tt])

            if abs(norm(B1)*resol) != 0.0
                M[tt+1] = do_rotation(M[tt], B1, norm(B1) * resol)
            else
                M[tt+1] = M[tt]
            end
        end
        Output[x] = M[end]
    end

    return Output
end

##
time_pts = 1000;
x_pts = 100;
resol = 4e-5;

X = collect(range(-10/2,10/2,length = x_pts));
B = zeros(ComplexF64,time_pts);
G = ones(time_pts)
F = ones(time_pts)

M = [@SVector zeros(3) for _ = 1:time_pts]
M[1] = @SVector [0.0,0.0,0.5]

@btime sim($X, $resol, $B, $G, $F, $M)

Now, take a look at the benchmark:

julia> include("teste_views.jl")
  2.352 ms (1 allocation: 2.50 KiB)

It is 10x faster. The only “problem” is that your output is a vector of vectors instead of a big matrix. But you can convert it easily.

2 Likes

@Ronis_BR Thank you for taking the time to re-write the code using @views and StaticArrays. Seeing the implemented help a lot with translating the documentation into practical knowledge.

I’m not sure how familiar you are with @threads or CUarrays but with the overall structure of the algorithm, do you think it would be easy to parallelize? My only hesitation is populating M across multiple workers correctly.

Nice!

I am not sure if this can be parallelized. The “slow” part is the do_rotation. However, to compute the i+1-th element, it need that i-th. Hence, I do not see how this can be computed in parallel. Am I missing something?

I do agree that the inner most for loop (for tt…) cannot be parallelized due to needing to know the previous “state”, however the outer most for loop (for x…) is independent of other x’s which should lends it self to parallelization though maybe not in the form its written currently.

1 Like