Trouble Pre-allocating Outputs

loki · May 17, 2020, 12:59am

Hello, I was hoping to get some advice regarding how to properly go about Pre-allocating outputs for a more complicated example and speeding a function. I’ve read through others’ posts and the performance guide but was still having some trouble eliminating a garbage problem/performance issues with respect to my do_rotation function.

In my do_rotation function, M_in, B1, M_out are all an 1x3 array so I believe I should be able to do something similar to the example in the performance guide but I am struggling seeing how to best go about it as the way I tried to implement hasn’t done much to improve performance. What would be a good way to go about making this calculation be performed in place? Would it be better to break up this function even further (dividing it at each .+ respectively), i.e. more gate-way functions?

Any suggestions would be greatly appreciate! Eventually I would like to evolve sim into something that can implement it on multi-threads or ideally GPUs so any recommendations at this stage would be greatly appreciated!

Below I’ve included a clip showing the @profiler results and a working copy of the code as well.

using LinearAlgebra

function do_rotation!(M_out::Array{Float64,1},M_in::Array{Float64,1},
                      B1::Array{Float64,1}, θ::Float64)
    M_out .=
        M_in .* cos(θ) .+ cross(B1./norm(B1), M_in) .* sin(θ) .+
        B1./norm(B1) .* dot(B1./norm(B1), M_in) .* (1 - cos(θ))
end
function B1_vec!(B1::Array{Float64,1}, b1::Complex{Float64}, X::Float64,
                 G::Float64,F::Float64,
)
    B1[1] = real(b1)
    B1[2] = imag(b1)
    B1[3] = X*G + F
end
function sim(X::Array{Float64,1},resol::Float64,b1::Array{Complex{Float64},1},G::Array{Float64,1},
             F::Array{Float64,1} ,M::Array{Float64,2},
)
    Output = Array{Float64}(undef, 3, length(X))
    B1 = Array{Float64,1}(undef, 3)
    M_out = Array{Float64,1}(undef,3)
    for x in eachindex(X)
        for tt = 1:length(b1)-1
            B1_vec!(B1, b1[tt], X[x], G[tt], F[tt])

            if abs(norm(B1)*resol) != 0.0
                do_rotation!(M_out,M[:, tt], B1, norm(B1) * resol)
                M[:, tt+1] = M_out
            else
                M[:, tt+1] = M[:, tt]
            end
        end
        Output[:, x] .= M[:, end]
    end
    return Output
end

##
time_pts = 1000;
x_pts = 100;
resol = 4e-5;

X = collect(range(-10/2,10/2,length = x_pts));
B = zeros(ComplexF64,time_pts);
G = ones(time_pts)
F = ones(time_pts)

M = Array{Float64}(undef,3,time_pts);
M[1,:] = zeros(Float64,time_pts);
M[2,:] = zeros(Float64,time_pts);
M[3,1] = 0.5;

Output = @profiler sim(X,resol,B,G,F,M)

Ronis_BR · May 17, 2020, 3:05am

First, using @views to avoid allocation at those slicing of arrays together with @inbounds, I could reduce the execution time from 24.354 ms to 20.632 ms. I did not look very carefully, maybe you can tweak a little more, but I am not sure if the performance will be much good.

However, since you are basically dealing with 3x1 vectors, then you should really using StaticArrays.jl. The SVector is a immutable vector allocated in the stack that is much faster if you are working with small arrays. I modified you code to use SVector (hope I did everything right):

using LinearAlgebra
using BenchmarkTools
using StaticArrays

@inline do_rotation(M_in, B1, θ) =
    M_in * cos(θ) + cross(B1/norm(B1), M_in) * sin(θ) + B1/norm(B1) * dot(B1/norm(B1), M_in) * (1 - cos(θ))

@inline B1_vec(b1, X, G, F) = SVector{3,Float64}(real(b1), imag(b1), X*G + F)

function sim(X, resol, b1, G, F, M)

    Output = Vector{SVector{3,Float64}}(undef, length(X))

    @inbounds @views for x in eachindex(X)
        for tt = 1:length(b1)-1
            B1 = B1_vec(b1[tt], X[x], G[tt], F[tt])

            if abs(norm(B1)*resol) != 0.0
                M[tt+1] = do_rotation(M[tt], B1, norm(B1) * resol)
            else
                M[tt+1] = M[tt]
            end
        end
        Output[x] = M[end]
    end

    return Output
end

##
time_pts = 1000;
x_pts = 100;
resol = 4e-5;

X = collect(range(-10/2,10/2,length = x_pts));
B = zeros(ComplexF64,time_pts);
G = ones(time_pts)
F = ones(time_pts)

M = [@SVector zeros(3) for _ = 1:time_pts]
M[1] = @SVector [0.0,0.0,0.5]

@btime sim($X, $resol, $B, $G, $F, $M)

Now, take a look at the benchmark:

julia> include("teste_views.jl")
  2.352 ms (1 allocation: 2.50 KiB)

It is 10x faster. The only “problem” is that your output is a vector of vectors instead of a big matrix. But you can convert it easily.

loki · May 23, 2020, 5:36pm

@Ronis_BR Thank you for taking the time to re-write the code using @views and StaticArrays. Seeing the implemented help a lot with translating the documentation into practical knowledge.

I’m not sure how familiar you are with @threads or CUarrays but with the overall structure of the algorithm, do you think it would be easy to parallelize? My only hesitation is populating M across multiple workers correctly.

Ronis_BR · May 23, 2020, 6:31pm

Nice!

I am not sure if this can be parallelized. The “slow” part is the do_rotation. However, to compute the i+1-th element, it need that i-th. Hence, I do not see how this can be computed in parallel. Am I missing something?

loki · May 23, 2020, 7:44pm

I do agree that the inner most for loop (for tt…) cannot be parallelized due to needing to know the previous “state”, however the outer most for loop (for x…) is independent of other x’s which should lends it self to parallelization though maybe not in the form its written currently.

loki · September 15, 2020, 9:47pm

@Ronis_BR I am sorry if its in appropriate to add to an old topic. I’ve been using the code you posted earlier as a guild to build off of but I ran into a little snag if I wanted to extend the static vector M to vary with position. Using your code I’ve tried to do that below

#1D
M_1d = [[@SArray zeros(3) for _ = 1:x_pts] for _ = 1:time_pts]

for i in 1:x_pts
       M_1d[1][i] = @SVector[0.0,0.0,0.1];
   end

However, this seems like a terrible way to go about dealing with M, especially if I extend this to 2 and 3 dimensions, where I would have something like M[time_pts][x_pts][y_pts]

using LinearAlgebra
using BenchmarkTools
using StaticArrays

@inline do_rotation(M_in, B1, θ) =
    M_in * cos(θ) + cross(B1/norm(B1), M_in) * sin(θ) + B1/norm(B1) * dot(B1/norm(B1), M_in) * (1 - cos(θ))

@inline B1_vec(b1, X, G, F) = SVector{3,Float64}(real(b1), imag(b1), X*G + F)

function sim(X, resol, b1, G, F, M)

    Output = Vector{SVector{3,Float64}}(undef, length(Y),length(X))

    @inbounds @views for x in eachindex(X)
        for tt = 1:length(b1)-1
            B1 = B1_vec(b1[tt], X[x], G[tt], F[tt])

            if abs(norm(B1)*resol) != 0.0
                M[tt+1][x] = do_rotation(M[tt][x], B1, norm(B1) * resol)
            else
                M[tt+1][x] = M[tt][x]
            end
        end
        Output = M[end][x]
    end

    return Output
end

##
time_pts = 1000;
x_pts = 100;
resol = 4e-5;

X = collect(range(-10/2,10/2,length = x_pts));
B = zeros(ComplexF64,time_pts);
G = ones(time_pts)
F = ones(time_pts)

M = [@SVector zeros(3) for _ = 1:time_pts]
M[1] = @SVector [0.0,0.0,0.5]

A sample 2D method would look something like this but again it seems terrible inefficient.


#Sample 2d Method
M_2d = [[[@SArray zeros(3) for _ = 1:x_pts] for _ = 1:y_pts] for _ = 1:time_pts]

for j in 1:y_pts, i in 1:x_pts
    M_2d[1][j][i] = @SVector[0.0,0.0,0.1];
end

function sim_2d(X,Y, resol, b1, G, F, M_2d)

    Output_2d = Array{SVector{3,Float64}}(undef, length(Y),length(X))

    @inbounds @views for y in eachindex(Y), x in eachindex(X)
        for tt = 1:length(b1)-1
            B1 = B1_vec(b1[tt], X[x], G[tt], F[tt])

            if abs(norm(B1)*resol) != 0.0
                M_2d[tt+1][y][x] = do_rotation(M_2d[tt][y][x], B1, norm(B1) * resol)
            else
                M_2d[tt+1][y][x] = M_2d[tt][y][x]
            end
        end
        Output_2d = M[end][y][x]
    end

    return Output
end

Ronis_BR · September 16, 2020, 12:46am

Hi @loki,

Why don’t you just create a matrix of SVectors:

julia> M = [@SArray zeros(3) for i = 1:10, j = 1:10]

which will be accessible by:

julia> M[2,3]
3-element SArray{Tuple{3},Float64,1,3} with indices SOneTo(3):
 0.0
 0.0
 0.0

loki · September 16, 2020, 1:46am

@Ronis_BR Thank you for revisiting this! I didn’t realize I could use that syntax when creating an array like that, and as you saw I was definitely over thinking it.

Do you think a Matrix of SVectors on the order of 10e4x6400 falls out of the use case for staticarrays? There is a warning about excessive compiler time when using large arrays and matrices. I do also know that a Matrix of this size is on the order of 15Gb.

Ronis_BR · September 16, 2020, 1:49am

Notice that you will be creating a Matrix of 3x1 SVectors. The warning is about SVector / SMatrix with big dimensions. However, this is a massive amount of data and you can have problems. Unfortunately, I have never worked with something of this size to give you any good advice.

loki · September 16, 2020, 3:30am

@Ronis_BR You have been a huge help regardless! Unfortunately I operate in the realm of stupidly large matrix sizes so I’ve been trying to learn the ins and outs of Julia’s high performance side. I’ve discovered fairly quickly that its easy to code something that works but is terribly inefficient. Thank you again for the help!

Topic		Replies	Views
Avoiding allocations of small but non-trivial arrays (work array alternative?) Performance question	38	4497	November 17, 2022
Help to reduce number of allocations General Usage	23	5492	January 6, 2019
Optimizing iteration over slices of multiple matrices Performance question , array	14	1896	December 30, 2018
Usage of arrays of static arrays New to Julia performance , staticarrays	16	1109	February 22, 2023
Request guidance in optimising the performance of my implementation (running time and allocation) General Usage question	9	982	April 30, 2018

Trouble Pre-allocating Outputs

Related topics