Avoiding allocations of small but non-trivial arrays (work array alternative?)

Hi everyone,

I’m in the process of optimizing a portion of my code that suffers from a general problem of too much memory allocation, and consequently poor threading efficiency and overall performance. It’s a series of pretty simple functions that are called a gazillion times, each of which needs various small temporary arrays.

I’m able to rely on StaticArrays (which, as I understand it, get allocated on the stack), but only when the arrays in question can be initialized in a single line of code. In other situations, when I need to perform a small computation to build my array, the only solution I found was defining “work” arrays and passing them around. This can get annoying, because I use several small arrays.

Consider this MWE:

function barw_array( xnodes )
    n = length(xnodes)
    # barycentric-style weights: barw[j] = 1 / prod over k != j of (xnodes[j] - xnodes[k])
    barw = Array{Float64}(undef,n)
    for j ∈ 1:n
        barw[j] = 1.0
        for k ∈ 1:n
            if k != j
                barw[j] = barw[j] / (xnodes[j] - xnodes[k])
            end
        end
    end
    return sum(barw)
end

function run_barw_array()
    xnodes = LinRange(0,1,4)

    for i ∈ 1:100000
        u = barw_array( xnodes )
    end
end

@time run_barw_array()

This results in too much memory allocation (and the allocation count grows with the number of loop iterations).

 0.292006 seconds (2.63 M allocations: 109.113 MiB, 7.85% gc time, 15.11% compilation time)

The work-array version doesn’t have this problem:

function barw_workvector( barw, xnodes )
    n = length(xnodes)
    for j ∈ 1:n
        barw[j] = 1.0
        for k ∈ 1:n
            if k != j
                barw[j] = barw[j] / (xnodes[j] - xnodes[k])
            end
        end
    end
    return sum(barw)
end

function run_barw_workvector()
    xnodes = LinRange(0,1,4)

    barw = Array{Float64}(undef,length(xnodes))
    for i ∈ 1:100000
        u = barw_workvector( barw, xnodes )
    end

end

@time run_barw_workvector()
0.005938 seconds (2 allocations: 192 bytes)

Now, I would love to be able to obtain this functionality with StaticArrays.

I’ve been reading some threads in the forum, like this one:

Can Julia optimize mutable static arrays to be allocated on the stack?

but in the proposed solution, they use an MArray in conjunction with a work array. If I try to allocate an MArray in my original function

barw = MVector{n,Float64}(undef)

it seems that these vectors are allocated on the heap, and therefore I have the same poor performance as in the original code.

For a related discussion, this was also an interesting thread from 2018: Fortran vs Julia stack-allocated arrays.

I also tried using StaticArrays with generator constructors that call a function, just to see if that worked

using StaticArrays

function xdiff(xnodes,i)
    xd = 1.0
    for k ∈ eachindex(xnodes)
        if k != i
            xd = xd / (xnodes[i] - xnodes[k])
        end
    end
    return xd
end

function barw_sarray_constructor( xnodes )
    n = length(xnodes)
    barw = SVector{n,Float64}( ( xdiff(xnodes,i) for i in 1:n )... )
    s = sum( barw )
    return s
end


function run_barw_sarray_constructor()
    xnodes = LinRange(0,1,4)

    for i ∈ 1:100000
        u = barw_sarray_constructor( xnodes )
    end

end

@time run_barw_sarray_constructor()

This effectively doesn’t allocate. But quite apart from the mess this would make of the code, there is also a performance penalty – benchmarking with @btime I got:

  7.868 ms (0 allocations: 0 bytes)
  4.956 ms (2 allocations: 192 bytes)

Any ideas? Maybe there is a package that enables something along the lines of stack-allocated mutable arrays, or that makes this sort of StaticArray constructor work properly?

Thanks!

Could you write the timing lines next to all the timing results? It seems obvious for the most part, but I’m not sure what the 7.868 ms vs 4.956 ms @btime timings are, because the preceding code block only contains one @time line. Part of my confusion is that you say there is a performance penalty, but the 4.956 ms after the SVector code is better than the 0.005938 s after the work-vector code.

I’m also interested to learn the current state of stack-allocating fixed-size mutables, good topic.

Probably because n is not known to the function at compile time. One alternative is passing it as a Val parameter; then the MArray version will likely be non-allocating as well.

Edit: for example, this:

julia> function barw_array( xnodes, ::Val{n} ) where n
           barw = zeros(MVector{n,Float64})
           for j ∈ 1:n
               barw[j] = 1.0
               for k ∈ 1:n
                   if k != j
                       barw[j] = barw[j] / (xnodes[j] - xnodes[k])
                   end
               end
           end
           return sum(barw)
       end
barw_array (generic function with 2 methods)

julia> function run_barw_array()
           xnodes = LinRange(0,1,4)
           n = Val(length(xnodes))
           for i ∈ 1:100000
               u = barw_array( xnodes, n)
           end
       end
run_barw_array (generic function with 1 method)

julia> @btime run_barw_array()
  2.395 ms (0 allocations: 0 bytes)

Maybe off topic, but you can sum directly over the generator, no need for the intermediate array.
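For instance, reusing the xdiff helper from the original post (the function name below is just illustrative):

function barw_sum_generator( xnodes )
    # sum over the generator directly; no intermediate SVector is built
    return sum( xdiff(xnodes,i) for i in 1:length(xnodes) )
end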

Is it possible to allocate a single standard array for all the computations and reuse the memory? If you make your functions mutating ones that reuse that memory, you can reduce the allocations inside the hot loop to zero and just allocate once beforehand.

Even if the array needs to change size: if you know the largest size beforehand, allocate at the largest size and use a view to get a reference to a smaller array backed by the same memory.
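A sketch of that pattern (the names work and compute! are illustrative; views of a Vector generally don’t allocate in recent Julia versions):

function compute!( work, xnodes )
    n = length(xnodes)
    barw = view(work, 1:n)   # a smaller array backed by the preallocated buffer
    barw .= 1.0
    for j ∈ 1:n, k ∈ 1:n
        if k != j
            barw[j] /= (xnodes[j] - xnodes[k])
        end
    end
    return sum(barw)
end

function run_compute!( work, xnodes )
    s = 0.0
    for i ∈ 1:100000
        s += compute!( work, xnodes )   # no allocations inside the hot loop
    end
    return s
end

work = Vector{Float64}(undef, 64)   # allocate once, at the largest size needed
run_compute!( work, LinRange(0,1,4) )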

If StaticArrays is really what you want, you can do n = Val(length(xnodes)) as suggested by @lmiq and pass this into a function (so it can know how much memory to allocate on the stack at compile time).

That’s great, this works wonderfully!

I didn’t know about Val parameters, so I will look into them. Here’s the section of the manual for anyone who might be interested: Value types
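For anyone else who hasn’t seen them, a minimal illustration (the function f is hypothetical, not from the code above):

f(::Val{n}) where n = n + 1   # n is a compile-time constant inside the body
f(Val(4))                     # the compiler specializes this call on n = 4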

Yes, I think this is what I called a “work” array. There are many problems with this approach; for example, in multi-threaded computations these work arrays can give rise to race conditions.

MArrays and Value-types seem the way to go!

The array is indeed your problem here, but, at least in this MWE, it looks like it is not needed:

using LoopVectorization
function barw_array( xnodes )
    n = length(xnodes)
    sum_barw = 0.0
    @turbo for j ∈ 1:n
        barw_j = 1.0
        for k ∈ 1:n
            if k != j
                barw_j *= (xnodes[j] - xnodes[k])
            end
        end
        sum_barw += 1/barw_j
    end
    return sum_barw
end

This should be much, much faster and should not allocate. I also reduced the number of divisions by a factor of n (multiplying the factors together and dividing only once), which should improve performance a lot too.

Tell me what you think of the performance.

Oh, thanks, that looks nice!

The MWE was indeed a simplification of what I need to do (I don’t return the sum of barw, but use the array in conjunction with other data), but it is true that I could try to do something similar to what you’ve done in the real case.

There’s a brand-new package that seems made for you. It’s probably not yet registered nor announced, but you can still use/try it:

In theory Julia could heap-allocate for you and free the memory before barw_array returns, but that would be slower. Or it could allocate on the stack, but stack space is limited, so that’s dangerous: you don’t know in advance how large the array is going to be, so I doubt that’s a better policy (unless the runtime checked the size and only stack-allocated when small).

There’s no need to pass the length of xnodes as a Val; just make the input xnodes a static vector, and the compiler will know n statically.
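A sketch of that approach (the function name is illustrative; note the one-time conversion at the call site):

using StaticArrays

function barw_array_static( xnodes::SVector{n,Float64} ) where n
    barw = ones(MVector{n,Float64})   # n is known statically from the input type
    for j ∈ 1:n, k ∈ 1:n
        if k != j
            barw[j] /= (xnodes[j] - xnodes[k])
        end
    end
    return sum(barw)
end

xnodes = SVector{4}(LinRange(0,1,4))   # convert once, outside the hot loop
barw_array_static(xnodes)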

Were you able to compile this? I get the following error:

LoadError: Don't know how to handle expression.
if k != j

LoopVectorization doesn’t handle all Julia code. Try splitting the loop into 1:j-1 and j+1:n, perhaps.

Even that might not work, then I’d try moving @turbo to the inner loop.
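Something like this, for the summing version upthread (a sketch; whether @turbo accepts it would need checking):

function barw_sum_split( xnodes )
    n = length(xnodes)
    sum_barw = 0.0
    for j ∈ 1:n
        barw_j = 1.0
        for k ∈ 1:j-1       # the two ranges together skip k == j without a branch
            barw_j *= (xnodes[j] - xnodes[k])
        end
        for k ∈ j+1:n
            barw_j *= (xnodes[j] - xnodes[k])
        end
        sum_barw += 1/barw_j
    end
    return sum_barw
end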

No, not unless they are global (only a global array, or one that you allocate once and then pass around between threads, could have such problems).

The package (@MasonProtter’s) I pointed you to in my last comment does have a global array, and would have problems with threads, except that, as documented, it also has an array for each thread. Read its docs.
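Independent of that package, the general pattern looks something like this (a sketch, reusing barw_workvector from the original post; allocating the buffer inside each task means nothing is shared between threads):

using Base.Threads

xnodes = LinRange(0,1,4)
results = zeros(nthreads())

@threads for tid ∈ 1:nthreads()
    barw = Vector{Float64}(undef, length(xnodes))   # one work array per task
    for i ∈ tid:nthreads():100000
        results[tid] += barw_workvector(barw, xnodes)
    end
end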

Sure, I will take a look at that package (however, I’m always reluctant to add more dependencies).

Having one array for each thread sounds like a sensible choice.

As a quick update, here’s my complete test (which is not much larger than the original one). It’s the barycentric interpolation formula, which I’m hoping to specialize for a fixed degree and arbitrary nodes.
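For reference, the formula being computed is

$$p(t) = \frac{\sum_j \frac{w_j}{t - x_j} f_j}{\sum_j \frac{w_j}{t - x_j}}, \qquad w_j = \frac{1}{\prod_{k \ne j} (x_j - x_k)},$$

with the special case $p(t) = f_j$ when $t$ coincides with a node $x_j$ (hence the abs(tdiff) check in the code below).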

I tested 3 versions, the first one without any array allocations (using barw_j instead of barw[j]), the second one using MVectors and Val-types, and the third one hard-coding the case n=4 for a standard StaticArray.

Long story short, MVectors are faster than the version which doesn’t use an array at all (this is surprising), and come close to the performance of the hard-coded version. In all cases, I’m using multiplications and a single division instead of multiple divisions, which indeed proved faster.

using LoopVectorization
using StaticArrays
using BenchmarkTools

function Interp1d_no_array( xnodes, fvals, t )
    n = length(xnodes)
    numt, denomt = 0.0, 0.0   # Float64 zeros keep the accumulators type-stable

    @inbounds for j ∈ eachindex(xnodes)
        barw_j = 1.0
        for k ∈ 1:n
            if k != j
                barw_j *= (xnodes[j] - xnodes[k])
            end
        end
        barw_j = 1.0 / barw_j
        tdiff = t - xnodes[j]
        numt = numt + barw_j / tdiff * fvals[j]
        denomt = denomt + barw_j / tdiff
        if ( abs(tdiff) < 1e-14 )
            numt = fvals[j]
            denomt = 1.0
            break
        end
    end
    return numt / denomt
end

function Interp1d_MVector( xnodes, fvals, t, ::Val{n} ) where n 
    
    barw = zeros(MVector{n,Float64})
    for j ∈ 1:n
        barw[j] = 1.0
        for k ∈ 1:n
            if k != j
                barw[j] = barw[j] * (xnodes[j] - xnodes[k])
            end
        end
        barw[j] = 1.0/barw[j]
    end

    numt, denomt = 0.0, 0.0
    @inbounds for j ∈ eachindex(xnodes)
        tdiff = t - xnodes[j]
        numt = numt + barw[j] / tdiff * fvals[j]
        denomt = denomt + barw[j] / tdiff
        if ( abs(tdiff) < 1e-14 )
            numt = fvals[j]
            denomt = 1.0
            break
        end
    end
    return numt / denomt
end

function Interp1d_Hardcoded( xn, fvals, t )
    
    barw = SA[ 1.0/((xn[1]-xn[2])*(xn[1]-xn[3])*(xn[1]-xn[4])),
               1.0/((xn[2]-xn[1])*(xn[2]-xn[3])*(xn[2]-xn[4])),
               1.0/((xn[3]-xn[1])*(xn[3]-xn[2])*(xn[3]-xn[4])),
               1.0/((xn[4]-xn[1])*(xn[4]-xn[2])*(xn[4]-xn[3])) ]

    numt, denomt = 0.0, 0.0
    @inbounds for j ∈ eachindex(xn)
        tdiff = t - xn[j]
        numt = numt + barw[j] / tdiff * fvals[j]
        denomt = denomt + barw[j] / tdiff
        if ( abs(tdiff) < 1e-14 )
            numt = fvals[j]
            denomt = 1.0
            break
        end
    end
    return numt / denomt
end


function run_Interp1d_no_array()
    xnodes = LinRange(0,1,4)
    fvals = xnodes.^4
    t = 0.3
    
    for i ∈ 1:100000
        u = Interp1d_no_array( xnodes, fvals, t )
    end
end

function run_Interp1d_MVector()
    xnodes = LinRange(0,1,4)
    fvals = xnodes.^4
    t = 0.3
    
    n = Val(length(xnodes))
    for i ∈ 1:100000
        u = Interp1d_MVector( xnodes, fvals, t, n )
    end
end

function run_Interp1d_Hardcoded()
    xnodes = LinRange(0,1,4)
    fvals = xnodes.^4
    t = 0.3
    
    for i ∈ 1:100000
        u = Interp1d_Hardcoded( xnodes, fvals, t )
    end
end

@btime run_Interp1d_no_array()
4.481 ms (1 allocation: 96 bytes)

@btime run_Interp1d_MVector()
3.745 ms (1 allocation: 96 bytes)

@btime run_Interp1d_Hardcoded()
3.074 ms (1 allocation: 96 bytes)

So if the MArrays are working for you, I’d advise you to just use that. However, in your actual program will n always be knowable at compile time, or is it dynamic? Because that can severely change the timings. The MVector will also become very slow if n gets too big.

Yes, the MArrays are working fine for me now, in combination with Value-types.

n is definitely known at compile time, and is very small (in the range of 4–8). (The reason performance matters to me is that this routine is called a gazillion times, each time with different data.)

The only thing I’m leaving for future work is the possibility of using SIMD here. However, I couldn’t get around the if statements, which seem to preclude its use. I guess I would need to find a way to rewrite the expressions.

You can vectorise the loop and avoid the branch: the inner loop

for k ∈ 1:n
    if k != j
        barw_j *= (xnodes[j] - xnodes[k])
    end
end

becomes

for k ∈ 1:n
    barw_j *= (xnodes[j] - xnodes[k]) + (k == j)
end

This gets rid of the if statement: when k == j the difference is zero, so adding the Boolean makes that factor exactly 1. I don’t know how efficient it is, but it might be worth checking whether it works with LoopVectorization.

Good try, but LoopVectorization.jl doesn’t recognize the expression…

Expression not recognized.
for k = 1:n
    #= interp_turbo.jl:11 =#
    barw[j] = mul_fast(add_fast(sub_fast(xn[j], xn[k]), k == j), barw[j])
    #= interp_turbo.jl:12 =#
end

I have another version that gets rid of the if statement and takes advantage of some symmetry (that also won’t work with LoopVectorization).

    @inbounds for j ∈ 1:n   # barw is assumed to be initialized to ones
        for k ∈ j+1:n
            xjk = (xn[j] - xn[k])
            barw[j] *= xjk
            barw[k] *= -xjk
        end
    end

but, curiously, this is actually slower than the first, “naive” version with the if statement.

So maybe the compiler is, after all, performing some sort of vectorization on the first version.
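One way to check (a sketch; it assumes the branchy Interp1d_no_array from upthread is defined) is to inspect the generated code:

using InteractiveUtils   # provides @code_llvm outside the REPL

xnodes = LinRange(0,1,4)
fvals = xnodes.^4
# look for <4 x double> (or similar) vector operations in the printed IR
@code_llvm Interp1d_no_array(xnodes, fvals, 0.3)

If vector types like <4 x double> show up, the compiler auto-vectorized at least part of the loop.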