Loop over array of static arrays

I have the following loop over arrays (the function double() will obviously be more complicated, but the number of arrays is always the same):


double(x) = 2x
function doubles_using_arrays()
    xs = [ones(10) for i in 1:10]
    for i in 1:9
        xs[i+1] .= double(xs[i])
        # Do some more stuff
    end
    xs
end

I was wondering whether I could accelerate this using static arrays. A naive way would be to simply replace

    xs = [ones(10) for i in 1:10]

with

    xs = [SA[ones(10)] for i in 1:10]

and drop the broadcasting (since static vectors are immutable). The problem is that since the vectors live on the heap, this offers only mild speed gains. The proper (yet still naive) way to take advantage of static arrays would be:

function doubles_using_SA()
    x1 = SA[ones(10)]
    # Do some more stuff
    x2 = double(x1)
    # Do some more stuff
    x3 = double(x2)
    # Do some more stuff
    x4 = double(x3)
    # Do some more stuff
    x5 = double(x4)
    # Do some more stuff
    x6 = double(x5)
    # Do some more stuff
    x7 = double(x6)
    # Do some more stuff
    x8 = double(x7)
    # Do some more stuff
    x9 = double(x8)
    # Do some more stuff
    x10 = double(x9)
    # Do some more stuff
    return x1, x2, x3, x4, x5, x6, x7, x8, x9, x10
end

Obviously, this gets very long, especially given that # Do some more stuff itself contains a few lines. My question is: is there a way to write doubles_using_SA() using a loop, where x1,…,x10 live on the stack, but can be referred to using i and i+1? I imagine this could be done using metaprogramming, but I’ve never used it and given what is said about it, I question whether it is worth the effort and complexity (and could itself add significant computation time).

Yup. The following is about 7x faster for me:

function doubles_using_sarrays()
    xs = [@SVector(ones(10)) for i in 1:10]
    for i in 1:9
        xs[i+1] = double(xs[i])
        # Do some more stuff
    end
    xs
end

Key points:

  1. Don’t call ones(10) and then convert to a StaticArray. Not only does this first heap-allocate a ones(10) array before converting, but it is also type-unstable. The key to efficient use of StaticArrays is to let the compiler know the length of the array. The @SVector macro rewrites ones(10) into a call to StaticArrays.ones(SVector{10}), which puts the length 10 into the type. This is a common confusion; see e.g. Optimizing function with vector reassignments - #8 by stevengj
  2. Since SVector is immutable, use = instead of .=.
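To make point 1 concrete, here is a quick check of the types involved (a sketch; the exact type printing may vary by StaticArrays version):

```julia
using StaticArrays

# SA[ones(10)] first heap-allocates a Vector{Float64}, then wraps it as the
# single element of a 1-element SVector — the data still lives on the heap:
a = SA[ones(10)]
typeof(a)   # SVector{1, Vector{Float64}}

# @SVector puts the length in the type, yielding a plain-bits value that the
# compiler can keep on the stack:
b = @SVector ones(10)
typeof(b)   # SVector{10, Float64}
isbits(b)   # true
```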

(The only remaining heap allocation is the container xs. You could use a static array for that as well if you want, but as the size of the object increases, the advantage of static arrays shrinks. And, again, you only get an advantage if you know the length at compile time. Does it really make sense in your application to make the length a constant?)

Putting aside what happens in # Do some more stuff for now: things tend to live on the heap so that references can share them and observe their changes. If you don’t need that, you could work with StaticArrays.jl’s immutables and reinstantiate instead of mutating, likely most easily via Accessors.jl. The thing is, I haven’t really tried this for any element bigger than a handful of scalars, so I don’t know whether it can be compiled down to fiddling with a few bits in a modest stack-allocated array before copying on return. I tend to reach for preallocated arrays in addition to smaller elements.

Update: I haven’t bothered to dig into the compiled code, but the performance hit suggests that the compiler doesn’t make this as efficient as an allocated vector. At least it shaves the allocations off. doubles_using_sarrays also seems a tad faster on v1.10.5 and with only 1 allocation (no separate Memory from the Vector), and on v1.11.1 it is sporadically compiled with 20 allocations instead of 2. A recent thread has pointed out this odd allocation.

julia> function doubles_using_sarrays2()
           xs = @SVector([@SVector(ones(10)) for i in 1:10])
           for i in 1:9
               @reset xs[i+1] = double(xs[i])
               # Do some more stuff
           end
           xs
       end
doubles_using_sarrays2 (generic function with 1 method)

julia> doubles_using_sarrays() == doubles_using_sarrays2()
true

julia> @benchmark doubles_using_sarrays()
BenchmarkTools.Trial: 10000 samples with 958 evaluations.
 Range (min … max):   94.676 ns …   1.385 ΞΌs  β”Š GC (min … max):  0.00% … 75.39%
 Time  (median):     100.522 ns               β”Š GC (median):     0.00%
 Time  (mean Β± Οƒ):   146.475 ns Β± 118.044 ns  β”Š GC (mean Β± Οƒ):  14.76% Β± 16.52%

  β–ˆβ–„β–ƒβ–ƒβ–‚β–ƒβ–‚β–‚β–‚β–‚β–„β–ƒβ–‚β–               ▁▁                               ▁
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–‡β–ˆβ–‡β–‡β–…β–†β–…β–„β–†β–…β–…β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–‡β–†β–…β–†β–…β–…β–…β–…β–…β–„β–β–„β–…β–…β–„β–„β–„β–ƒβ–„β–„β–„β–…β–…β–„β–… β–ˆ
  94.7 ns       Histogram: log(frequency) by time        731 ns <

 Memory estimate: 928 bytes, allocs estimate: 2.

julia> @benchmark doubles_using_sarrays2()
BenchmarkTools.Trial: 10000 samples with 346 evaluations.
 Range (min … max):  261.561 ns …   1.692 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     274.277 ns               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   332.281 ns Β± 136.370 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆβ–‡β–‚β–†β–‚β–‚β–‚β–β–  ▁▂▁             ▁▂  β–‚  ▁                           ▁
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–‡β–‡β–‡β–ˆβ–ˆβ–ˆβ–‡β–ˆβ–‡β–ˆβ–ˆβ–‡β–ˆβ–ˆβ–‡β–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–†β–‡β–†β–‡β–‡β–†β–†β–…β–…β–†β–…β–„β–„β–…β–„β–„β–…β–…β–ƒβ–„β–„β–… β–ˆ
  262 ns        Histogram: log(frequency) by time        881 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Thanks for the tip. Yes, I wasn’t too careful. If I modify my doubles_using_SA() the same way, I still get a ~4× speed gain compared to doubles_using_sarrays() though (even when making xs an MVector).

Thanks. As I understand it, @reset creates a modified copy of the object, so it copies the whole xs every time.

For now, I think a feasible solution for me could be to change my algorithm in such a way that I don’t need to keep all arrays:

    x_old = @SVector(ones(10))
    for i in 1:9
        x_new = double(x_old)
        # Accumulate some more stuff using x_new
        x_old = x_new
    end

but I could imagine ways in which this wouldn’t be possible.
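For reference, that pattern wrapped into a self-contained function (the sum call is a hypothetical stand-in for the accumulation; everything is only rebound, never mutated, so it stays on the stack):

```julia
using StaticArrays

double(x) = 2x

function accumulate_doubles()
    acc = 0.0
    x_old = @SVector ones(10)
    for i in 1:9
        x_new = double(x_old)   # rebinds a fresh immutable SVector; no mutation
        acc += sum(x_new)       # hypothetical accumulation replacing the real work
        x_old = x_new
    end
    return acc
end
```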

Semantically it’s instantiating another value. Whether that involves a wholesale copy or a small in-place change is up to the compiler and what it can prove about your code. Hypothetically, a compiler could compile @reset massivestruct.x += 1 down to the same efficient change on the stack as x += 1. StaticArrays.jl does say that its current implementation stresses the compiler and becomes slower than heap-allocated Arrays at larger sizes, the pre-1.11 rule of thumb being 100 elements, as in your example. The silver lining is that there’s room for improvement.

It’s also worth pointing out that returning xs is a problem even with the most efficient implementation on the stack, because the whole array must be copied before the stack frame is destroyed, whereas a heap-allocated array only needs its pointer to be copied around. 800 bytes isn’t too bad once, but occupying multiple spots in the call stack, or scaling to larger lengths, is a real memory hog where there is much less memory (relative to the heap) to go around. For a similar reason to variable-length arrays on the stack going the way of the dinosaurs, Julia is designed to lean toward easily scalable collections on the heap and immutable individual values on the stack, on the assumption that it’s unusual to manually build a massive composite.

Before going to StaticArrays, could you try using regular Arrays, but with proper dotting?

xs[i+1] .= double.(xs[i])

To help clarify: that extra dot turns the line into a single in-place broadcast kernel, which saves the allocations done inside the loop by the scalar-array multiplication in double(xs[i]). But that only trims the count from 40 (2 per Array now) to the 22 done before the loop in the example.
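In other words (a minimal sketch of the difference, using the scalar double from the original post):

```julia
double(x) = 2x

xs = [ones(10) for i in 1:10]

# without the extra dot: double(xs[1]) first builds a temporary heap array
# (2 * xs[1]), whose contents are then copied into xs[2]
xs[2] .= double(xs[1])

# with the extra dot: double.() applies the scalar method elementwise inside
# one fused, in-place broadcast kernel — no temporary array
xs[3] .= double.(xs[2])
```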

Thanks. Yes of course. My algorithm (with vectors) modifies in place:

function double!(x_out, x_in)
    x_out .= 2x_in
end

I was just trying to come up with the simplest self-contained illustration of my question.

Why not use ArraysOfArrays.jl?

You have the same problem here: the right-hand side 2x_in allocates an intermediate array. Try

x_out .= 2 .* x_in

instead.
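Equivalently, the @. macro dots every call and the assignment on the line, which makes the fusion harder to get wrong (a sketch):

```julia
function double!(x_out, x_in)
    @. x_out = 2x_in   # expands to x_out .= 2 .* x_in — fully fused, no temporary
    return x_out
end
```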

I actually don’t think this was nitpicking, btw, your MWE had a fatal flaw that undermined the entire example.

MWEs are, contrary to common opinion, very hard to make, and can be a lot of work.

Thanks for the suggestion. I didn’t know this package. From what I understand, I think it creates normal arrays with views, no matter whether I create them as SArrays or not. But maybe you have something different in mind.

julia> using ArraysOfArrays, StaticArrays;

julia> VA = VectorOfArrays{Float64, 2}();

julia> push!(VA, ones(2, 3));

julia> push!(VA, @SArray(ones(2, 3)));

julia> typeof(VA[1])
Base.ReshapedArray{Float64, 2, SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, Tuple{}}

julia> typeof(VA[2])
Base.ReshapedArray{Float64, 2, SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, Tuple{}}

Correct, but it’s another way to avoid pointer chasing for the inner arrays.
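For context, the pointer-chasing point refers to the fact that a VectorOfArrays stores all of its elements in one contiguous backing vector, which flatview exposes (a sketch reusing the REPL example above; behavior may differ across ArraysOfArrays.jl versions):

```julia
using ArraysOfArrays

VA = VectorOfArrays{Float64, 2}()
push!(VA, ones(2, 3))
push!(VA, 2 .* ones(2, 3))

# the elements of both inner arrays live back-to-back in one flat vector,
# so iterating over them touches contiguous memory instead of chasing pointers
fv = flatview(VA)
length(fv)   # 12 = 2 arrays × 6 elements
```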
