I have the following loop over arrays (the function double() will obviously be more complicated, but the number of arrays is always the same)
double(x) = 2x

function doubles_using_arrays()
    xs = [ones(10) for i in 1:10]
    for i in 1:9
        xs[i+1] .= double(xs[i])
        # Do some more stuff
    end
    xs
end
I was wondering whether I could accelerate this using static arrays. A naive way would be to simply replace
xs = [ones(10) for i in 1:10]
with
xs = [SA[ones(10)...] for i in 1:10]
and not broadcast. The problem is that since the vectors live on the heap, this offers only mild speed gains. The proper (yet still naive) way to take advantage of static arrays would be:
function doubles_using_SA()
    x1 = SA[ones(10)...]
    # Do some more stuff
    x2 = double(x1)
    # Do some more stuff
    x3 = double(x2)
    # Do some more stuff
    x4 = double(x3)
    # Do some more stuff
    x5 = double(x4)
    # Do some more stuff
    x6 = double(x5)
    # Do some more stuff
    x7 = double(x6)
    # Do some more stuff
    x8 = double(x7)
    # Do some more stuff
    x9 = double(x8)
    # Do some more stuff
    x10 = double(x9)
    # Do some more stuff
    return x1, x2, x3, x4, x5, x6, x7, x8, x9, x10
end
Obviously, this gets very long, especially given that # Do some more stuff itself contains a few lines. My question is: is there a way to write doubles_using_SA() using a loop, where x1, …, x10 live on the stack but can still be referred to via i and i+1? I imagine this could be done using metaprogramming, but I've never used it, and given what is said about it I question whether it is worth the effort and complexity (and whether it could itself add significant computation time). A sketch of the kind of unrolling I have in mind follows.
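For reference, Base.Cartesian.@nexprs already generates exactly this kind of numbered-variable unrolling at parse time; a minimal, untested sketch (doubles_unrolled is a hypothetical name, and double is the same function as above):

using StaticArrays
using Base.Cartesian: @nexprs, @ntuple

double(x) = 2x

function doubles_unrolled()
    x_1 = @SVector ones(10)
    # Expands at parse time into x_2 = double(x_1); ...; x_10 = double(x_9),
    # so every x_i is an ordinary local binding the compiler can keep on the stack.
    @nexprs 9 i -> begin
        x_{i + 1} = double(x_{i})
        # Do some more stuff
    end
    return @ntuple 10 i -> x_i   # (x_1, x_2, ..., x_10)
end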
function doubles_using_sarrays()
    xs = [@SVector(ones(10)) for i in 1:10]
    for i in 1:9
        xs[i+1] = double(xs[i])
        # Do some more stuff
    end
    xs
end
Key points:
Don't call ones(10) and then convert to a StaticArray. Not only does this first heap-allocate a ones(10) array before converting, but it is also type-unstable. The key to efficient use of StaticArrays is to let the compiler know the length of the array. The @SVector macro rewrites the ones(10) into a call to StaticArrays.ones(SVector{10}), which puts the length 10 in the type (see the snippet after these points). This is a common confusion, see e.g. Optimizing function with vector reassignments - #8 by stevengj
Since SVector is immutable, use = instead of .=.
(The only remaining heap allocation is the container xs. You could use a static array for that as well if you want, but as the size of the object increases, the advantage of static arrays shrinks. And, again, you only get an advantage if you know the length at compile time. Does it really make sense in your application to make the length a constant?)
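To illustrate the first point, compare the two constructions (the variable names here are just for this example):

using StaticArrays

v_bad  = SVector(ones(10)...)   # heap-allocates ones(10) first, and the splat
                                # hides the length from the compiler: type-unstable
v_good = @SVector ones(10)      # the length 10 lives in the type SVector{10, Float64},
                                # so no intermediate Vector allocation and full type stability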
Putting aside what happens in # Do some more stuff for now: things tend to live on the heap so that references can share them and their changes. If you don't need that, I think you could work with StaticArrays.jl's immutables and reinstantiate instead of mutate, likely most easily via Accessors.jl. The thing is, I haven't really tried doing this for any element bigger than a handful of scalars, so I don't know whether it can be compiled down to fiddling with some bits in a modest stack-allocated array before copying on return. I tend to go for preallocated arrays in addition to smaller elements.
Update: I haven't bothered to dig into the compiled code, but the performance hit suggests that the compiler doesn't make it as efficient as an allocated vector. At least it shaves the allocations off. doubles_using_sarrays also seems a tad faster on v1.10.5 and with only 1 allocation (no separate Memory from the Vector), and on v1.11.1 it is sporadically compiled with 20 allocations instead of 2. A recent thread has pointed out this odd allocation.
julia> using StaticArrays, Accessors, BenchmarkTools

julia> function doubles_using_sarrays2()
           xs = @SVector([@SVector(ones(10)) for i in 1:10])
           for i in 1:9
               @reset xs[i+1] = double(xs[i])
               # Do some more stuff
           end
           xs
       end
doubles_using_sarrays2 (generic function with 1 method)
julia> doubles_using_sarrays() == doubles_using_sarrays2()
true
julia> @benchmark doubles_using_sarrays()
BenchmarkTools.Trial: 10000 samples with 958 evaluations.
 Range (min … max):   94.676 ns … 1.385 μs   ┊ GC (min … max):  0.00% … 75.39%
 Time  (median):     100.522 ns              ┊ GC (median):     0.00%
 Time  (mean ± σ):   146.475 ns ± 118.044 ns ┊ GC (mean ± σ):  14.76% ± 16.52%
 [histogram omitted]
  94.7 ns         Histogram: log(frequency) by time          731 ns <
Memory estimate: 928 bytes, allocs estimate: 2.
julia> @benchmark doubles_using_sarrays2()
BenchmarkTools.Trial: 10000 samples with 346 evaluations.
 Range (min … max):  261.561 ns … 1.692 μs   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     274.277 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   332.281 ns ± 136.370 ns ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted]
  262 ns          Histogram: log(frequency) by time          881 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
Thanks for the tip. Yes, I wasn't too careful. If I modify my doubles_using_SA() the same way, though, I still get a ~4X speed gain compared to doubles_using_sarrays() (even when making xs an MVector).
Semantically it's instantiating another value. Whether that involves overhauls or small changes is up to the compiler figuring out what it can get away with while still pulling off your code. Hypothetically, a compiler could do @reset massivestruct.x += 1 with the same efficient change on the stack as x += 1. StaticArrays.jl does say that its current implementation stresses the compiler and gets slower than heap-allocated Arrays at larger sizes, the pre-1.11 rule of thumb being 100 elements, as in your example. The silver lining is that there's room for improvement.
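In case @reset is unfamiliar: it rebinds the variable to a freshly constructed value instead of mutating the old one. A small sketch (the Point struct is made up for illustration):

using Accessors

struct Point        # immutable, as Julia structs are by default
    x::Float64
    y::Float64
end

p = Point(1.0, 2.0)
@reset p.x = p.x + 1.0   # constructs Point(2.0, 2.0) and rebinds p to it;
                         # the original value is never mutated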
It's also worth pointing out that returning xs is a problem even with the most efficient implementation on the stack, because the array needs to be copied before the stack frame goes away, whereas a heap-allocated array only needs its pointer to be copied around. 800 bytes isn't too bad once, but occupying multiple spots in the call stack, or scaling to larger lengths, is a real memory hog where there is much less memory (relative to the heap) to go around. For a similar reason to variable-length arrays on the stack going the way of the dinosaurs, Julia is designed to lean toward easily scalable collections on the heap and immutable individual values on the stack, on the assumption that manually building a massive composite is unusual.
To help clarify: that extra dot turns the line into a single fused in-place broadcast kernel, which saves the looped allocations done by the scalar-array multiplication in double(xs[i]). But that only trims the count from 40 allocations (2 per Array now) to the 22 made before the loop in the example, as sketched below.
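In other words (assuming the dot in question is the one broadcasting double over the elements), something like:

double(x) = 2x

function doubles_fused()
    xs = [ones(10) for i in 1:10]
    for i in 1:9
        # double.(xs[i]) applies double elementwise and fuses into the .= kernel,
        # so the temporary from 2x is never materialized inside the loop
        xs[i+1] .= double.(xs[i])
        # Do some more stuff
    end
    xs
end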
Thanks for the suggestion. I didn't know this package. From what I understand, I think it creates normal arrays with views, no matter whether I create them as SArrays or not. But maybe you have something different in mind.