# Generic n-dimensional loop without allocations

#1

I’m working on some code that should iterate over a multi-dimensional space, where the dimension `dim` is provided as a parameter to the function. The simplest approach I found is to use `CartesianRange`, as in the following simple example:

``````julia
function test1(dim::Int)
    imin = fill(0, dim)
    imax = fill(0, dim)
    accumulator = 0
    for k = 1:10000
        rand!(imin, 1:5)
        rand!(imax, 1:5)
        for c in CartesianRange(CartesianIndex(imin...), CartesianIndex(imax...))
            # do something with c, e.g.
            accumulator += c.I[1] + c.I[2]
        end
    end
    return accumulator
end
``````

While this works great, each splatting of `imin...` and `imax...` incurs a memory allocation:

``````julia
@time test1(2);
  0.191940 seconds (179.32 k allocations: 5.626 MB)
``````

For the real application, this alone can exhaust my available memory, as many such multidimensional loops are required.

A very similar function with a hard-coded `dim = 2` completely avoids this problem, and is also much faster:

``````julia
function test2()
    imin = fill(0, 2)
    imax = fill(0, 2)
    accumulator = 0
    for k = 1:10000
        rand!(imin, 1:5)
        rand!(imax, 1:5)
        for c in CartesianRange(CartesianIndex(imin[1], imin[2]), CartesianIndex(imax[1], imax[2]))
            # do something with c, e.g.
            accumulator += c.I[1] + c.I[2]
        end
    end
    return accumulator
end
``````
``````julia
@time test2();
  0.001797 seconds (7 allocations: 368 bytes)
``````

Now the question: is it possible to avoid the allocations of this multi-dimensional loop without hard-coding the dimensions, and without splatting into `CartesianRange`?

Cheers!

#2

I figured out a possible solution using a generated function:

``````julia
@generated function build_iterator{N}(dim::Type{Val{N}}, imin::Vector{Int}, imax::Vector{Int})
    if dim == Type{Val{1}}
        return :(CartesianRange(CartesianIndex(imin[1]), CartesianIndex(imax[1])))
    elseif dim == Type{Val{2}}
        return :(CartesianRange(CartesianIndex(imin[1], imin[2]), CartesianIndex(imax[1], imax[2])))
    elseif dim == Type{Val{3}}
        return :(CartesianRange(CartesianIndex(imin[1], imin[2], imin[3]), CartesianIndex(imax[1], imax[2], imax[3])))
    else
        return :(CartesianRange(CartesianIndex(imin...), CartesianIndex(imax...)))
    end
end

test3(dim::Int) = test3(dim, Val{dim})
function test3{N}(dim::Int, ::Type{Val{N}})
    imin = fill(0, N)
    imax = fill(0, N)
    accumulator = 0

    for k = 1:10000
        rand!(imin, 1:5)
        rand!(imax, 1:5)
        for c in build_iterator(Val{N}, imin, imax)
            # do something with c, e.g.
            accumulator += c.I[1] + c.I[2]
        end
    end
    accumulator
end
``````

Now this is just as fast as the hard-coded version:

``````julia
@time test3(2)
  0.001346 seconds (7 allocations: 368 bytes)
``````

Please do comment if you know of a simpler or more elegant solution!

#3

You could probably also use StaticArrays, which are essentially tuples, to avoid allocating the arrays; then you also avoid generated functions.
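To make the suggestion concrete, here is a minimal sketch (assuming the `StaticArrays` package is installed): an `SVector` is a thin wrapper around a tuple, so its length is part of its type and constructing one never touches the heap.

``````julia
using StaticArrays

# An SVector is backed by a plain tuple; its length is encoded in the type,
# so the compiler knows the dimension at compile time.
v = SVector(1, 2, 3)
@assert v.data === (1, 2, 3)  # the backing tuple
@assert isbits(v)             # a bits type: stack-allocated, no heap memory
``````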

#4

For the elegance part of the question, you may avoid the `if` cases by utilizing something like this:

``````julia
ex = Expr(:tuple, :(imin[1]))
for i in 2:(dim.parameters[1].parameters[1])
    push!(ex.args, :(imin[$i]))
end
``````

#5

Interesting, thanks @cortner! Your suggestion is to use a `StaticArray` for the ranges `imin` and `imax`, correct? The problem I see is that a `StaticArray` cannot be mutated in place, can it? So how could one change `imin` and `imax` in the above example (the `rand!` lines) without allocating a new `StaticArray` on each cycle?

#6

Haha, that’s very cool @akis, cheers.

``````julia
@generated function build_iterator{N}(dim::Type{Val{N}}, imin::Vector{Int}, imax::Vector{Int})
    exmin = Expr(:tuple, :(imin[1]))
    exmax = Expr(:tuple, :(imax[1]))
    for i in 2:(dim.parameters[1].parameters[1])
        push!(exmin.args, :(imin[$i]))
        push!(exmax.args, :(imax[$i]))
    end
    return :(CartesianRange(CartesianIndex($exmin), CartesianIndex($exmax)))
end
``````

#7

StaticArrays provides the MVector type, which is mutable but still fixed-size, so you can use its length as part of your function signature:

``````julia
function foo{N, T}(imin::MVector{N, T})
    ...
``````

Also, it’s worth noting that constructing a new `SVector` at every loop iteration might actually be even faster, because constructing a new `SVector` does not allocate any heap memory (it just lives on the stack).
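A sketch of that pattern, using `setindex` (the non-mutating counterpart of `setindex!` provided by `StaticArrays`, which returns a fresh vector instead of writing in place):

``````julia
using StaticArrays

v = SVector(1, 2, 3)
# "Mutate" by rebuilding: setindex returns a brand-new SVector on the stack,
# leaving the original untouched and allocating no heap memory.
w = setindex(v, 10, 2)
@assert w === SVector(1, 10, 3)
@assert v === SVector(1, 2, 3)  # v is unchanged
``````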

#8

Sorry, that is what I meant: `SVector` allocation is cheap.

#9

I tried the `StaticArrays` approach, but I cannot get it to perform as well as the generated-function approach. I tried both `SVector` and `MVector`. This is the mutating version:

``````julia
using StaticArrays

function test4(dim::Int)
    imin = zeros(MVector{dim, Int})
    imax = zeros(MVector{dim, Int})
    accumulator = 0

    for k = 1:10000
        # In the actual code, imin and imax are the result of a non-trivial calculation
        rand!(imin, 1:5)
        rand!(imax, 1:5)
        for c in CartesianRange(CartesianIndex(imin.data), CartesianIndex(imax.data))
            # do something with c, e.g.
            accumulator += c.I[1] + c.I[2]
        end
    end
    accumulator
end
``````

This is the static version. I cannot avoid having some mutable variable at some point in the real code, because the ranges `imin` and `imax` are the result of a non-trivial calculation:

``````julia
using StaticArrays

function test5(dim::Int)
    imin = fill(0, dim)
    imax = fill(0, dim)
    accumulator = 0

    for k = 1:10000
        # In the actual code, imin and imax are the result of a non-trivial calculation
        rand!(imin, 1:5)
        rand!(imax, 1:5)
        smin = SVector{dim}(imin)
        smax = SVector{dim}(imax)
        for c in CartesianRange(CartesianIndex(smin.data), CartesianIndex(smax.data))
            # do something with c, e.g.
            accumulator += c.I[1] + c.I[2]
        end
    end
    accumulator
end
``````

Both of these tests perform as badly as `test1` above. I’m most probably doing something wrong here:

``````julia
@time test4(2)
  0.208932 seconds (193.47 k allocations: 6.058 MB)
``````
``````julia
@time test5(2)
  0.370437 seconds (377.44 k allocations: 16.555 MB, 1.80% gc time)
``````

#10

The `build_iterator` function can be written more simply using the macros from `Base.Cartesian`, such as `@ntuple` or `@ncall`. For example:

``````julia
@generated function build_iterator{N}(dim::Type{Val{N}}, imin, imax)
    quote
        CartesianRange(
            CartesianIndex(@ntuple $N n -> imin[n]),
            CartesianIndex(@ntuple $N n -> imax[n]),
        )
    end
end
``````

with `test_6` defined as

``````julia
function test_6{N}(dim::Type{Val{N}})
    imin = zeros(Int, N)
    imax = zeros(Int, N)
    acc = 0
    for k = 1:10_000
        rand!(imin, 1:5)
        rand!(imax, 1:5)
        for c in build_iterator(dim, imin, imax)
            acc += c.I[1] + c.I[2]
        end
    end
    return acc
end
``````

I get

``````julia
julia> @time test_6(Val{4})
  0.001851 seconds (7 allocations: 400 bytes)
228336
``````

For an approach that doesn’t use any `@generated` trickery I’d go with something like

``````julia
function test_7{N}(dim::Type{Val{N}})
    v_min = zeros(Int, N)
    v_max = zeros(Int, N)
    acc = 0
    for k = 1:10_000
        rand!(v_min, 1:5)
        rand!(v_max, 1:5)
        t_min = ntuple(n -> v_min[n], dim)
        t_max = ntuple(n -> v_max[n], dim)
        for c in CartesianRange(CartesianIndex(t_min), CartesianIndex(t_max))
            acc += c.I[1] + c.I[2]
        end
    end
    return acc
end
``````

with time

``````julia
julia> @time test_7(Val{4})
  0.002572 seconds (20.01 k allocations: 312.891 KB)
229552
``````

For the `StaticArrays` examples the main problem that I can see is that `dim` is being passed to `test4` as a value rather than a type parameter, so `imin` and `imax` won’t be type stable. You can check this with `@code_warntype test4(4)`: you’ll see red `Any`s all over the place. Changing the `test4` signature to `test4{dim}(::Type{Val{dim}})` should fix that, I think.
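In isolation, the value-to-type lifting looks like this (a sketch with a hypothetical function `f`, written in the same Julia 0.5-era syntax as the rest of the thread): the outer method pays one dynamic dispatch, and inside the inner method the dimension is a compile-time constant.

``````julia
# Lift the runtime value into the type domain once, at the boundary...
f(dim::Int) = f(Val{dim})
# ...so that N is a compile-time constant in the method that does the work.
f{N}(::Type{Val{N}}) = zeros(Int, N)
``````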

Edit

Here’s a third option that doesn’t do any unnecessary extra allocations:

``````julia
randtuple(T, t) = _randtuple(T, t...)

_randtuple(T) = ()
_randtuple(T, h, t...) = (rand(T), _randtuple(T, t...)...)

function test_8(t_min, t_max)
    acc = 0
    for k = 1:10_000
        t_min = randtuple(1:5, t_min)
        t_max = randtuple(1:5, t_max)
        for c in CartesianRange(CartesianIndex(t_min), CartesianIndex(t_max))
            acc += c.I[1] + c.I[2]
        end
    end
    return acc
end
test_8(dim) = test_8(dim, dim)
``````

It does need to be passed an initial tuple, but that could probably be abstracted away as well:

``````julia
julia> @time test_8((0,0,0,0,0,0,0,0))
  0.016166 seconds (6 allocations: 256 bytes)
733783
``````

#11

Yes!! Thanks @mike, that’s it. I should have checked the `@code_warntype` output of course.

Properly type-stable versions of the `SVector` and `MVector` approaches are now as performant as the `@generated` approach. I put them all down here as a summary:

``````julia
using Base.Cartesian

@generated function build_iterator{N}(dim::Type{Val{N}}, imin, imax)
    quote
        CartesianRange(
            CartesianIndex(@ntuple $N n -> imin[n]),
            CartesianIndex(@ntuple $N n -> imax[n]),
        )
    end
end

test3(dim::Int) = test3(Val{dim})
function test3{N}(dim::Type{Val{N}})
    imin = fill(0, N)
    imax = fill(0, N)
    accumulator = 0

    for k = 1:10000
        rand!(imin, 1:5)
        rand!(imax, 1:5)
        for c in build_iterator(Val{N}, imin, imax)
            # do something with c, e.g.
            accumulator += c.I[1] + c.I[2]
        end
    end
    accumulator
end
``````
``````julia
using StaticArrays

test4(dim::Int) = test4(Val{dim})
function test4{N}(dim::Type{Val{N}})
    imin = zeros(MVector{N, Int})
    imax = zeros(MVector{N, Int})
    accumulator = 0

    for k = 1:10_000
        rand!(imin, 1:5)
        rand!(imax, 1:5)
        for c in CartesianRange(CartesianIndex(imin.data), CartesianIndex(imax.data))
            # do something with c, e.g.
            accumulator += c.I[1] + c.I[2]
        end
    end
    accumulator
end
``````
``````julia
using StaticArrays

test5(dim::Int) = test5(Val{dim})
function test5{N}(dim::Type{Val{N}})
    imin = fill(0, N)
    imax = fill(0, N)
    accumulator = 0

    for k = 1:10000
        # In the actual code, imin and imax are the result of a non-trivial calculation
        rand!(imin, 1:5)
        rand!(imax, 1:5)
        for c in CartesianRange(
                CartesianIndex(SVector{N}(imin).data),
                CartesianIndex(SVector{N}(imax).data))
            # do something with c, e.g.
            accumulator += c.I[1] + c.I[2]
        end
    end
    accumulator
end
``````

Timing tests point to `MVector` (`test4`) as the best, although by a small margin. I also think it is the coolest one. I need to think a bit more about @mike’s `test_8` idea.

``````julia
@benchmark test3(2)
#[...]
memory estimate:  208.00 bytes
allocs estimate:  3
mean time:        1.327 ms (0.00% GC)
#[...]
``````
``````julia
@benchmark test4(2)
#[...]
memory estimate:  80.00 bytes
allocs estimate:  3
mean time:        1.201 ms (0.00% GC)
#[...]
``````
``````julia
@benchmark test5(2)
#[...]
memory estimate:  208.00 bytes
allocs estimate:  3
mean time:        1.538 ms (0.00% GC)
#[...]
``````

Edit: added `test_7` and `test_8` benchmarks for completeness.
Edit 2: I think `test_7` has a typo: `dim` should read `N` in the `ntuple` lines. Benchmark updated for the corrected version.
Edit 3: reverted edit 2 (not a typo!)

``````julia
@benchmark test_7(Val{2})
#[...]
memory estimate:  312.70 kb
allocs estimate:  20003
mean time:        1.586 ms (1.36% GC)
#[...]
``````
``````julia
@benchmark test_8((0,0))
#[...]
memory estimate:  0.00 bytes
allocs estimate:  0
mean time:        1.801 ms (0.00% GC)
#[...]
``````

#12

Well, I suppose if you’re looking for zero allocations then `test_8` does win there; pity about the slight slowdown, though. For me the timings are slightly closer between the two:

``````julia
mean time:        1.266 ms (0.00% GC) # test_8
mean time:        1.043 ms (0.00% GC) # test4
``````

perhaps due to Julia version or hardware differences.

It wasn’t a typo, though it does look like one at first. `ntuple` can take either an `Int` or a `Val{N}` where `N` is some integer. One is inferrable while the other isn’t, hence the huge slowdown in the edited benchmark. You can see the difference in behaviour with `@code_warntype`:

``````julia
julia> @code_warntype ntuple(x -> 1, Val{3}) # good
...

julia> @code_warntype ntuple(x -> 1, 3) # bad
...
``````

#13

Indeed.
Regarding the `ntuple` thing, got it! Thanks.

#14

Another non-allocating option, close to the `test_8` that @mike proposed, but using `Val{N}`, would be:

``````julia
function test_9{N}(::Type{Val{N}})
    acc = 0
    for k = 1:10_000
        t_min = ntuple(_ -> rand(1:5), Val{N})
        t_max = ntuple(_ -> rand(1:5), Val{N})
        for c in CartesianRange(CartesianIndex(t_min), CartesianIndex(t_max))
            acc += c.I[1] + c.I[2]
        end
    end
    return acc
end
``````

#15

Hey, wow, that’s nice!

``````julia
@benchmark test_9(Val{2})
memory estimate:  0.00 bytes
allocs estimate:  0
mean time:        1.809 ms (0.00% GC)
``````

Lots to ruminate on here for me… I still need to understand whether this `test_9` idea can actually be applied to my real-use code, where the `rand` becomes something more complicated. It would also be nice to understand why, despite making zero allocations, this is still slower than the `MVector` way… Thanks a lot to everyone for the great ideas, by the way!

#16

Yeah, @pabloferz’s `ntuple` is good too; you could also use `map(_ -> rand(1:5), t_min)` to get pretty much the same effect as `randtuple` in `test_8`. All those approaches probably compile down to quite similar code in the end, I’d suspect.
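For reference, the `map`-over-a-tuple version would look something like this (a sketch; the tuple stands in for `t_min`):

``````julia
t = (0, 0)
# map over a tuple builds a new tuple of the same length; tuples are
# immutable and stack-allocated, so no heap allocation happens here.
t = map(_ -> rand(1:5), t)
@assert length(t) == 2
@assert all(i -> 1 <= i <= 5, t)
``````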

From what I can tell from the output of `@profile` for both calls (`test_8` and `test4`), it seems that the `MVector` approach takes a different code path in `random.jl`, one that uses `setindex!` on the vector rather than “rebuilding” a completely new tuple each time with `ntuple` and `_ntuple`. Tuples are immutable, hence the difference in how the new set of random integers gets created. It seems that `StaticArrays` does use a tuple as the backing data behind `MVector`, but then mutates it in place through a pointer when changing elements.

That’s just my guess from looking into the `StaticArrays` code though. Perhaps @andyferris will have some further insight into the differences.