Generic n-dimensional loop without allocations

I’m working on some code that should iterate over a multi-dimensional space, where the dimension dim is provided as a parameter to the function. The simplest approach I found is to use CartesianRange, as in the following simple example:

function test1(dim::Int)
  imin=fill(0, dim)
  imax=fill(0, dim)
  accumulator = 0
  for k = 1:10000
    rand!(imin, 1:5)
    rand!(imax, 1:5)
    for c in CartesianRange(CartesianIndex(imin...), CartesianIndex(imax...))
      # do something with c, e.g.
      accumulator += c.I[1]+c.I[2]
    end
  end
  return accumulator
end

While this works great, each splatting of imin… and imax… incurs a memory allocation:

@time test1(2);
  0.191940 seconds (179.32 k allocations: 5.626 MB)

For the real application, this alone can exhaust my available memory, as many such multidimensional loops are required.
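(As far as I can tell, the root cause is that the length of a Vector is not part of its type, so the tuple built by splatting cannot be sized at compile time and ends up on the heap. A minimal sketch of the difference, with made-up helper names:)

```julia
# Splatting a Vector: the tuple length is unknown at compile time,
# so the intermediate tuple is built dynamically on every call.
from_vector(v::Vector{Int}) = (v...,)

# Splatting a tuple: the length is encoded in the type, so this is
# fully inferable and allocation-free.
from_tuple(t::NTuple{2,Int}) = (t...,)

from_vector([3, 4])  # (3, 4)
from_tuple((3, 4))   # (3, 4)
```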

A very similar function with “hard-coded” dim=2 completely avoids this problem, and is also much faster:

function test2()
  imin=fill(0, 2)
  imax=fill(0, 2)
  accumulator = 0
  for k = 1:10000
    rand!(imin, 1:5)
    rand!(imax, 1:5)
    for c in CartesianRange(CartesianIndex(imin[1],imin[2]), CartesianIndex(imax[1],imax[2]))
      # do something with c, e.g.
      accumulator += c.I[1]+c.I[2]
    end
  end
  return accumulator
end
@time test2();
  0.001797 seconds (7 allocations: 368 bytes)

Now the question: is it possible to avoid the allocations of this multidimensional loop without hard-coding the dimensions, and without the splatting into CartesianRange?

Cheers!

I figured out a possible solution using a generated function:

@generated function build_iterator{N}(dim::Type{Val{N}}, imin::Vector{Int}, imax::Vector{Int})
  if dim==Type{Val{1}}
    return :(CartesianRange(CartesianIndex(imin[1]), CartesianIndex(imax[1])))
  elseif dim==Type{Val{2}}
    return :(CartesianRange(CartesianIndex(imin[1],imin[2]), CartesianIndex(imax[1],imax[2])))
  elseif dim==Type{Val{3}}
    return :(CartesianRange(CartesianIndex(imin[1],imin[2], imin[3]), CartesianIndex(imax[1],imax[2], imax[3])))
  else
    return :(CartesianRange(CartesianIndex(imin...), CartesianIndex(imax...)))
  end
end

test3(dim::Int) = test3(dim, Val{dim})
function test3{N}(dim::Int, ::Type{Val{N}})
  imin=fill(0, N)
  imax=fill(0, N)
  accumulator = 0

  for k = 1:10000
    rand!(imin, 1:5)
    rand!(imax, 1:5)
    for c in build_iterator(Val{N}, imin, imax)
      # do something with c, e.g.
      accumulator += c.I[1]+c.I[2]
    end
  end
  accumulator
end

Now this is just as fast as the hard-coded version:

@time test3(2)
  0.001346 seconds (7 allocations: 368 bytes)

Please do comment if you know of a simpler or more elegant solution!


You could probably also use StaticArrays, which are essentially tuples, to avoid allocating the arrays; then you also avoid generated functions.

For the elegance part of the question, you can avoid the if branches with something like this:

ex = Expr(:tuple, :(imin[1]))
for i in 2:(dim.parameters[1].parameters[1])
  push!(ex.args, :(imin[$i]))
end

Interesting, thanks @cortner! Your suggestion is to use a StaticArray for the ranges imin and imax, correct? The problem I see is that a StaticArray cannot be mutated in place, can it? So how could one change imin and imax in the above example (the rand! lines) without allocating a new StaticArray in each cycle?

Haha, that’s very cool @akis, cheers.

@generated function build_iterator{N}(dim::Type{Val{N}}, imin::Vector{Int}, imax::Vector{Int})
  exmin = Expr(:tuple, :(imin[1]))
  exmax = Expr(:tuple, :(imax[1]))
  for i in 2:N
    push!(exmin.args, :(imin[$i]))
    push!(exmax.args, :(imax[$i]))
  end
  return :(CartesianRange(CartesianIndex($exmin), CartesianIndex($exmax)))
end
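One way to sanity-check this kind of expression building outside a @generated function is to construct the tuple Expr explicitly and inspect it; a sketch for N = 3 (note it must stay a flat tuple expression, not a nested one):

```julia
# Build :((imin[1], imin[2], imin[3])) as one flat tuple expression.
N = 3
exmin = Expr(:tuple, [:(imin[$i]) for i in 1:N]...)

exmin == :((imin[1], imin[2], imin[3]))  # true
```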

StaticArrays provides the MVector type, which is mutable but still fixed-size, so you can use its length as part of your function signature:

function foo{N, T}(imin::MVector{N, T})
...

Also, it’s worth noting that constructing a new SVector at every loop iteration might actually be even faster, because an SVector does not allocate any heap memory (it just lives on the stack).


Sorry, that is what I meant: SVector allocation is cheap.

I tried the StaticArrays approach, but I cannot get it to perform as well as the generated-function approach. I tried both SVector and MVector. This is the mutating version:

using StaticArrays

function test4(dim::Int)
  imin=zeros(MVector{dim, Int})
  imax=zeros(MVector{dim, Int})
  accumulator = 0

  for k = 1:10000
    # In actual code, imin and imax are the result of a non-trivial calculation
    rand!(imin, 1:5)
    rand!(imax, 1:5)
    for c in CartesianRange(CartesianIndex(imin.data), CartesianIndex(imax.data))
      # do something with c, e.g.
      accumulator += c.I[1]+c.I[2]
    end
  end
  accumulator
end

This is the static version. I cannot get around the need for some mutable variable at some point in the real code, because the ranges imin and imax are the result of a non-trivial calculation:

using StaticArrays

function test5(dim::Int)
  imin=fill(0, dim)
  imax=fill(0, dim)
  accumulator = 0

  for k = 1:10000
    # In actual code, imin and imax are the result of a non-trivial calculation
    rand!(imin, 1:5)
    rand!(imax, 1:5)
    smin = SVector{dim}(imin)
    smax = SVector{dim}(imax)
    for c in CartesianRange(CartesianIndex(smin.data), CartesianIndex(smax.data))
      # do something with c, e.g.
      accumulator += c.I[1]+c.I[2]
    end
  end
  accumulator
end

Both these tests perform as badly as test1 above. I’m most probably doing something wrong here :frowning:

@time test4(2)
  0.208932 seconds (193.47 k allocations: 6.058 MB)
@time test5(2)
  0.370437 seconds (377.44 k allocations: 16.555 MB, 1.80% gc time)

The build_iterator function can be written more simply using the macros from Base.Cartesian, such as @ntuple or @ncall. For example:

using Base.Cartesian

@generated function build_iterator{N}(dim::Type{Val{N}}, imin, imax)
    quote
        CartesianRange(
            CartesianIndex(@ntuple $N n -> imin[n]),
            CartesianIndex(@ntuple $N n -> imax[n]),
        )
    end
end

with test_6 defined as

function test_6{N}(dim::Type{Val{N}})
    imin = zeros(Int, N)
    imax = zeros(Int, N)
    acc = 0
    for k = 1:10_000
        rand!(imin, 1:5)
        rand!(imax, 1:5)
        for c in build_iterator(dim, imin, imax)
            acc += c.I[1] + c.I[2]
        end
    end
    return acc
end

I get

julia> @time test_6(Val{4})
  0.001851 seconds (7 allocations: 400 bytes)
228336
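For reference, @ntuple simply expands to a literal tuple of the given length, which is why everything stays inferable; a quick check with a throwaway vector:

```julia
using Base.Cartesian

# @ntuple 3 n -> imin[n] expands to (imin[1], imin[2], imin[3])
imin = [10, 20, 30]
t = @ntuple 3 n -> imin[n]

t  # (10, 20, 30)
```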

For an approach that doesn’t use any @generated trickery I’d go with something like:

function test_7{N}(dim::Type{Val{N}})
    v_min = zeros(Int, N)
    v_max = zeros(Int, N)
    acc = 0
    for k = 1:10_000
        rand!(v_min, 1:5)
        rand!(v_max, 1:5)
        t_min = ntuple(n -> v_min[n], dim)
        t_max = ntuple(n -> v_max[n], dim)
        for c in CartesianRange(CartesianIndex(t_min), CartesianIndex(t_max))
            acc += c.I[1] + c.I[2]
        end
    end
    return acc
end

with time


julia> @time test_7(Val{4})
  0.002572 seconds (20.01 k allocations: 312.891 KB)
229552

For the StaticArrays examples the main problem that I can see is that dim is being passed to test4 as a value rather than a type parameter, so imin and imax won’t be type stable. You can check this with @code_warntype test4(4) and you’ll see a lot of red Anys all over the place. Changing the test4 signature to test4{dim}(::Type{Val{dim}}) should fix that, I think.
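The general pattern here, sketched with where-syntax (the later spelling of this thread’s foo{N}(::Type{Val{N}}) form) and a hypothetical name:

```julia
# Lift a runtime Int into the type domain at a function barrier:
# one dynamic dispatch happens here, and then N is a compile-time
# constant inside the second method.
make_bounds(dim::Int) = make_bounds(Val{dim})
make_bounds(::Type{Val{N}}) where {N} = (zeros(Int, N), zeros(Int, N))

imin, imax = make_bounds(3)
length(imin)  # 3
```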

Edit

Here’s a third option that doesn’t do any unnecessary extra allocations:

randtuple(T, t) = _randtuple(T, t...)

_randtuple(T) = ()
_randtuple(T, h, t...) = (rand(T), _randtuple(T, t...)...)

function test_8(t_min, t_max)
    acc = 0
    for k = 1:10_000
        t_min = randtuple(1:5, t_min)
        t_max = randtuple(1:5, t_max)
        for c in CartesianRange(CartesianIndex(t_min), CartesianIndex(t_max))
            acc += c.I[1] + c.I[2]
        end
    end
    return acc
end
test_8(dim) = test_8(dim, dim)

It does need to be passed an initial tuple, but that could also probably be abstracted away as well:

julia> @time test_8((0,0,0,0,0,0,0,0))
  0.016166 seconds (6 allocations: 256 bytes)
733783
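The trick in randtuple is the classic tuple recursion: peel off the head, recurse on the tail, and splice the results back together, so the whole thing unrolls at compile time. The same pattern with a deterministic function (a hypothetical maptuple, for illustration only):

```julia
# Base case: an empty argument list yields the empty tuple.
maptuple(f) = ()
# Recursive case: apply f to the head, recurse on the tail, splice together.
maptuple(f, h, t...) = (f(h), maptuple(f, t...)...)

maptuple(x -> x + 1, 1, 2, 3)  # (2, 3, 4)
```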

Yes!! Thanks @mike, that’s it. I should have checked the @code_warntype output of course.

Properly type-stable versions of the SVector and MVector approaches are now as performant as the @generated approach. I put them all down here as a summary

using Base.Cartesian

@generated function build_iterator{N}(dim::Type{Val{N}}, imin, imax)
    quote
        CartesianRange(
            CartesianIndex(@ntuple $N n -> imin[n]),
            CartesianIndex(@ntuple $N n -> imax[n]),
        )
    end
end

test3(dim::Int) = test3(Val{dim})
function test3{N}(dim::Type{Val{N}})
  imin=fill(0, N)
  imax=fill(0, N)
  accumulator = 0

  for k = 1:10000
    rand!(imin, 1:5)
    rand!(imax, 1:5)
    for c in build_iterator(Val{N}, imin, imax)
      # do something with c, e.g.
      accumulator += c.I[1]+c.I[2]
    end
  end
  accumulator
end
using StaticArrays

test4(dim::Int) = test4(Val{dim})
function test4{N}(dim::Type{Val{N}})
  imin=zeros(MVector{N, Int})
  imax=zeros(MVector{N, Int})
  accumulator = 0

  for k = 1:10_000
    rand!(imin, 1:5)
    rand!(imax, 1:5)
    for c in CartesianRange(CartesianIndex(imin.data), CartesianIndex(imax.data))
      # do something with c, e.g.
      accumulator += c.I[1]+c.I[2]
    end
  end
  accumulator
end
using StaticArrays

test5(dim::Int) = test5(Val{dim})
function test5{N}(dim::Type{Val{N}})
  imin=fill(0, N)
  imax=fill(0, N)  
  accumulator = 0

  for k = 1:10000
    # In actual code, imin and imax are the result of a non-trivial calculation
    rand!(imin, 1:5)
    rand!(imax, 1:5)
    for c in CartesianRange(
      CartesianIndex(SVector{N}(imin).data), 
      CartesianIndex(SVector{N}(imax).data))
      # do something with c, e.g.
      accumulator += c.I[1]+c.I[2]
    end
  end
  accumulator
end

Timing tests point to MVector (test4) as the best, although by a small margin. I also think it is the coolest one. I need to think a bit more about @mike’s test_8 idea.

@benchmark test3(2)
#[...]
  memory estimate:  208.00 bytes
  allocs estimate:  3
  mean time:        1.327 ms (0.00% GC)
#[...]
@benchmark test4(2)
#[...]
  memory estimate:  80.00 bytes
  allocs estimate:  3
  mean time:        1.201 ms (0.00% GC)
#[...]
@benchmark test5(2)
#[...]
  memory estimate:  208.00 bytes
  allocs estimate:  3
  mean time:        1.538 ms (0.00% GC)
#[...]

Edit: added test_7 and test_8 benchmarks for completeness
Edit 2: test_7 has a typo I think. dim should read N in the ntuple lines. Benchmark for the corrected version.
Edit 3: reverted edit2 (not a typo!)

@benchmark test_7(Val{2})
#[...]
  memory estimate:  312.70 kb
  allocs estimate:  20003
  mean time:        1.586 ms (1.36% GC)
#[...]
@benchmark test_8((0,0))
#[...]
  memory estimate:  0.00 bytes
  allocs estimate:  0
  mean time:        1.801 ms (0.00% GC)
#[...]

Well I suppose if you’re looking for zero allocations then test_8 does win there :smile:, pity about the slight slowdown though. For me the timings are slightly closer between the two:

mean time:        1.266 ms (0.00% GC) # test_8
mean time:        1.043 ms (0.00% GC) # test4

perhaps due to Julia version or hardware differences.

Wasn’t a typo, though it does look like one at first. ntuple can take either an Int or a Val{N} where N is some integer. One is inferable while the other one isn’t, hence the huge slowdown in the edited benchmark. You can see the difference in behaviour with @code_warntype:

julia> @code_warntype ntuple(x -> 1, Val{3}) # good
...

julia> @code_warntype ntuple(x -> 1, 3) # bad
...

Indeed :slight_smile:
Regarding the ntuple thing, got it! Thanks.

Another non-allocating option, close to the test_8 that @mike proposed but using Val{N}, would be:

function test_9{N}(::Type{Val{N}})
    acc = 0
    for k = 1:10_000
        t_min = ntuple(_->rand(1:5), Val{N})
        t_max = ntuple(_->rand(1:5), Val{N})
        for c in CartesianRange(CartesianIndex(t_min), CartesianIndex(t_max))
            acc += c.I[1] + c.I[2]
        end
    end
    return acc
end

Hey, wow, that’s nice!

@benchmark test_9(Val{2})
  memory estimate:  0.00 bytes
  allocs estimate:  0
  mean time:        1.809 ms (0.00% GC)

Lots to ruminate on here for me… I still need to understand whether this test_9 idea can actually be applied to my real code, where the rand becomes something more complicated. It would also be nice to understand why, despite making zero allocations, this is still slower than the MVector way… Thanks a lot to everyone for the great ideas, by the way!

Yeah, @pabloferz’s ntuple is good too, could also use map(_ -> rand(1:5), t_min) to get pretty much the same effect as randtuple in test_8. All those approaches probably compile down to quite similar code in the end I’d suspect.

From what I can tell from the output of @profile for both calls (test_8 and test4), it seems that the MVector approach makes use of a different code path in random.jl that uses setindex! on the vector rather than “rebuilding” a completely new tuple each time with ntuple and _ntuple. Tuples are immutable, hence the difference in how the new set of random integers gets created. It seems that StaticArrays does use a tuple as the backing data behind MVector, but then mutates its elements through a pointer-based setindex!.

That’s just my guess from looking into the StaticArrays code though. Perhaps @andyferris will have some further insight into the differences.