Generic n-dimensional loop without allocations

question

#1

I’m working on some code that should iterate over a multi-dimensional space, where the dimension dim is provided as a parameter to the function. The simplest approach I found is to use CartesianRange, as in the following simple example:

function test1(dim::Int)
  imin=fill(0, dim)
  imax=fill(0, dim)
  accumulator = 0
  for k = 1:10000
    rand!(imin, 1:5)
    rand!(imax, 1:5)
    for c in CartesianRange(CartesianIndex(imin...), CartesianIndex(imax...))
      # do something with c, e.g.
      accumulator += c.I[1]+c.I[2]
    end
  end
  return accumulator
end

While this works great, each splatting of imin... and imax... incurs a memory allocation:

@time test1(2);
  0.191940 seconds (179.32 k allocations: 5.626 MB)

For the real application, this alone can exhaust my available memory, as many such multidimensional loops are required.

A very similar function with “hard-coded” dim=2 completely avoids this problem, and is also much faster:

function test2()
  imin=fill(0, 2)
  imax=fill(0, 2)
  accumulator = 0
  for k = 1:10000
    rand!(imin, 1:5)
    rand!(imax, 1:5)
    for c in CartesianRange(CartesianIndex(imin[1],imin[2]), CartesianIndex(imax[1],imax[2]))
      # do something with c, e.g.
      accumulator += c.I[1]+c.I[2]
    end
  end
  return accumulator
end
@time test2();
  0.001797 seconds (7 allocations: 368 bytes)

Now the question: is it possible to avoid the allocations in this multidimensional loop without hard-coding the dimensions and without splatting into the CartesianRange?

Cheers!


#2

I figured out a possible solution using a generated function:

@generated function build_iterator{N}(dim::Type{Val{N}}, imin::Vector{Int}, imax::Vector{Int})
  if dim==Type{Val{1}}
    return :(CartesianRange(CartesianIndex(imin[1]), CartesianIndex(imax[1])))
  elseif dim==Type{Val{2}}
    return :(CartesianRange(CartesianIndex(imin[1],imin[2]), CartesianIndex(imax[1],imax[2])))
  elseif dim==Type{Val{3}}
    return :(CartesianRange(CartesianIndex(imin[1],imin[2], imin[3]), CartesianIndex(imax[1],imax[2], imax[3])))
  else
    return :(CartesianRange(CartesianIndex(imin...), CartesianIndex(imax...)))
  end
end

test3(dim::Int) = test3(dim, Val{dim})
function test3{N}(dim::Int, ::Type{Val{N}})
  imin=fill(0, N)
  imax=fill(0, N)
  accumulator = 0

  for k = 1:10000
    rand!(imin, 1:5)
    rand!(imax, 1:5)
    for c in build_iterator(Val{N}, imin, imax)
      # do something with c, e.g.
      accumulator += c.I[1]+c.I[2]
    end
  end
  accumulator
end

Now this is just as fast as the hard-coded version:

@time test3(2)
  0.001346 seconds (7 allocations: 368 bytes)

Please do comment if you know of a simpler or more elegant solution!


#3

You could probably also use StaticArrays, which are essentially tuples, to avoid allocating arrays; that way you also avoid generated functions.
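For illustration, a minimal sketch of that idea (the function name test_static is made up, and this assumes the Julia 0.6-era syntax used elsewhere in the thread): the bounds are built as SVectors, so CartesianIndex receives a plain tuple and no heap-allocated Vector is ever splatted.

```julia
using StaticArrays

function test_static{N}(::Type{Val{N}})
  accumulator = 0
  for k = 1:10000
    # fresh SVectors on each iteration; they live on the stack, not the heap
    imin = SVector(ntuple(_ -> rand(1:5), Val{N}))
    imax = SVector(ntuple(_ -> rand(1:5), Val{N}))
    for c in CartesianRange(CartesianIndex(imin.data), CartesianIndex(imax.data))
      accumulator += c.I[1] + c.I[2]
    end
  end
  return accumulator
end
```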


#4

For the elegance part of the question, you may avoid the if cases by utilizing something like this:

ex = Expr(:tuple, :(imin[1]))
for i in 2:(dim.parameters[1].parameters[1])
  push!(ex.args, :(imin[$i]))
end

#5

Interesting, thanks @cortner! Your suggestion is to use a StaticArray for the ranges imin and imax, correct? The problem I see is that StaticArrays cannot be mutated in place, can they? So how could one change imin and imax in the above example (the rand! lines) without allocating a new StaticArray on each cycle?


#6

Haha, that’s very cool @akis, cheers.

@generated function build_iterator{N}(dim::Type{Val{N}}, imin::Vector{Int}, imax::Vector{Int})
  exmin = Expr(:tuple, :(imin[1]))
  exmax = Expr(:tuple, :(imax[1]))
  for i in 2:(dim.parameters[1].parameters[1])
    push!(exmin.args, :(imin[$i]))
    push!(exmax.args, :(imax[$i]))
  end
  return :(CartesianRange(CartesianIndex($exmin), CartesianIndex($exmax)))
end

#7

StaticArrays provides the MVector type, which is mutable but still fixed-size, so you can use its length as part of your function signature:

function foo{N, T}(imin::MVector{N, T})
...

Also, it’s worth noting that constructing a new SVector at every loop iteration might actually be even faster, since constructing an SVector does not allocate any heap memory (it just lives on the stack).
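As a quick sanity check of that claim, here is a sketch (sum_svectors is a hypothetical name; Julia 0.6-era syntax assumed):

```julia
using StaticArrays

function sum_svectors(n)
  s = 0
  for k = 1:n
    # a brand-new SVector every iteration; it is stack-allocated,
    # so @time should report essentially no heap allocations
    v = SVector(rand(1:5), rand(1:5))
    s += v[1] + v[2]
  end
  return s
end
```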


#8

Sorry, that is what I meant - SVector allocation is cheap.


#9

I tried the StaticArrays approach, but I cannot get it to perform as well as the generated-function approach. I tried both SVector and MVector. This is the mutating version:

using StaticArrays

function test4(dim::Int)
  imin=zeros(MVector{dim, Int})
  imax=zeros(MVector{dim, Int})
  accumulator = 0

  for k = 1:10000
    # In actual code, imin and imax are the result of a non-trivial calculation
    rand!(imin, 1:5)
    rand!(imax, 1:5)
    for c in CartesianRange(CartesianIndex(imin.data), CartesianIndex(imax.data))
      # do something with c, e.g.
      accumulator += c.I[1]+c.I[2]
    end
  end
  accumulator
end

This is the static version. I cannot get around having some mutable variable at some point in the real code, because the ranges imin and imax are the result of a non-trivial calculation:

using StaticArrays

function test5(dim::Int)
  imin=fill(0, dim)
  imax=fill(0, dim)
  accumulator = 0

  for k = 1:10000
    # In actual code, imin and imax are the result of a non-trivial calculation
    rand!(imin, 1:5)
    rand!(imax, 1:5)
    smin = SVector{dim}(imin)
    smax = SVector{dim}(imax)
    for c in CartesianRange(CartesianIndex(smin.data), CartesianIndex(smax.data))
      # do something with c, e.g.
      accumulator += c.I[1]+c.I[2]
    end
  end
  accumulator
end

Both these tests perform as badly as test1 above. I’m most probably doing something wrong here :frowning:

@time test4(2)
  0.208932 seconds (193.47 k allocations: 6.058 MB)
@time test5(2)
  0.370437 seconds (377.44 k allocations: 16.555 MB, 1.80% gc time)

#10

The build_iterator function can be written more simply using the macros from Base.Cartesian, such as @ntuple or @ncall. For example

@generated function build_iterator{N}(dim::Type{Val{N}}, imin, imax)
    quote
        CartesianRange(
            CartesianIndex(@ntuple $N n -> imin[n]),
            CartesianIndex(@ntuple $N n -> imax[n]),
        )
    end
end

with test_6 defined as

function test_6{N}(dim::Type{Val{N}})
    imin = zeros(Int, N)
    imax = zeros(Int, N)
    acc = 0
    for k = 1:10_000
        rand!(imin, 1:5)
        rand!(imax, 1:5)
        for c in build_iterator(dim, imin, imax)
            acc += c.I[1] + c.I[2]
        end
    end
    return acc
end

I get

julia> @time test_6(Val{4})
  0.001851 seconds (7 allocations: 400 bytes)
228336

For an approach that doesn’t use any @generated trickery, I’d go with something like

function test_7{N}(dim::Type{Val{N}})
    v_min = zeros(Int, N)
    v_max = zeros(Int, N)
    acc = 0
    for k = 1:10_000
        rand!(v_min, 1:5)
        rand!(v_max, 1:5)
        t_min = ntuple(n -> v_min[n], dim)
        t_max = ntuple(n -> v_max[n], dim)
        for c in CartesianRange(CartesianIndex(t_min), CartesianIndex(t_max))
            acc += c.I[1] + c.I[2]
        end
    end
    return acc
end

with time


julia> @time test_7(Val{4})
  0.002572 seconds (20.01 k allocations: 312.891 KB)
229552

For the StaticArrays examples, the main problem that I can see is that dim is being passed to test4 as a value rather than a type parameter, so imin and imax won’t be type-stable. You can check this with @code_warntype test4(4) and you’ll see a lot of red Anys all over the place. Changing the test4 signature to test4{dim}(::Type{Val{dim}}) should fix that, I think.

Edit

Here’s a third option that doesn’t do any unnecessary allocations:

randtuple(T, t) = _randtuple(T, t...)

_randtuple(T) = ()
_randtuple(T, h, t...) = (rand(T), _randtuple(T, t...)...)

function test_8(t_min, t_max)
    acc = 0
    for k = 1:10_000
        t_min = randtuple(1:5, t_min)
        t_max = randtuple(1:5, t_max)
        for c in CartesianRange(CartesianIndex(t_min), CartesianIndex(t_max))
            acc += c.I[1] + c.I[2]
        end
    end
    return acc
end
test_8(dim) = test_8(dim, dim)

It does need to be passed an initial tuple, but that could also probably be abstracted away as well:

julia> @time test_8((0,0,0,0,0,0,0,0))
  0.016166 seconds (6 allocations: 256 bytes)
733783

#11

Yes!! Thanks @mike, that’s it. I should have checked the @code_warntype output of course.

Properly type-stable versions of the SVector and MVector approaches are now as performant as the @generated approach. I put them all down here as a summary:

using Base.Cartesian

@generated function build_iterator{N}(dim::Type{Val{N}}, imin, imax)
    quote
        CartesianRange(
            CartesianIndex(@ntuple $N n -> imin[n]),
            CartesianIndex(@ntuple $N n -> imax[n]),
        )
    end
end

test3(dim::Int) = test3(Val{dim})
function test3{N}(dim::Type{Val{N}})
  imin=fill(0, N)
  imax=fill(0, N)
  accumulator = 0

  for k = 1:10000
    rand!(imin, 1:5)
    rand!(imax, 1:5)
    for c in build_iterator(Val{N}, imin, imax)
      # do something with c, e.g.
      accumulator += c.I[1]+c.I[2]
    end
  end
  accumulator
end

using StaticArrays

test4(dim::Int) = test4(Val{dim})
function test4{N}(dim::Type{Val{N}})
  imin=zeros(MVector{N, Int})
  imax=zeros(MVector{N, Int})
  accumulator = 0

  for k = 1:10_000
    rand!(imin, 1:5)
    rand!(imax, 1:5)
    for c in CartesianRange(CartesianIndex(imin.data), CartesianIndex(imax.data))
      # do something with c, e.g.
      accumulator += c.I[1]+c.I[2]
    end
  end
  accumulator
end

using StaticArrays

test5(dim::Int) = test5(Val{dim})
function test5{N}(dim::Type{Val{N}})
  imin=fill(0, N)
  imax=fill(0, N)  
  accumulator = 0

  for k = 1:10000
    # In actual code, imin and imax are the result of a non-trivial calculation
    rand!(imin, 1:5)
    rand!(imax, 1:5)
    for c in CartesianRange(
      CartesianIndex(SVector{N}(imin).data), 
      CartesianIndex(SVector{N}(imax).data))
      # do something with c, e.g.
      accumulator += c.I[1]+c.I[2]
    end
  end
  accumulator
end

Timing tests point to MVector (test4) as the best, although by a small margin. I also think it is the coolest one. I need to think a bit more about @mike’s test_8 idea.

@benchmark test3(2)
#[...]
  memory estimate:  208.00 bytes
  allocs estimate:  3
  mean time:        1.327 ms (0.00% GC)
#[...]
@benchmark test4(2)
#[...]
  memory estimate:  80.00 bytes
  allocs estimate:  3
  mean time:        1.201 ms (0.00% GC)
#[...]
@benchmark test5(2)
#[...]
  memory estimate:  208.00 bytes
  allocs estimate:  3
  mean time:        1.538 ms (0.00% GC)
#[...]

Edit: added test_7 and test_8 benchmarks for completeness
Edit 2: test_7 has a typo I think. dim should read N in the ntuple lines. Benchmark for the corrected version.
Edit 3: reverted edit2 (not a typo!)

@benchmark test_7(Val{2})
#[...]
  memory estimate:  312.70 kb
  allocs estimate:  20003
  mean time:        1.586 ms (1.36% GC)
#[...]
@benchmark test_8((0,0))
#[...]
  memory estimate:  0.00 bytes
  allocs estimate:  0
  mean time:        1.801 ms (0.00% GC)
#[...]

#12

Well, I suppose if you’re looking for zero allocations then test_8 does win there :smile:; pity about the slight slowdown, though. For me the timings are slightly closer between the two:

mean time:        1.266 ms (0.00% GC) # test_8
mean time:        1.043 ms (0.00% GC) # test4

perhaps due to Julia version differences/hardware.

Wasn’t a typo, though it does look like one at first. ntuple can take either an Int or a Val{N} where N is some integer. One is inferrable while the other isn’t, hence the huge slowdown in the edited benchmark. You can see the difference in behaviour with @code_warntype

julia> @code_warntype ntuple(x -> 1, Val{3}) # good
...

julia> @code_warntype ntuple(x -> 1, 3) # bad
...

#13

Indeed :slight_smile:
Regarding the ntuple thing, got it! Thanks.


#14

Another non-allocating option, close to the test_8 that @mike proposed, but using Val{N}, would be

function test_9{N}(::Type{Val{N}})
    acc = 0
    for k = 1:10_000
        t_min = ntuple(_->rand(1:5), Val{N})
        t_max = ntuple(_->rand(1:5), Val{N})
        for c in CartesianRange(CartesianIndex(t_min), CartesianIndex(t_max))
            acc += c.I[1] + c.I[2]
        end
    end
    return acc
end

#15

Hey, wow, that’s nice!

@benchmark test_9(Val{2})
  memory estimate:  0.00 bytes
  allocs estimate:  0
  mean time:        1.809 ms (0.00% GC)

Lots to ruminate on here for me… I still need to understand whether this test_9 idea can actually be applied to my real code, where the rand becomes a more complicated thing. It would also be nice to understand why, despite making zero allocations, this is still slower than the MVector way… Thanks a lot to everyone for the great ideas, by the way!


#16

Yeah, @pabloferz’s ntuple is good too; you could also use map(_ -> rand(1:5), t_min) to get pretty much the same effect as randtuple in test_8. All those approaches probably compile down to quite similar code in the end, I’d suspect.
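Spelled out, that map variant might look like this (a sketch; test_10 is a made-up name, same 0.6-era syntax as test_8):

```julia
function test_10(t_min, t_max)
    acc = 0
    for k = 1:10_000
        # map over the existing tuples purely to get fresh tuples of the
        # same length; each result is a new stack-allocated tuple
        t_min = map(_ -> rand(1:5), t_min)
        t_max = map(_ -> rand(1:5), t_max)
        for c in CartesianRange(CartesianIndex(t_min), CartesianIndex(t_max))
            acc += c.I[1] + c.I[2]
        end
    end
    return acc
end
test_10(dim) = test_10(dim, dim)
```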

From what I can tell from the output of @profile for both calls (test_8 and test4), it seems that the MVector approach uses a different code path in random.jl that calls setindex! on the vector rather than “rebuilding” a completely new tuple each time with ntuple and _ntuple. Tuples are immutable, hence the difference in how the new set of random integers gets created. It seems that StaticArrays does use a tuple as the backing data behind MVector, but then uses a low-level workaround to change elements in the tuple.

That’s just my guess from looking into the StaticArrays code though. Perhaps @andyferris will have some further insight into the differences.