How to use `@generated` to eliminate a for loop?

As the title implies, I have some code that either requires hard coding a certain number of operations, or using a for loop to cycle through one of the indices. Performance-wise I have found that hard coding the operations gives the best speed, but then it is not as generic as I’d like.

I can achieve both using metaprogramming with @generated and Meta.parse, except that I’ve seen posts like this one that say to pretty much never use strings and Meta.parse. The type of code I’d like to generate with N elements in the 3rd dimension. I am not using the built-in sum because it is slower, and I would like to be able to do other operations in the function in the future.

# The function that I'm trying to recreate
my_matrix_sum(x) = @views @. x[:, :, 1] + x[:, :, 2] + x[:, :, 3]

# And here is the code I came up with to generate the correct function
@generated function str2ex(array, ch::NTuple{N}) where N
    strex = join([["array[:, :, $i]" for i = 1:N]...], " + ")
    strex = "@views @. $strex"
    return Meta.parse(strex)

function array_sum(array::Array{T, 3}) where T
    dummy = Tuple([1 for i = 1:size(array)[3]])
   return str2ex(array, dummy)

I could not find a way to pass the size of the third dimension of the array into @generated in order write the correct number of additions, so I used a giant hack to create an Tuple of the correct length N, which does get passed. So, I’m using a non-recommended way to generate the function, and to do so requires a hack, BUT it generates the same code and has identical performance. Is there a better way to do this? I figure there must be.

I’d say the most standard way to lift a value to the type domain would be to use Val

Also, there would be ways to write your @generated function so that is manipulates expressions rather than strings (which would remove the need for Meta.parse), but I don’t think it’s really necessary to use a @generated function here.

I’d for example rather use NTuples, for which the compiler can often generate very optimized code in which loops have been unrolled:

julia> x = rand(1000, 1000, 4);

julia> my_matrix_sum(x) = @views @. x[:,:,1] + x[:,:,2] + x[:,:,3] + x[:,:,4]
my_matrix_sum (generic function with 1 method)

julia> using BenchmarkTools
julia> ref = my_matrix_sum(x);
julia> @btime my_matrix_sum($x);
  1.245 ms (2 allocations: 7.63 MiB)
julia> my_matrix_sum2(x) = my_matrix_sum2(x, Val(size(x, 3)));
       function my_matrix_sum2(x, N::Val)
           t = ntuple(N) do i
               @view x[:,:,i]
my_matrix_sum2 (generic function with 2 methods)

julia> @assert ref == my_matrix_sum2(x);
julia> @btime my_matrix_sum2($x);
  1.252 ms (2 allocations: 7.63 MiB)

Well your array_sum is using a global array instead of the one that’s passed — it’s so easy to get things like that wrong with metaprogramming like this.

If you’re gonna use an @generated function, one step better is to use the old Base.Cartesian macros:

my_matrix_sum(x) = @views @. x[:, :, 1] + x[:, :, 2] + x[:, :, 3]
cartesian_sum(X) = cartesian_sum(X, Val(size(X, 3)))
@generated function cartesian_sum(X, ::Val{N}) where N
    :(@ncall $N (.+) i->view(X, :, :, i))

julia> X = rand(200,200,3);

julia> @btime my_matrix_sum($X);
  11.583 μs (2 allocations: 312.55 KiB)

julia> @btime cartesian_sum($X);
  11.916 μs (2 allocations: 312.55 KiB)

But really, I wouldn’t use generated functions here at all — and yep, @ffevotte has the better answer already as I was writing this one.

It is pretty magical that broadcasted @views is faster than sum or reduce or even simple-@simd’d loops here.


Interesting that there’s such a big difference between sum and the unrolled versions here. I’m guessing it has to do with the iteration order and cache behavior? When the summed dimension is small, such that looping over it doesn’t result in cache evictions, I guess the broadcasted + could result in a better order?

That said, you can implement what you want (in this minimal example) without any metaprogramming by simply splatting the slices into a broadcast

julia> X = rand(200,200,3);

julia> @btime sum($X;dims=3);
  52.800 μs (7 allocations: 312.67 KiB)

julia> @btime sum(eachslice($X;dims=3));
  49.300 μs (5 allocations: 625.16 KiB)

julia> @btime my_matrix_sum($X);
  25.100 μs (2 allocations: 312.55 KiB)

julia> @btime broadcast(+, eachslice($X;dims=3)...);
  24.900 μs (12 allocations: 313.09 KiB)

julia> @btime broadcast(+, ntuple(i->selectdim($X,3,i), Val(3))...);
  25.100 μs (2 allocations: 312.55 KiB)

The broadcast+eachslice version uses splatting to invoke dynamic dispatch to determine how many slices to sum at runtime. The ntuple version (which is just a different writing of the my_matrix_sum2 suggested by ffevotte above) uses a Val to lift this to compile time, avoiding dynamic dispatch if the number of slices is known at compile time.

But if these are inadequate in your general case, then the metaprogramming suggestions may become necessary.

1 Like

Thanks, this looks like it could work! I’ll need to test it out when its not late in the evening. In general, I would like a solution that could easily be modified to more complex operations (adding weights to each slice, etc.), but I can save that for another day. It’s also interesting that the ntuple(N) do i syntax has a similar speed since it reads like a for loop to me, which always tested slower on my machine.

I’m having a hard time finding the global. I did discover a typo in the original line, which was

@generated function str2ex(str, ch::NTuple{N}) where N

but str should be array. I’ve fixed the function above, does it still have a global in there with the fix?

For sure! I was a more than a little surprised by that myself.

Yeah, the size of the arrays should always be known ahead of time I think. I’m not sure yet how much flexibility I will need, but all these are awesome suggestions.