Allocations in a reduction over a Tuple

I’m trying to write a recursive reduction over a Tuple that, every time it encounters a nested Tuple, takes the inverse of that Tuple’s reduction. I.e. for x = (1, (2, 3)), foo(x) should compute f(1) + 1/(f(2) + f(3)). The simplified MWE below works well in the absence of recursion, but otherwise it starts allocating. Since foo() ideally needs to run several hundred thousand times, I need it not to allocate any memory if possible. On a side note, I also do not understand the difference in performance and memory allocations between the first and second @btime foo($x1) calls.

struct myStruct1{T}
    x::T
end

struct myStruct2{T}
    x::T
end

@inline f(x::myStruct1, a) = a/x.x*2
@inline f(x::myStruct2, a) = a*x.x/2

@inline function foo(x) 
    c = Ref(0.0)
    ntuple(Val(length(x))) do i 
        Base.@_inline_meta
        if !(x[i] isa Tuple)
            c[]+= f(x[i], rand())
        else
            c[]+= 1/(foo(x[i]))
        end
    end
    c[]
end

a = myStruct1(1)
b = myStruct2(1)
x1 =  (a, ((b, a), b))
x2 =  (a, b, a, b)

@btime foo($x1) # 1st call 33.710 ns (3 allocations: 96 bytes)
@btime foo($x1) # 2nd call 17.940 ns (1 allocation: 16 bytes)
@btime foo($x2) # 8.373 ns (0 allocations: 0 bytes)

A warm welcome to the community.

I think the problem you encounter is that

ntuple(Val(length(x))) do i 
    Base.@_inline_meta
    if !(x[i] isa Tuple)
        c[]+= f(x[i], rand())
    else
        c[]+= 1/(foo(x[i]))
    end
end

creates a tuple which is not used but which, it seems, is not optimized away by the compiler.


You can get what you want with a different kind of metaprogramming magic.

bar(x) = bar(x,Val(length(x)))

@inline @generated function bar(x,::Val{N}) where N
	quote
		c = 0.0
		Base.Cartesian.@nexprs $N i -> c += x[i] isa Tuple ? 1/(bar(x[i],Val(length(x[i])))) : f(x[i],rand())
		return c
	end
end
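For intuition: Base.Cartesian.@nexprs $N i -> expr simply pastes N copies of expr with i substituted by 1, …, N, so nothing is looped over or collected at run time. For a flat 2-tuple, the body that bar generates is roughly equivalent to the hand-expanded sketch below (g is a deterministic stand-in for the rand-based f, so the result is checkable; bar2 is a hypothetical name):

```julia
g(x) = 2.0 * x  # deterministic stand-in for f(x, rand())

# Hand-written equivalent of what `@nexprs 2 i -> c += ...` unrolls to:
function bar2(x)
    c = 0.0
    c += x[1] isa Tuple ? 1/bar2(x[1]) : g(x[1])
    c += x[2] isa Tuple ? 1/bar2(x[2]) : g(x[2])
    return c
end

bar2((1, (2, 3)))  # 2*1 + 1/(2*2 + 2*3) ≈ 2.1
```

Since the number of statements is fixed by N at compile time, no intermediate tuple is ever built, which is why the generated version does not allocate.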

yields

a = myStruct1(1)
b = myStruct2(1)
x1 =  ((b, (a, b)), (((a, (a,(b,b))), a), b))

@btime bar($x1) # 36.172 ns (0 allocations: 0 bytes)
@btime foo($x1) # 154.228 ns (12 allocations: 432 bytes)

On your side note: if you evaluate foo(x1) once prior to the benchmark, both benchmarks yield the same result. Actually, this should not happen with BenchmarkTools; I have no idea why it does.


Alternatively, recursive calls can also work well when iterating over heterogeneous tuples like yours:

@inline goo(x::Tuple) = _goo(0.0, x...)
@inline _goo(c, xi::Tuple, args...) = _goo(c + 1 / goo(xi), args...)
@inline _goo(c, xi, args...) = _goo(c + f(xi, rand()), args...)
@inline _goo(c) = c  # this is the final result
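To see the same tail-recursive splatting pattern with a verifiable result, here is a sketch on a deterministic toy reduction (sq replaces the rand-dependent f; hoo/_hoo are hypothetical names):

```julia
# Deterministic analogue of goo: sum of squares, with 1/(...) for nested tuples.
sq(x) = float(x)^2

@inline hoo(x::Tuple) = _hoo(0.0, x...)
@inline _hoo(c, xi::Tuple, args...) = _hoo(c + 1 / hoo(xi), args...)
@inline _hoo(c, xi, args...) = _hoo(c + sq(xi), args...)
@inline _hoo(c) = c  # all elements consumed: return the accumulator

hoo((1, (2, 3)))  # 1^2 + 1/(2^2 + 3^2) = 1 + 1/13
```

Because each call peels one element off the argument list, dispatch resolves every branch at compile time and the whole chain inlines away, just like the unrolled generated function.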

In my tests this is as efficient as @moeddel’s nice solution using @generated functions:

julia> x1 =  ((b, (a, b)), (((a, (a,(b,b))), a), b))

julia> @btime foo($x1)
  178.637 ns (12 allocations: 432 bytes)

julia> @btime bar($x1)
  37.526 ns (0 allocations: 0 bytes)

julia> @btime goo($x1)
  36.720 ns (0 allocations: 0 bytes)