Memory allocations when performing operation on Tuple of structs

Hello,

I’m facing a memory allocation problem on an apparently simple case. I have the following struct

struct MyStruct{T1<:AbstractMatrix, T2}
    data::T1
    other::T2
end

where data is a matrix and other is something that will not be involved in the next operations, but we will see it will lead to memory allocations.

Now I show how I get memory allocations when performing operations on a Tuple of these objects, with different types for the other field.

using LinearAlgebra
using BenchmarkTools

struct MyStruct{T1<:AbstractMatrix, T2}
    data::T1
    other::T2
end

struct MyStructSum{T}
    ops::T
end

function LinearAlgebra.mul!(y::AbstractVector, a::MyStruct, x::AbstractVector)
    mul!(y, a.data, x)
    return y
end

function LinearAlgebra.mul!(y::AbstractVector, a::MyStructSum, x::AbstractVector)
    for op in a.ops
        mul!(y, op, x)
    end
    return y
end

T = Float64
N = 200
M = 10
ops = ntuple(i-> rand(T, N, N), Val(M))
vals = ntuple(i-> rand() > 0.5 ? rand(Int64) : rand(Float64), Val(M))

my_structs = ntuple(i-> MyStruct(ops[i], vals[i]), Val(M))
my_sum_struct = MyStructSum(my_structs)

typeof(my_sum_struct)

x = rand(T, N)
y = similar(x)

mul!(y, my_sum_struct, x);

@benchmark mul!($y, $my_sum_struct, $x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  37.297 ΞΌs … 135.552 ΞΌs  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     37.902 ΞΌs               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   38.379 ΞΌs Β±   2.357 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

   β–ƒβ–‡β–ˆβ–ˆβ–†β–„β–ƒβ–        ▁▂▂▃▂▁                                      β–‚
  β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–‡β–†β–„β–†β–‡β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–ˆβ–‡β–‡β–†β–‡β–…β–‡β–‡β–†β–†β–†β–‡β–‡β–†β–†β–†β–…β–„β–…β–…β–…β–„β–„β–ƒβ–ƒβ–„β–„β–β–β–„β–„β–„β–β–ƒβ–„ β–ˆ
  37.3 ΞΌs       Histogram: log(frequency) by time      46.1 ΞΌs <

 Memory estimate: 1.86 KiB, allocs estimate: 19.

How course, if all the other fields have the same time, then I don’t have memory allocations. But in my complete implementation I need them, and in general they will be function types.

How can I solve this problem, knowing that the mul! operation is performed only in the data field, and it doesn’t use there other at all?

my_sum_struct.ops is a heterogeneous tuple, as it contains elements of both MyStruct{Matrix{Float64}, Float64} and MyStruct{Matrix{Float64}, Int64}. Iterating over heterogeneous tuples is type unstable as the compiler can’t figure out the type of my_sum_struct.ops[i] if i is a runtime variable. The solution is to unroll the loop, however this will work only if the length of my_sum_struct.ops is not too large. One simple way to do it is with ntuple

@inline Val_length(::NTuple{N, Any}) where N = Val(N)

function mul2!(y::AbstractVector, a::MyStructSum, x::AbstractVector) 
    VN = Val_length(a.ops)
    ntuple(VN) do i
        @inline
        mul!(y, a.ops[i], x)
    end
    return y
end
julia> @btime mul2!($y, $my_sum_struct, $x);
  35.600 ΞΌs (0 allocations: 0 bytes)
1 Like

Why are we constrained by this? What happens if we have large Tuples? Is it only a compilation issue, or also a run issue? What is the order of magnitude of the Tuple length to be considered ok?

BTW, thanks for the working example. But why do we need to define a dedicated function for returning the length of a tuple? Is that not known at compile time and thus just writing VN = Val(length(a.ops)) would work?

In your particular case looks like the compiler gives up at N>10 with ntuple. But a @generated function seems to do the trick for much larger N:

@generated function mul3!(y::AbstractVector, a::MyStructSum, x::AbstractVector, ::Val{N}) where N
    quote 
        Base.@nexprs $N i -> begin
            @inline 
            mul!(y, a.ops[i], x)
        end
        return y
    end
end

T = Float64;
N = 200
M = 200
ops = ntuple(i-> rand(T, N, N), Val(M))
vals = ntuple(i-> rand() > 0.5 ? rand(Int64) : rand(Float64), Val(M))

my_structs = ntuple(i-> MyStruct(ops[i], vals[i]), Val(M))
my_sum_struct = MyStructSum(my_structs)

yields

julia> using ChairMarks
julia> @b mul3!($(y, my_sum_struct, x, Val(length(my_sum_struct.ops)))...)
2.473 ms

And you are right, Val(length(a.ops)) is enough as it is known at compile time.

1 Like

Another way to do this is with foreach

1 Like

foreach looks indeed a better option than ntuple, but it still gives up at M>32 and it starts allocating. For M=33

julia> mul4!(y::AbstractVector, a::MyStructSum, x::AbstractVector) = foreach(ops -> mul!(y, ops, x), a.ops)
mul4! (generic function with 1 method)

julia> @b mul4!($(y, my_sum_struct, x)...)
122.300 ΞΌs (64 allocs: 18.000 KiB)
1 Like

Would it be too much of an edge case for the compiler to consider small union splitting for looping over a heterogenous tuple? In this case, since there are only two types, it seems like it would be possible to say my_sum_struct.ops[i]::Union{MyStruct{Matrix{Float64}, Float64}, MyStruct{Matrix{Float64}, Int64}}.

Thank you! It worked.

BTW, is the following code still good?

@generated function LinearAlgebra.mul!(y::AbstractVector, a::MyStructSum, x::AbstractVector)
    ops_types = a.parameters[1].parameters
    N = length(ops_types)  # Get the length of the tuple
    quote
        for i = 1:$N
            mul!(y, a.ops[i], x)
        end
        return y
    end
end

I still have zero allocations and there is no need to pass the length of the tuple as an argument.

I have just few additional questions. What is the purpose of using Base.@nexprs here? Why does the foreach function perform better, up to a defined length of the tuple? Does the @generated case have a limit in the length of the tuple?

That looks like a valid function to me, but you should test if it eventually gives up on unrolling the loop and ends up allocating. You could alternatively also do the following if you want to avoid explicitly passing Val(length(a.ops)) :

LinearAlgebra.mul!(y::AbstractVector, a::MyStructSum, x::AbstractVector) = LinearAlgebra.mul!(y, a, x, Val(length(a.ops)))
@generated function LinearAlgebra.mul!(y::AbstractVector, a::MyStructSum, x::AbstractVector, ::Val{N}) where N
    quote 
        Base.@nexprs $N i -> begin
            @inline 
            mul!(y, a.ops[i], x)
        end
        return y
    end
end

whatever version you prefer. Base.@nexprs is in charge of the unrolling. For example, Base.@nexprs 3 i -> mul!(y, a.ops[i], x) will literally generate the following code:

mul!(y, a.ops[1], x)
mul!(y, a.ops[2], x)
mul!(y, a.ops[3], x)

So I don’t expect this version to allocate memory; however, how large do you expect the length of the tuple to be? I guess you will see large compilation times if it’s really long (I don’t have an intuition here, you should test it).

As for foreach, it looks like it has a hard-coded manual unrolling up to N=31, and then it loops over the remaining elements of the tuple. You can see it here.

1 Like

Thank you for the reply. My Tuple may have an arbitrary number of elements, but usually less the 20 I would say. But it is better to support an arbitrary number of elements.