Avoid memory allocation in passing arrays

Hello everyone,

I’m trying to figure out how to avoid unnecessary memory allocation in the following example (I’m using the Tullio package, but I think it also applies in more general contexts):

using Tullio
using LoopVectorization 

function funza_array_version(N, A)

indices = Array{Int64}(undef, 8)  

fill!(indices, 1)       # This are supposed to be random in general
    
variable = 0    
    
    for i = 1:N
        
    @tullio variable = A[q,indices[1],indices[2],indices[3],indices[4]]*A[q,indices[5],indices[6],indices[7],indices[8]] 
        
    end    
    
end    
using Random

A = rand!(zeros(5,5,5,5,5))
N = 10^5

@timev funza_array_version(N, A)

0.003100 seconds (100.00 k allocations: 1.526 MiB)
elapsed time (ns): 3099694
bytes allocated:   1600144
pool allocs:       100001

As you can see, an allocation is created for each iteration, in my opinion (but I could be wrong) because a copy of the array “indices” is created, that specifies which indices are to be used in the contraction.

Of course, it goes without saying that this is a nonsense example per se (besides being a trivial contraction, the indices are fixed at 1 instead of being random, but it is just to give an example), but it clearly shows what’s wrong.

In general, how do I do such a contraction with variable indices (randomly determined within the bounds of the array size) without allocating memory?

Thanks a lot to everyone!

What’s q? If global, could be causing issues.

You’re just rebinding the result of a matrix multiplication to a variable. You’re not going to get rid of those allocations, since you’re not writing to an existing array. Written like that, neither Tullio nor LoopVectorization are going to influence allocations. In this case, iirc Tullio will create the resulting array, which explains the 100_001 allocations (for N = 10^5 iterations).

On top of that, both Tullio as well as LoopVectorizations require some sort of predictable access pattern (most often close by data, to allow for SIMD and reduce cache misses). If the accesses are by definition random, you’re not going to get good speed either way.

I don’t understand why if I use explicit indices:

 @tullio variable = A[q,1,1,1,1]*A[q,1,1,1,1]

then I get:

julia> @timev funza_array_version(N, A)
  0.001012 seconds (1 allocation: 144 bytes)
elapsed time (ns): 1012500
bytes allocated:   144
pool allocs:       1

That is, if I use explicit indices then the memory is not allocated… I would like to do the same, but with random indices. Is that possible?

Julia can’t know whether the indices in indices are inbounds for A, so it’ll have to insert a bunch of bounds checks. If you hardcode the indices, the compiler can tell “well the array has at least one element since it has a dimension greater than 0” and it will actually just work. You can index with extra ones (but only ones!) and julia won’t care:

julia> a = Vector{Int}(undef, 5)     
5-element Vector{Int64}:             
 140520142667888                     
 140520142667952                     
 140520142668016                     
 140520142668080                     
 140520087110896                     
                                     
julia> a[1,1,1,1,1,1,1,1,1,1,1,1,1,1]
140520142667888                      

So when you hardcode those 1 in there, it’s the same thing. It can’t ever go out of bounds, so all bounds checks and potential error message allocations etc. are gone.


That aside, your function doesn’t make much sense to me. You’re recomputing the same thing over and over again for N iterations, with no change and not even saving the intermediate result. What are you really trying to achieve?

1 Like

I think @Sukera is right. I also think there are issues with q apparently being global and therefore type unstable. Within funza_array_version, compiler doesn’t know whether to expect a matrix multiplication or a scalar multiplication or who knows what else. That might also account for some allocations.

I tried with the macro @inbounds but I still have the same number of allocations…

you’re right: as I stated in the post this is a total nosense example, which is only meant to show the high number of allocations due to the circumstance in which the unsumed indices are variable by using Tullio

q should be interpreted by Tullio as a summed index (notice that it appears twice), so it is not global and this should not be the reason of the allocations

1 Like

To be 100% clear, this is what I meant:

julia> function funza_array_version(N, A)

       indices = Array{Int64}(undef, 8)

       fill!(indices, 2)       # This are supposed to be random in general

       variable = 0

           for i = 1:N
             for q=1:1:5
             variable += A[q,indices[1],indices[2],indices[3],indices[4]]*A[q,indices[5],indices[6],indices[7],indices[8]]
             end
           end
        end

funza_array_version (generic function with 1 method)

julia> @timev funza_array_version(N, A)
  0.000004 seconds (1 allocation: 144 bytes)
elapsed time (ns): 4100
bytes allocated:   144
pool allocs:       1

I was wondering if there is a way to do the same thing but using Tullio in order to do the contraction since it’s a little faster…

If I time running this just once, I see 1 allocation: 16 bytes, although ideally it would have none – I’m not so sure what’s causing that.

julia> @btime @tullio variable = $A[q,1,1,1,1] * $A[q,1,1,1,1]  # just one
  23.008 ns (1 allocation: 16 bytes)
1.650247116059946

julia> @btime @tullio variable := $A[q,1,1,1,1]^2  # equivalent, seems to behave better
  4.125 ns (0 allocations: 0 bytes)
1.650247116059946

Then it gets run in a loop, N times:

julia> @btime funza_array_version(1, $A)
  47.608 ns (2 allocations: 144 bytes)

julia> @btime ones(Int, 8);  # the function also makes this, once
  17.285 ns (1 allocation: 128 bytes) # and 144 = 128 + 16

julia> @btime funza_array_version(1_000, $A)
  24.916 μs (1001 allocations: 15.75 KiB)

julia> @btime funza_array_version(1_000_000, $A)
  24.120 ms (1000001 allocations: 15.26 MiB)

That matches what you see, I think. (Although when you hard-code the indices in the loop, the allocation goes away, unlike my single timing above). If I allow random indices and run just one:

julia> inds = rand(1:5, 8);  # or ones(Int, 8), same times

julia> f2(A,inds) = @tullio v2 := A[q,inds[1],inds[2],inds[3],inds[4]] * A[q,inds[5],inds[6],inds[7],inds[8]];

julia> f3(A,inds) = begin
       i1, i2, i3, i4, i5, i6, i7, i8 = inds
       @tullio v2 := A[q,$i1,$i2,$i3,$i4] * A[q,$i5,$i6,$i7,$i8]  # dollar means fixed not summed
       end;

julia> @btime f2($A, $inds)
  26.606 ns (1 allocation: 16 bytes)
1.2158793822673815

julia> @btime f3($A, $inds)
  46.459 ns (1 allocation: 16 bytes)
1.2158793822673815

This f2 still seems a slightly odd thing to write. It inserts many checks like (issubset)((Tullio.extremerange)(@view(inds[1])), (axes)(A, 2)) || (throw)("extrema of index inds[1] must fit within A"), which is a baroque way of doing bounds checking, since inds[1] is actually just one number… because it’s designed for things like A[q,inds[i],inds[i]] where you are indexing the indices, and doesn’t fully realise that they are just one number each. I’m a bit surprised f3 isnt’ faster.

However, I would probably write this operation as dot (without @inbounds unless you’re sure). This is simpler and faster:

julia> f4(A, inds) = @inbounds @views dot(A[:, inds[1:4]...], A[:, inds[5:8]...]);

julia> @btime f4($A, $inds)
  8.383 ns (0 allocations: 0 bytes)  # 10.511 ns without @inbounds
1.2158793822673815

julia> f5(A, inds) = @inbounds @views dot(A[:, inds[1],inds[2],inds[3],inds[4]], A[:, inds[5],inds[6],inds[7],inds[8]]);

julia> @btime f5($A, $inds)
  8.383 ns (0 allocations: 0 bytes)
1.2158793822673815

Edit: This hand-written loop also seems faster:

julia> f6(A, indices) = begin
                 variable = zero(eltype(A))
                 for q in 1:5
                    @inbounds variable += A[q,indices[1],indices[2],indices[3],indices[4]]*A[q,indices[5],indices[6],indices[7],indices[8]]
                 end
                 variable
               end;

julia> @btime f6($A, $inds)
  2.416 ns (0 allocations: 0 bytes)  # 14.111 ns without @inbounds
1.2158793822673815
7 Likes

Yours is a very complete answer: thank you very much for that.

Unfortunately, even calling your functions (using Tullio) inside a loop still allocates an amount of memory proportional to the iterations when using indices which are not “hard” (that is, 1,2 etc.).

For now I have solved the problem with some trivial for loops and using @inbounds, in this way it is almost as fast as with Tullio. I can’t use dot () because in reality the tensor contractions I have to do are much less trivial than the one shown in this example.

P.S. i don’t know why but i get weird errors when i use $…

Sure. It looks like the stray allocation has something to do with threading… this isn’t multi-threaded, but it does check the size to decide whether it should be:

julia> M = [1 2; 3 4.0];

julia> @btime @tullio _ = $M[q,1] * $M[q,1]
  22.381 ns (1 allocation: 16 bytes)
10.0

julia> @btime @tullio _ = $M[q,1] * $M[q,1]  threads=false
  1.583 ns (0 allocations: 0 bytes)
10.0

It should do the following, and if it doesn’t in some context, that’s a bug:

julia> j = 2;

julia> @tullio _[i] := M[i,$j]  # fixed j
2-element Vector{Float64}:
 2.0
 4.0

julia> @tullio _[i] := M[i,j]  # dummy j, summed over
2-element Vector{Float64}:
 3.0
 7.0

My benchmarks above also have $A which is used & removed by @btime, perhaps a confusing combination.

4 Likes

Although unfortunately the problem was not solved, your answers were very helpful.

Thanks so much again! I hope that someone (perhaps the creators of the package themselves) can clarify in the future the reason for such allocations.

I just saw that Tullio’s GitHub repository is yours, congratulations!

If you can “fix” this in the future let me know, as I think it’s a great package.

1 Like

Heh. Glad if it’s useful! I made https://github.com/mcabbott/Tullio.jl/issues/124 but haven’t really looked into it yet.

2 Likes