If I time running this just once, I see 1 allocation: 16 bytes
, although ideally it would have none – I’m not so sure what’s causing that.
julia> @btime @tullio variable = $A[q,1,1,1,1] * $A[q,1,1,1,1] # just one
23.008 ns (1 allocation: 16 bytes)
1.650247116059946
julia> @btime @tullio variable := $A[q,1,1,1,1]^2 # equivalent, seems to behave better
4.125 ns (0 allocations: 0 bytes)
1.650247116059946
Then it gets run in a loop, N
times:
julia> @btime funza_array_version(1, $A)
47.608 ns (2 allocations: 144 bytes)
julia> @btime ones(Int, 8); # the function also makes this, once
17.285 ns (1 allocation: 128 bytes) # and 144 = 128 + 16
julia> @btime funza_array_version(1_000, $A)
24.916 μs (1001 allocations: 15.75 KiB)
julia> @btime funza_array_version(1_000_000, $A)
24.120 ms (1000001 allocations: 15.26 MiB)
That matches what you see, I think. (Although when you hard-code the indices in the loop, the allocation goes away, unlike my single timing above). If I allow random indices and run just one:
julia> inds = rand(1:5, 8); # or ones(Int, 8), same times
julia> f2(A,inds) = @tullio v2 := A[q,inds[1],inds[2],inds[3],inds[4]] * A[q,inds[5],inds[6],inds[7],inds[8]];
julia> f3(A,inds) = begin
i1, i2, i3, i4, i5, i6, i7, i8 = inds
@tullio v2 := A[q,$i1,$i2,$i3,$i4] * A[q,$i5,$i6,$i7,$i8] # dollar means fixed not summed
end;
julia> @btime f2($A, $inds)
26.606 ns (1 allocation: 16 bytes)
1.2158793822673815
julia> @btime f3($A, $inds)
46.459 ns (1 allocation: 16 bytes)
1.2158793822673815
This f2
still seems a slightly odd thing to write. It inserts many checks like (issubset)((Tullio.extremerange)(@view(inds[1])), (axes)(A, 2)) || (throw)("extrema of index inds[1] must fit within A")
, which is a baroque way of doing bounds checking, since inds[1]
is actually just one number… because it’s designed for things like A[q,inds[i],inds[i]]
where you are indexing the indices, and doesn’t fully realise that they are just one number each. I’m a bit surprised f3
isnt’ faster.
However, I would probably write this operation as dot
(without @inbounds
unless you’re sure). This is simpler and faster:
julia> f4(A, inds) = @inbounds @views dot(A[:, inds[1:4]...], A[:, inds[5:8]...]);
julia> @btime f4($A, $inds)
8.383 ns (0 allocations: 0 bytes) # 10.511 ns without @inbounds
1.2158793822673815
julia> f5(A, inds) = @inbounds @views dot(A[:, inds[1],inds[2],inds[3],inds[4]], A[:, inds[5],inds[6],inds[7],inds[8]]);
julia> @btime f5($A, $inds)
8.383 ns (0 allocations: 0 bytes)
1.2158793822673815
Edit: This hand-written loop also seems faster:
julia> f6(A, indices) = begin
variable = zero(eltype(A))
for q in 1:5
@inbounds variable += A[q,indices[1],indices[2],indices[3],indices[4]]*A[q,indices[5],indices[6],indices[7],indices[8]]
end
variable
end;
julia> @btime f6($A, $inds)
2.416 ns (0 allocations: 0 bytes) # 14.111 ns without @inbounds
1.2158793822673815