Performance of creating a tuple with a for loop

Yes and yes. Thank you for adding the (let’s say) clarification.

While you’re here: I tried running your example (on the current 1.6 nightly). What do I need to fix locally?

julia> using LoopVectorization

julia> function AmulB!(C, A, B)
                  @avx for m ∈ axes(C,1), n ∈ axes(C,2)
                      Cₘₙ = zero(eltype(C))
                      for k ∈ axes(B,1)
                          Cₘₙ += A[m,k] * B[k,n]
                      end
                      C[m,n] = Cₘₙ
                  end
                  C
              end
AmulB! (generic function with 1 method)

julia> M = K = N = 4;

julia> A = rand(M, K); B = rand(K, N); C = Matrix{Float64}(undef, M, N);

julia>  AmulB!(C, A, B)
ERROR: Module IR does not contain specified entry function
Stacktrace:
 [1] assume
   @ ~\.julia\packages\SIMDPirates\EVSvY\src\llvm_utils.jl:308 [inlined]
 [2] macro expansion
   @ ~\.julia\packages\LoopVectorization\pHMnJ\src\reconstruct_loopset.jl:503 [inlined]
 [3] _avx_!(::Val{(0, 0, 0, 4)}, ::Type{Tuple{:numericconstant, Symbol("##zero#276"), LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000000, 0x0000000000000003, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x01), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000013, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x01, 0x02), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000032, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x03), :LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x0000000000000132, 0x0000000000000003, 0x0000000000000000, 0x0000000000020301, LoopVectorization.compute, 0x00, 0x01), :LoopVectorization, :identity, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000003, 0x0000000000000000, 0x0000000000000004, LoopVectorization.compute, 0x00, 0x01), :LoopVectorization, :setindex!, LoopVectorization.OperationStruct(0x0000000000000012, 0x0000000000000003, 0x0000000000000000, 0x0000000000000005, LoopVectorization.memstore, 0x03, 0x04)}}, ::Type{Tuple{LoopVectorization.ArrayRefStruct{:A, Symbol("##vptr##_A")}(0x0000000000000101, 0x0000000000000103, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:B, Symbol("##vptr##_B")}(0x0000000000000101, 0x0000000000000302, 0x0000000000000000), LoopVectorization.ArrayRefStruct{:C, Symbol("##vptr##_C")}(0x0000000000000101, 0x0000000000000102, 0x0000000000000000)}}, ::Type{Tuple{0, Tuple{}, Tuple{}, Tuple{}, Tuple{}, Tuple{(1, LoopVectorization.IntOrFloat)}, Tuple{}}}, ::Type{Tuple{:m, :n, :k}}, ::Tuple{VectorizationBase.StaticLowerUnitRange{1}, VectorizationBase.StaticLowerUnitRange{1}, VectorizationBase.StaticLowerUnitRange{1}}, ::VectorizationBase.PackedStridedPointer{Float64, 1}, ::VectorizationBase.PackedStridedPointer{Float64, 1}, ::VectorizationBase.PackedStridedPointer{Float64, 1})
   @ LoopVectorization ~\.julia\packages\LoopVectorization\pHMnJ\src\reconstruct_loopset.jl:503
 [4] AmulB!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64})
   @ Main .\REPL[2]:2
 [5] top-level scope
   @ REPL[5]:1

I’m still working on adding Julia 1.6 support. I’m most of the way there.


I think that will just add back some of the overhead of global variables, e.g.:

julia> a = 1; b = 2
2

julia> @btime aa + bb setup = (aa=a; bb=b)
  11.992 ns (0 allocations: 0 bytes)
3

julia> @btime a + b
  16.999 ns (0 allocations: 0 bytes)
3

Oops, I forgot the "$"s [thank you, correcting above].

When a and b are of types that support copy, the docs suggest:
@btime fn(x, y) setup = (x = copy($a); y = copy($b);)
Otherwise (with Tuple types or Symbols, for example), use:
@btime fn(x, y) setup = (x = $a; y = $b;)

Without the $, the effect appears to be the global-variable overhead; with the $, it appears to be hoisting; and with the Ref trick we see times on the order of 2-3 nanoseconds again
(benchmarking is hard…).

julia> @btime ac.(a, b) setup = (a=$a; b=$b);
  0.028 ns (0 allocations: 0 bytes)

julia> @btime ac.(a, b) setup = (a=$(Ref(a))[]; b=$(Ref(b))[]);
  3.258 ns (0 allocations: 0 bytes)

Your benchmark earlier also featured semicolons in setup rather than commas. Any reason why this makes such a big difference (see below)? The only difference in the generated code is that in the ; case the variables are passed as a :block, and otherwise as a :tuple. I can't see why this would have any effect on hoisting or globalness, though.

julia> @btime ac.(a, b) setup = (a=$a, b=$b);
  55.319 ns (1 allocation: 48 bytes)

julia> @btime ac.(a, b) setup = (a=$a; b=$b);
  0.029 ns (0 allocations: 0 bytes)
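
The :block versus :tuple difference mentioned above can be checked with Base Julia alone, without BenchmarkTools. This is only a minimal sketch of how the two parenthesized forms parse; the variable names comma_expr and semi_expr are made up for illustration, and it says nothing about what @btime does with the expression afterwards.

```julia
# Parse the two spellings of the setup expression without evaluating them.
comma_expr = Meta.parse("(a = 1, b = 2)")   # comma-separated
semi_expr  = Meta.parse("(a = 1; b = 2)")   # semicolon-separated

# The comma form is a :tuple whose elements are `name = value` pairs,
# i.e. the syntax for a NamedTuple literal.
println(comma_expr.head)   # tuple

# The semicolon form is a :block containing two ordinary assignments.
println(semi_expr.head)    # block
```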

About the comma vs. the semicolon in setup:

When you use the same varnames with and without $
@btime ac.(a, b) setup=(a=$a, b=$b)
then the comma form runs without error.

If you use different varnames, then the ‘;’ is needed (the ‘,’ errors):
@btime ac.(x, y) setup=(x=$a; y=$b)

Here is an example where using the comma in setup=(_) instead of using the semicolon gives a benchmarking result where both the reported time and memory use increase.

julia> fn(a, b) = a > b ? a^b  : b^a
julia> a=(1, 2, 3, 2); b=(2, 1, 3, 2);

julia> @btime fn.(a, b) setup=(a=$a, b=$b)
  43.837 ns (1 allocation: 48 bytes)
(2, 2, 27, 4)

julia> @btime fn.(a, b) setup=(a=$a; b=$b)
  24.172 ns (0 allocations: 0 bytes)
(2, 2, 27, 4)

julia> # -------

julia> @benchmark fn.(a, b) setup=(a=$a; b=$b)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     24.597 ns (0.00% GC)
  median time:      25.101 ns (0.00% GC)
  mean time:        26.133 ns (0.00% GC)
  maximum time:     226.407 ns (0.00% GC)
  --------------
  samples:          100000
  evals/sample:     996

julia> @benchmark fn.(a, b) setup=(a=$a, b=$b)
BenchmarkTools.Trial:
  memory estimate:  48 bytes
  allocs estimate:  1
  --------------
  minimum time:     46.465 ns (0.00% GC)
  median time:      48.283 ns (0.00% GC)
  mean time:        52.186 ns (2.97% GC)
  maximum time:     2.214 μs (96.94% GC)
  --------------
  samples:          95992
  evals/sample:     990

I think using the comma creates a named tuple rather than assigning a and b, so the benchmark

julia> @btime fn.(a, b) setup=(a=$a, b=$b)
  43.837 ns (1 allocation: 48 bytes)
(2, 2, 27, 4)

is using the a and b defined in global scope.
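
That reading can be illustrated outside of BenchmarkTools. The sketch below uses two hypothetical helper functions (with_comma and with_semicolon) and hypothetical names (p_demo, q_demo) to show that the comma form builds a NamedTuple without binding any variables, while the semicolon form performs real assignments. It does not reproduce how @btime expands setup internally.

```julia
# Comma form: builds a NamedTuple; no variable named p_demo is created.
function with_comma(v)
    nt = (p_demo = v, q_demo = v)
    return nt, (@isdefined p_demo)   # second value reports whether p_demo is bound
end

# Semicolon form: a block of two assignments; p_demo and q_demo become locals.
function with_semicolon(v)
    (p_demo = v; q_demo = v)
    return (@isdefined p_demo), p_demo
end

nt, bound = with_comma((1, 2, 3, 2))
println(nt isa NamedTuple)   # true
println(bound)               # false, assuming no global p_demo exists

println(with_semicolon((1, 2, 3, 2)))   # (true, (1, 2, 3, 2))
```

If the comma form leaves the names unbound in the same way inside the benchmark, the benchmarked expression would fall back to whatever a and b exist in global scope, which would be consistent with the extra time and the allocation reported above.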
