Performance of creating a tuple with a for loop

I have a function “complement” that takes two tuples of ints and does a simple comparison to generate a third tuple. I would prefer to write it so that it works for any length of tuple, but that kills the performance (10x slower). It looks like the problem is that with a loop, an array gets created first and then converted into a tuple.

Preferred implementation:

ac(a, b) = a == b ? a : 6 - a - b

function complement1(a, b)
    return Tuple(ac(a[k], b[k]) for k in 1:length(a))
end

julia> a = (1, 2, 3, 2); b = (2, 1, 3, 2);
julia> @btime complement1(a, b)
  212.952 ns (2 allocations: 160 bytes)
(3, 3, 3, 2)

Fast implementation:

function complement4(a, b)
    return (ac(a[1], b[1]), ac(a[2], b[2]), ac(a[3], b[3]), ac(a[4], b[4]))
end

julia> @btime complement4(a, b)
  22.970 ns (1 allocation: 48 bytes)
(3, 3, 3, 2)

Does anyone know some kind of trick to get the compiler to unroll the loop / create a tuple directly, so I can write the general version and still get close to the speed of the 4-tuple version?

btw I tried a couple of alternatives to the first version, but they didn’t help:

# adding type hints
function complement1a(a::NTuple{4, Int64}, b::NTuple{4, Int64})::NTuple{4, Int64}
    return Tuple(ac(a[k], b[k]) for k in 1:4)
end

# splatting the array before turning it into a tuple
function complement2(a, b)
    return tuple([ac(a[k], b[k]) for k in 1:4]...)
end

# zipping the inputs together first
ac2(a) = ac(a[1], a[2])
function complement3(a, b)
    return Tuple(ac2.(zip(a,b)))
end

julia> @btime complement1a($a, $b)
  212.000 ns (2 allocations: 160 bytes)
(3, 3, 3, 2)
julia> @btime complement2($a, $b)
  216.601 ns (2 allocations: 160 bytes)
(3, 3, 3, 2)
julia> @btime complement3($a, $b)
  269.972 ns (4 allocations: 416 bytes)
(3, 3, 3, 2)

(Using Julia 1.5.1 on Windows).

1 Like

I found unzip on here a while back and it might help if you compose it with the other check.

julia> unzip(a) = map(x->getfield.(a, x), fieldnames(eltype(a)))
 unzip (generic function with 1 method)

julia> a = (1, 2, 3, 2); b = (2, 1, 3, 2);

julia> unzip((a,b))
 ((1, 2), (2, 1), (3, 3), (2, 2))

julia> @btime unzip(($a,$b));
  8.356 ns (0 allocations: 0 bytes)

Edit: a more complete example using ac

julia> @btime map(x->ac(x...),unzip(($a,$b)))
  6.918 ns (0 allocations: 0 bytes)
 (3, 3, 3, 2)
3 Likes

I’ve always used ntuple when creating tuples. Maybe this will suit your use case?

4 Likes

This works, thanks!

unzip(a) = map(x->getfield.(a, x), fieldnames(eltype(a)))
function complement5(a, b)
    return Tuple(ac2.(unzip((a,b))))
end
julia> @btime complement5(a,b)
  30.320 ns (1 allocation: 48 bytes)
(3, 3, 3, 2)

Could you explain why that’s so much faster than regular zip? Is there a trade-off?

To expand on what @pdeffebach said

complement(a::NTuple{N}, b::NTuple{N}) where N = ntuple(k -> ac(a[k], b[k]), Val{N}())

is pretty fast

julia> @btime complement($(Ref(a))[], $(Ref(b))[])
  2.370 ns (0 allocations: 0 bytes)
(3, 3, 3, 2)

Edit:

And doesn’t actually need the Val{N}() either:

julia> complement_2(a::NTuple{N}, b::NTuple{N}) where N = ntuple(k -> ac(a[k], b[k]), N)
complement_2 (generic function with 1 method)

julia> @btime complement_2($(Ref(a))[], $(Ref(b))[])
  2.372 ns (0 allocations: 0 bytes)
(3, 3, 3, 2)

Updated benchmark results as pointed out by @tomerarnon since using $a instead of $(Ref(a))[] results in benchmarking a function that doesn’t actual do much, just returning a constant (I believe).

5 Likes

I’m not 100% sure - I don’t know what zip does under the hood.

Sounds like @pdeffebach and @mike’s approach wins the speed race pretty hands down though.

0.012 ns (0 allocations: 0 bytes)

Dubious benchmark alert!

julia> @btime complement($(Ref(a))[], $(Ref(b))[])
  2.614 ns (0 allocations: 0 bytes)
(3, 3, 3, 2)

But still lightening fast!

Edit: complement_2 is the same, by the way:

julia> @btime complement_2($(Ref(a))[], $(Ref(b))[])
  2.614 ns (0 allocations: 0 bytes)
3 Likes

Yeah, defintely. 2ns is plenty fast enough :smiley:

2 Likes

Thanks, I’d never seen ntuple but it does seem to solve my problem. I’ll have to look into that some more and figure out what it’s doing!

It’s probably the coolest looking function in Base :rofl:

function ntuple(f::F, n::Integer) where F
    t = n == 0  ? () :
        n == 1  ? (f(1),) :
        n == 2  ? (f(1), f(2)) :
        n == 3  ? (f(1), f(2), f(3)) :
        n == 4  ? (f(1), f(2), f(3), f(4)) :
        n == 5  ? (f(1), f(2), f(3), f(4), f(5)) :
        n == 6  ? (f(1), f(2), f(3), f(4), f(5), f(6)) :
        n == 7  ? (f(1), f(2), f(3), f(4), f(5), f(6), f(7)) :
        n == 8  ? (f(1), f(2), f(3), f(4), f(5), f(6), f(7), f(8)) :
        n == 9  ? (f(1), f(2), f(3), f(4), f(5), f(6), f(7), f(8), f(9)) :
        n == 10 ? (f(1), f(2), f(3), f(4), f(5), f(6), f(7), f(8), f(9), f(10)) :
        _ntuple(f, n)
    return t
end
6 Likes

In this case, map or broadcast looks much cleaner:

complement(a, b) = map(ac, a, b)
# or
complement(a, b) = ac.(a, b)

Both don’t allocate with small tuples.

9 Likes

Can’t believe I didn’t think of that…

1 Like

I have no words to describe this beauty

3 Likes

And some people have the audacity to suggest that the ternary operator leads to uglier code.

4 Likes

Use this form for benchmarking very fast functions
(x and y are defined values for the arguments)

using BenchmarkTools
# note the semicolons after each setup var
# where x, y are of types that support `copy`, use
@btime fastfunction(a, b) setup=(a=copy($x); b=copy($y);)
# otherwise (with Tuple types and Symbols, for example) use
@btime fastfunction(a, b) setup=(a=$x; b=$y;)

That timing method didn’t seem to work:

julia> @btime ac.(a, b) setup=(a=(1, 2, 3, 2); b=(2, 1, 3, 2))
  0.001 ns (0 allocations: 0 bytes)
(3, 3, 3, 2)

The @code_llvm is mostly greek to me but it doesn’t look trivial:

;  @ C:\Users\erick\source\repos\SetJulia\tuple_compare.jl:40 within `complement7'
; Function Attrs: uwtable
define void @julia_complement7_10386([4 x i64]* noalias nocapture sret, [4 x i64]* nocapture nonnull readonly dereferenceable(32), [4 x i64]* nocapture nonnull readonly 
dereferenceable(32)) #0 {
top:
; ┌ @ broadcast.jl:837 within `materialize'
; │┌ @ broadcast.jl:1046 within `copy'
; ││┌ @ ntuple.jl:45 within `ntuple'
; │││┌ @ ntuple.jl:50 within `macro expansion'
; ││││┌ @ broadcast.jl:1046 within `#19'
; │││││┌ @ broadcast.jl:620 within `_broadcast_getindex'
; ││││││┌ @ broadcast.jl:644 within `_getindex'
; │││││││┌ @ broadcast.jl:599 within `_broadcast_getindex'
; ││││││││┌ @ tuple.jl:24 within `getindex'
           %3 = getelementptr inbounds [4 x i64], [4 x i64]* %1, i64 0, i64 0
; │││││││└└
; │││││││ @ broadcast.jl:644 within `_getindex' @ broadcast.jl:645
; │││││││┌ @ broadcast.jl:599 within `_broadcast_getindex'
; ││││││││┌ @ tuple.jl:24 within `getindex'
           %4 = getelementptr inbounds [4 x i64], [4 x i64]* %2, i64 0, i64 0
; ││││││└└└
; ││││││ @ broadcast.jl:621 within `_broadcast_getindex'
; ││││││┌ @ broadcast.jl:648 within `_broadcast_getindex_evalf'
; │││││││┌ @ C:\Users\erick\source\repos\SetJulia\tuple_compare.jl:8 within `ac'
; ││││││││┌ @ promotion.jl:398 within `=='
           %5 = bitcast i64* %3 to <4 x i64>*
           %6 = load <4 x i64>, <4 x i64>* %5, align 8
           %7 = bitcast i64* %4 to <4 x i64>*
           %8 = load <4 x i64>, <4 x i64>* %7, align 8
           %9 = icmp eq <4 x i64> %6, %8
; ││││││││└
          %10 = sub <4 x i64> <i64 6, i64 6, i64 6, i64 6>, %6
          %11 = sub <4 x i64> %10, %8
          %12 = select <4 x i1> %9, <4 x i64> %6, <4 x i64> %11
; └└└└└└└└
  %.sroa.0.0..sroa_idx = getelementptr inbounds [4 x i64], [4 x i64]* %0, i64 0, i64 0
  %13 = bitcast i64* %.sroa.0.0..sroa_idx to <4 x i64>*
  store <4 x i64> %12, <4 x i64>* %13, align 8
  ret void
}

I guess that explains

julia> @btime ntuple(x->x,10);
  0.034 ns (0 allocations: 0 bytes)

julia> @btime ntuple(x->x,11);
  419.266 ns (4 allocations: 336 bytes)
1 Like

oops. It doesn’t seem to make a difference though:

julia> @btime ac.(a, b) setup=(a=(1, 2, 3, 2); b=(2, 1, 3, 2);)
  0.001 ns (0 allocations: 0 bytes)
(3, 3, 3, 2)

the “setup” should use variable’s names
setup = (xvar=avar; yvar=bvar;)

julia> ac(a, b) = a == b ? a : 6 - a - b
julia> a=(1, 2, 3, 2); b=(2, 1, 3, 2);
julia> @btime ac.(x, y) setup=(x=$a; y=$b;)
  21.363 ns (1 allocation: 48 bytes)
(3, 3, 3, 2)
1 Like

If a CPU is running at 4 GHz (i.e., 4 clock cycles / ns), then 2ns means 8 clock cycles.
On Skylake-X, an fma instruction has a latency of 4 clock cycles, and the CPU can execute 2 of them per clock cycle.
Theoretically, that means it could start 8 fmas within the first nanosecond (4 cycles, 2 per cycle), and then all 8 of these would have finished executing by the end of the 2nd nanosecond.
Given AVX512, that means it could complete 128 Float64 floating point operations in 2 nanoseconds, or 64 Float64 operations on average per nanosecond.
While spending half that time not actually executing instructions at all. Running it for longer can get you closer to that 128/nanosecond:

julia> using LoopVectorization, BenchmarkTools

julia> function AmulB!(C, A, B)
           @avx for m ∈ axes(C,1), n ∈ axes(C,2)
               Cₘₙ = zero(eltype(C))
               for k ∈ axes(B,1)
                   Cₘₙ += A[m,k] * B[k,n]
               end
               C[m,n] = Cₘₙ
           end
           C
       end
AmulB! (generic function with 1 method)

julia> M = K = N = 72;

julia> A = rand(M, K); B = rand(K, N); C = Matrix{Float64}(undef, M, N);

julia> @benchmark AmulB!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.117 μs (0.00% GC)
  median time:      6.151 μs (0.00% GC)
  mean time:        6.161 μs (0.00% GC)
  maximum time:     18.320 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

julia> 2M*K*N / 6117
122.03629230014712

Average of 122 floating point operations completed per nanosecond.

A computer can get a lot done per nanosecond. :wink:
But anything under 1/GHz nanoseconds, i.e. less than a clock cycle, is unbelievable.

(Full disclosure, the computer I benchmarked on runs AVX512 code at 4.1 GHz rather than 4.0GHz; 4 is cleaner for explanations)

7 Likes