Performance of creating a tuple with a for loop

dogiv · September 14, 2020, 8:00pm

I have a function “complement” that takes two tuples of ints and does a simple comparison to generate a third tuple. I would prefer to write it so that it works for any length of tuple, but that kills the performance (10x slower). It looks like the problem is that with a loop, an array gets created first and then converted into a tuple.

Preferred implementation:

ac(a, b) = a == b ? a : 6 - a - b

function complement1(a, b)
    return Tuple(ac(a[k], b[k]) for k in 1:length(a))
end

julia> a = (1, 2, 3, 2); b = (2, 1, 3, 2);
julia> @btime complement1(a, b)
  212.952 ns (2 allocations: 160 bytes)
(3, 3, 3, 2)

Fast implementation:

function complement4(a, b)
    return (ac(a[1], b[1]), ac(a[2], b[2]), ac(a[3], b[3]), ac(a[4], b[4]))
end

julia> @btime complement4(a, b)
  22.970 ns (1 allocation: 48 bytes)
(3, 3, 3, 2)

Does anyone know some kind of trick to get the compiler to unroll the loop / create a tuple directly, so I can write the general version and still get close to the speed of the 4-tuple version?

btw I tried a couple of alternatives to the first version, but they didn’t help:

# adding type hints
function complement1a(a::NTuple{4, Int64}, b::NTuple{4, Int64})::NTuple{4, Int64}
    return Tuple(ac(a[k], b[k]) for k in 1:4)
end

# splatting the array before turning it into a tuple
function complement2(a, b)
    return tuple([ac(a[k], b[k]) for k in 1:4]...)
end

# zipping the inputs together first
ac2(a) = ac(a[1], a[2])
function complement3(a, b)
    return Tuple(ac2.(zip(a,b)))
end

julia> @btime complement1a($a, $b)
  212.000 ns (2 allocations: 160 bytes)
(3, 3, 3, 2)
julia> @btime complement2($a, $b)
  216.601 ns (2 allocations: 160 bytes)
(3, 3, 3, 2)
julia> @btime complement3($a, $b)
  269.972 ns (4 allocations: 416 bytes)
(3, 3, 3, 2)

(Using Julia 1.5.1 on Windows).

jlchan · September 14, 2020, 8:07pm

I found unzip on here a while back and it might help if you compose it with the other check.

julia> unzip(a) = map(x->getfield.(a, x), fieldnames(eltype(a)))
 unzip (generic function with 1 method)

julia> a = (1, 2, 3, 2); b = (2, 1, 3, 2);

julia> unzip((a,b))
 ((1, 2), (2, 1), (3, 3), (2, 2))

julia> @btime unzip(($a,$b));
  8.356 ns (0 allocations: 0 bytes)

Edit: a more complete example using ac

julia> @btime map(x->ac(x...),unzip(($a,$b)))
  6.918 ns (0 allocations: 0 bytes)
 (3, 3, 3, 2)

pdeffebach · September 14, 2020, 8:10pm

I’ve always used ntuple when creating tuples. Maybe this will suit your use case?

dogiv · September 14, 2020, 8:11pm

This works, thanks!

unzip(a) = map(x->getfield.(a, x), fieldnames(eltype(a)))
function complement5(a, b)
    return Tuple(ac2.(unzip((a,b))))
end
julia> @btime complement5(a,b)
  30.320 ns (1 allocation: 48 bytes)
(3, 3, 3, 2)

Could you explain why that’s so much faster than regular zip? Is there a trade-off?

mike · September 14, 2020, 8:13pm

To expand on what @pdeffebach said

complement(a::NTuple{N}, b::NTuple{N}) where N = ntuple(k -> ac(a[k], b[k]), Val{N}())

is pretty fast

julia> @btime complement($(Ref(a))[], $(Ref(b))[])
  2.370 ns (0 allocations: 0 bytes)
(3, 3, 3, 2)

Edit:

And doesn’t actually need the Val{N}() either:

julia> complement_2(a::NTuple{N}, b::NTuple{N}) where N = ntuple(k -> ac(a[k], b[k]), N)
complement_2 (generic function with 1 method)

julia> @btime complement_2($(Ref(a))[], $(Ref(b))[])
  2.372 ns (0 allocations: 0 bytes)
(3, 3, 3, 2)

Updated benchmark results as pointed out by @tomerarnon since using $a instead of $(Ref(a))[] results in benchmarking a function that doesn’t actual do much, just returning a constant (I believe).

jlchan · September 14, 2020, 8:16pm

I’m not 100% sure - I don’t know what zip does under the hood.

Sounds like @pdeffebach and @mike’s approach wins the speed race pretty hands down though.

tomerarnon · September 14, 2020, 8:17pm

0.012 ns (0 allocations: 0 bytes)

Dubious benchmark alert!

julia> @btime complement($(Ref(a))[], $(Ref(b))[])
  2.614 ns (0 allocations: 0 bytes)
(3, 3, 3, 2)

But still lightening fast!

Edit: complement_2 is the same, by the way:

julia> @btime complement_2($(Ref(a))[], $(Ref(b))[])
  2.614 ns (0 allocations: 0 bytes)

mike · September 14, 2020, 8:18pm

Yeah, defintely. 2ns is plenty fast enough

dogiv · September 14, 2020, 8:25pm

Thanks, I’d never seen ntuple but it does seem to solve my problem. I’ll have to look into that some more and figure out what it’s doing!

mike · September 14, 2020, 8:32pm

It’s probably the coolest looking function in Base

function ntuple(f::F, n::Integer) where F
    t = n == 0  ? () :
        n == 1  ? (f(1),) :
        n == 2  ? (f(1), f(2)) :
        n == 3  ? (f(1), f(2), f(3)) :
        n == 4  ? (f(1), f(2), f(3), f(4)) :
        n == 5  ? (f(1), f(2), f(3), f(4), f(5)) :
        n == 6  ? (f(1), f(2), f(3), f(4), f(5), f(6)) :
        n == 7  ? (f(1), f(2), f(3), f(4), f(5), f(6), f(7)) :
        n == 8  ? (f(1), f(2), f(3), f(4), f(5), f(6), f(7), f(8)) :
        n == 9  ? (f(1), f(2), f(3), f(4), f(5), f(6), f(7), f(8), f(9)) :
        n == 10 ? (f(1), f(2), f(3), f(4), f(5), f(6), f(7), f(8), f(9), f(10)) :
        _ntuple(f, n)
    return t
end

Vasily_Pisarev · September 14, 2020, 8:37pm

In this case, map or broadcast looks much cleaner:

complement(a, b) = map(ac, a, b)
# or
complement(a, b) = ac.(a, b)

Both don’t allocate with small tuples.

dogiv · September 14, 2020, 8:38pm

Can’t believe I didn’t think of that…

jlchan · September 14, 2020, 8:47pm

I have no words to describe this beauty

Henrique_Becker · September 14, 2020, 9:12pm

And some people have the audacity to suggest that the ternary operator leads to uglier code.

JeffreySarnoff · September 14, 2020, 9:43pm

Use this form for benchmarking very fast functions
(x and y are defined values for the arguments)

using BenchmarkTools
# note the semicolons after each setup var
# where x, y are of types that support `copy`, use
@btime fastfunction(a, b) setup=(a=copy($x); b=copy($y);)
# otherwise (with Tuple types and Symbols, for example) use
@btime fastfunction(a, b) setup=(a=$x; b=$y;)

dogiv · September 14, 2020, 9:53pm

That timing method didn’t seem to work:

julia> @btime ac.(a, b) setup=(a=(1, 2, 3, 2); b=(2, 1, 3, 2))
  0.001 ns (0 allocations: 0 bytes)
(3, 3, 3, 2)

The @code_llvm is mostly greek to me but it doesn’t look trivial:

;  @ C:\Users\erick\source\repos\SetJulia\tuple_compare.jl:40 within `complement7'
; Function Attrs: uwtable
define void @julia_complement7_10386([4 x i64]* noalias nocapture sret, [4 x i64]* nocapture nonnull readonly dereferenceable(32), [4 x i64]* nocapture nonnull readonly 
dereferenceable(32)) #0 {
top:
; ┌ @ broadcast.jl:837 within `materialize'
; │┌ @ broadcast.jl:1046 within `copy'
; ││┌ @ ntuple.jl:45 within `ntuple'
; │││┌ @ ntuple.jl:50 within `macro expansion'
; ││││┌ @ broadcast.jl:1046 within `#19'
; │││││┌ @ broadcast.jl:620 within `_broadcast_getindex'
; ││││││┌ @ broadcast.jl:644 within `_getindex'
; │││││││┌ @ broadcast.jl:599 within `_broadcast_getindex'
; ││││││││┌ @ tuple.jl:24 within `getindex'
           %3 = getelementptr inbounds [4 x i64], [4 x i64]* %1, i64 0, i64 0
; │││││││└└
; │││││││ @ broadcast.jl:644 within `_getindex' @ broadcast.jl:645
; │││││││┌ @ broadcast.jl:599 within `_broadcast_getindex'
; ││││││││┌ @ tuple.jl:24 within `getindex'
           %4 = getelementptr inbounds [4 x i64], [4 x i64]* %2, i64 0, i64 0
; ││││││└└└
; ││││││ @ broadcast.jl:621 within `_broadcast_getindex'
; ││││││┌ @ broadcast.jl:648 within `_broadcast_getindex_evalf'
; │││││││┌ @ C:\Users\erick\source\repos\SetJulia\tuple_compare.jl:8 within `ac'
; ││││││││┌ @ promotion.jl:398 within `=='
           %5 = bitcast i64* %3 to <4 x i64>*
           %6 = load <4 x i64>, <4 x i64>* %5, align 8
           %7 = bitcast i64* %4 to <4 x i64>*
           %8 = load <4 x i64>, <4 x i64>* %7, align 8
           %9 = icmp eq <4 x i64> %6, %8
; ││││││││└
          %10 = sub <4 x i64> <i64 6, i64 6, i64 6, i64 6>, %6
          %11 = sub <4 x i64> %10, %8
          %12 = select <4 x i1> %9, <4 x i64> %6, <4 x i64> %11
; └└└└└└└└
  %.sroa.0.0..sroa_idx = getelementptr inbounds [4 x i64], [4 x i64]* %0, i64 0, i64 0
  %13 = bitcast i64* %.sroa.0.0..sroa_idx to <4 x i64>*
  store <4 x i64> %12, <4 x i64>* %13, align 8
  ret void
}

jlchan · September 14, 2020, 10:08pm

I guess that explains

julia> @btime ntuple(x->x,10);
  0.034 ns (0 allocations: 0 bytes)

julia> @btime ntuple(x->x,11);
  419.266 ns (4 allocations: 336 bytes)

dogiv · September 14, 2020, 10:44pm

oops. It doesn’t seem to make a difference though:

julia> @btime ac.(a, b) setup=(a=(1, 2, 3, 2); b=(2, 1, 3, 2);)
  0.001 ns (0 allocations: 0 bytes)
(3, 3, 3, 2)

JeffreySarnoff · September 14, 2020, 11:22pm

the “setup” should use variable’s names
setup = (xvar=avar; yvar=bvar;)

julia> ac(a, b) = a == b ? a : 6 - a - b
julia> a=(1, 2, 3, 2); b=(2, 1, 3, 2);
julia> @btime ac.(x, y) setup=(x=$a; y=$b;)
  21.363 ns (1 allocation: 48 bytes)
(3, 3, 3, 2)

Elrod · September 15, 2020, 6:19am

If a CPU is running at 4 GHz (i.e., 4 clock cycles / ns), then 2ns means 8 clock cycles.
On Skylake-X, an fma instruction has a latency of 4 clock cycles, and the CPU can execute 2 of them per clock cycle.
Theoretically, that means it could start 8 fmas within the first nanosecond (4 cycles, 2 per cycle), and then all 8 of these would have finished executing by the end of the 2nd nanosecond.
Given AVX512, that means it could complete 128 Float64 floating point operations in 2 nanoseconds, or 64 Float64 operations on average per nanosecond.
While spending half that time not actually executing instructions at all. Running it for longer can get you closer to that 128/nanosecond:

julia> using LoopVectorization, BenchmarkTools

julia> function AmulB!(C, A, B)
           @avx for m ∈ axes(C,1), n ∈ axes(C,2)
               Cₘₙ = zero(eltype(C))
               for k ∈ axes(B,1)
                   Cₘₙ += A[m,k] * B[k,n]
               end
               C[m,n] = Cₘₙ
           end
           C
       end
AmulB! (generic function with 1 method)

julia> M = K = N = 72;

julia> A = rand(M, K); B = rand(K, N); C = Matrix{Float64}(undef, M, N);

julia> @benchmark AmulB!($C, $A, $B)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.117 μs (0.00% GC)
  median time:      6.151 μs (0.00% GC)
  mean time:        6.161 μs (0.00% GC)
  maximum time:     18.320 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

julia> 2M*K*N / 6117
122.03629230014712

Average of 122 floating point operations completed per nanosecond.

A computer can get a lot done per nanosecond.
But anything under 1/GHz nanoseconds, i.e. less than a clock cycle, is unbelievable.

(Full disclosure, the computer I benchmarked on runs AVX512 code at 4.1 GHz rather than 4.0GHz; 4 is cleaner for explanations)

Topic		Replies	Views
Function array performance Performance	5	415	April 2, 2022
Assigning a Tuple to sub-entries of Tuple of Arrays New to Julia performance , array , tuple	2	486	November 8, 2019
Performance tips on turning arrays into tuples New to Julia	15	1142	February 20, 2021
"Composing" functions Performance	1	580	May 11, 2019
Should Julia be able to optimize away small temporary arrays? New to Julia question , performance , array , tuple	10	1225	July 23, 2022

Performance of creating a tuple with a for loop

Related topics