Unnecessary `where {T}` causes huge performance drop?

This might be a silly question, but I found something I do not understand. I was playing around with different implementations of a function to check the performance changes, and sometimes I left an unnecessary `where {T}` in the function signature. Then I was puzzled by weird results in the `@btime` output and could not reproduce the fastest run, until I figured out that `where {T}` has an impact, even if no `T` is defined or used anywhere:

```julia
function f1(x) where {T}
    return x*2
end

function f2(x)
    return x*2
end

function f3(::Type{T}, x) where {T}
    return x*2
end
```

I thought all three functions would compile to the same machine code, but `f1` is not playing along:

```
a = rand(1000);

@btime f1.($a);
  14.736 μs (2001 allocations: 39.19 KiB)

@btime f2.($a);
  471.755 ns (1 allocation: 7.94 KiB)

@btime f3.(Bool, $a);
  445.223 ns (1 allocation: 7.94 KiB)
```

What are those allocations and what’s happening here?
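A rough accounting of the `f1.($a)` numbers is possible if we assume each extra allocation is a 16-byte box (an 8-byte type tag plus the 8-byte `Float64`, matching the `i32 16` size passed to `jl_gc_pool_alloc` in the `@code_llvm` output of `f1`). Under that assumption the totals work out to exactly two boxes per element on top of the output array. This is a back-of-the-envelope sketch, not something confirmed in the thread:

```julia
# Reconstructing the @btime numbers for f1.($a) with a = rand(1000).
# Assumption: every allocation beyond the output array is a 16-byte box
# (8-byte type tag + 8-byte Float64 payload).
n = 1000                          # length(a)
out_bytes = 7.94 * 1024           # output array, as reported for f2/f3 (1 allocation)
box_bytes = 8 + sizeof(Float64)   # assumed tag + payload = 16 bytes
extra_allocs = 2001 - 1           # allocations beyond the output array
per_element = extra_allocs ÷ n    # boxes per broadcast element
total_kib = (out_bytes + extra_allocs * box_bytes) / 1024

println(per_element, " boxes per element, ≈ ", round(total_kib, digits=2), " KiB")
```

One plausible reading is one box for the argument (the `japi3` signature of `f1` takes its arguments as boxed `%jl_value_t` pointers) and one for the returned value. The thread itself does not confirm this split.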

```
julia> @code_llvm f1(1.0)

;  @ REPL[1]:2 within `f1'
; Function Attrs: uwtable
define nonnull %jl_value_t addrspace(10)* @japi3_f1_16234(%jl_value_t addrspace(10)**, %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)**, i32) #0 {
top:
  %4 = alloca %jl_value_t addrspace(10)**, align 8
  store volatile %jl_value_t addrspace(10)** %2, %jl_value_t addrspace(10)*** %4, align 8
  %5 = call %jl_value_t*** inttoptr (i64 1801343456 to %jl_value_t*** ()*)() #3
  %6 = bitcast %jl_value_t addrspace(10)** %2 to double addrspace(10)**
  %7 = load double addrspace(10)*, double addrspace(10)** %6, align 8
; ┌ @ promotion.jl:314 within `*' @ float.jl:399
   %8 = load double, double addrspace(10)* %7, align 8
   %9 = fmul double %8, 2.000000e+00
; └
  %10 = bitcast %jl_value_t*** %5 to i8*
  %11 = call noalias nonnull %jl_value_t addrspace(10)* @jl_gc_pool_alloc(i8* %10, i32 1744, i32 16) #1
  %12 = bitcast %jl_value_t addrspace(10)* %11 to %jl_value_t addrspace(10)* addrspace(10)*
  %13 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)* addrspace(10)* %12, i64 -1
  store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 114258656 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)* addrspace(10)* %13
  %14 = bitcast %jl_value_t addrspace(10)* %11 to double addrspace(10)*
  store double %9, double addrspace(10)* %14, align 8
  ret %jl_value_t addrspace(10)* %11
}

julia> @code_llvm f2(1.0)

;  @ REPL[2]:2 within `f2'
; Function Attrs: uwtable
define double @julia_f2_16235(double) #0 {
top:
; ┌ @ promotion.jl:314 within `*' @ float.jl:399
   %1 = fmul double %0, 2.000000e+00
; └
  ret double %1
}
```

To me this looks like LLVM is creating a Float64 from whatever is coming in, but it’s more guessing than understanding.

I don’t understand why the LLVM code differs at all :confused:
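To compare the two functions without reading the whole dump, the IR can be captured as a string with `sprint` and `InteractiveUtils.code_llvm(io, f, types)`. This is a sketch: the helper `sig` is hypothetical, and the exact function names and allocation calls in the IR differ between Julia versions, so only the `define` line is pulled out here:

```julia
using InteractiveUtils  # provides code_llvm (already loaded in the REPL)

f1(x) where {T} = x*2   # unused static parameter T (newer Julia versions may warn)
f2(x) = x*2

# Capture the optimized LLVM IR as strings instead of printing it.
ir1 = sprint(code_llvm, f1, (Float64,))
ir2 = sprint(code_llvm, f2, (Float64,))

# Hypothetical helper: return the first `define` line, i.e. the LLVM signature.
# In the thread's output, f1 takes and returns boxed %jl_value_t pointers,
# while f2 takes and returns a plain `double`.
sig(ir) = first(filter(l -> startswith(l, "define"), split(ir, '\n')))

println(sig(ir1))
println(sig(ir2))
```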


I also find it strange, since

```
julia> @code_warntype f1(1.0)
Variables
  #self#::Core.Compiler.Const(f1, false)
  x::Float64

Body::Float64
1 ─ %1 = (x * 2)::Float64
└──      return %1

julia> @code_warntype f2(1.0)
Variables
  #self#::Core.Compiler.Const(f2, false)
  x::Float64

Body::Float64
1 ─ %1 = (x * 2)::Float64
└──      return %1
```
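The same agreement can be checked programmatically. This sketch uses `Base.return_types`, an unexported Base function, to show that inference reports `Float64` for both methods; if that holds, the difference seen in `@code_llvm` presumably arises at a later stage such as code generation:

```julia
f1(x) where {T} = x*2   # unused static parameter T
f2(x) = x*2

# Inferred return types for a Float64 argument; both should be [Float64].
rt1 = Base.return_types(f1, (Float64,))
rt2 = Base.return_types(f2, (Float64,))

println(rt1, " ", rt2)
```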

I can replicate this problem on `v"1.3.0-rc4.1"`. Also note that broadcasting is not needed for an MWE, as

```
julia> @btime f1(1.0)
  15.680 ns (1 allocation: 16 bytes)
2.0

julia> @btime f2(1.0)
  0.027 ns (0 allocations: 0 bytes)
2.0
```
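Without BenchmarkTools, a rough check is `@allocated` wrapped in small functions, called once as a warm-up so that compilation is not counted. The sketch also includes a hypothetical `f4` that binds `T` to the argument's own type, so that, as in `f3`, `T` is determined by the call. Whether `f4` (or `f1`) allocates on a particular Julia version is exactly what this check is meant to show; it is not a guarantee:

```julia
f1(x) where {T} = x*2           # T unused and not determined by the arguments
f2(x) = x*2
f3(::Type{T}, x) where {T} = x*2
f4(x::T) where {T} = x*2        # hypothetical: T is bound by x's type

# Measure inside functions so global-scope boxing is not counted.
allocs_f1(x) = @allocated f1(x)
allocs_f2(x) = @allocated f2(x)
allocs_f3(x) = @allocated f3(Bool, x)
allocs_f4(x) = @allocated f4(x)

for g in (allocs_f1, allocs_f2, allocs_f3, allocs_f4)
    g(1.0)                                  # warm-up: exclude compilation
    println(g, ": ", g(1.0), " bytes")
end
```

If `allocs_f1` reports a nonzero byte count while `allocs_f2` reports 0, that matches the 16-byte allocation in the `@btime f1(1.0)` result.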

Please check if there is an existing issue about this, and if not, open one.


Thanks for the feedback; I was using 1.3-rc2.

Yes, the broadcasting was a copy-paste leftover :wink:

I’ll check older Julia versions too and look through the issues then.


I suspect it is because `@code_warntype` is lying to you:
https://github.com/JuliaLang/julia/pull/32817


Alright, I could not find anything related, so I quickly opened an issue.

https://github.com/JuliaLang/julia/issues/33847

Update: I made a silly copy-and-paste mistake, which made me believe that it’s “OK” in Julia 0.7. It is not… Julia 0.7 shows the same output.
