Defining function as scalar vs fusing with two Ref's, significant speed difference

cosmia · September 16, 2021, 3:54pm

I wonder why there is such a significant performance difference between these two functions. The core functions are tt and tt1: in tt1 I use @. to broadcast the results, while in tt I write everything as scalars and call it with the dot syntax tt.(...).

Interestingly, the number of allocations and size is the same.

This speed difference does not happen if I only have one vector to Ref (i.e., if the functions don’t take Z as an argument). In my real application, X and Z are custom structs with many fields, so breaking them down into individual scalar arguments would not be practical.

using BenchmarkTools
function test(X,Z, Y)
    ret = Array{Float64,2}(undef,length(Y), 1000)
    for k = 1:1000
        ret[:,k] = tt.(Ref(X),Ref(Z),Y)
    end
    return ret
end
function test1(X,Z, Y)
    ret = Array{Float64,2}(undef, length(Y), 1000)
    for k = 1:1000
        ret[:,k] = tt1(X,Z,Y)
    end
    return ret
end
function tt(X, Z, Y)
    exp(X[1]*2 +  X[1] ^X[2] + Z[1]^Z[2]) + Y
end
function tt1(X, Z, Y)
    @. exp(X[1]*2 +  X[1] ^X[2] + Z[1]^Z[2]) + Y
end

Y=rand(10000);
@btime test([2.0, 5.0],[5.0, 10.0],Y); #2.137 s (2004 allocations: 152.66 MiB)
@btime test1([2.0, 5.0],[5.0, 10.0],Y); #28.084 ms (2004 allocations: 152.66 MiB)

Elrod · September 16, 2021, 6:26pm

LLVM should hoist X[1] ^ X[2] and Z[1] ^ Z[2] out of the broadcast loop in the test1 example, hence evaluating it only once per call to tt1.
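As a rough illustration of what the compiler gets to work with in tt1: @. adds dots to the function calls and operators but leaves the indexing expressions X[1], X[2], Z[1] and Z[2] alone, so they are read once as ordinary scalars before the fused loop over Y runs. The sketch below writes that out by hand. The name tt1_explicit is made up for this example, and the inputs are smaller than in the benchmark so that exp stays finite.

```julia
# tt1 as written in the original post.
function tt1(X, Z, Y)
    @. exp(X[1]*2 + X[1]^X[2] + Z[1]^Z[2]) + Y
end

# Hand-dotted equivalent (hypothetical name): only calls and operators
# get dots, the indexing X[1], X[2], Z[1], Z[2] stays scalar.
function tt1_explicit(X, Z, Y)
    exp.(X[1] .* 2 .+ X[1] .^ X[2] .+ Z[1] .^ Z[2]) .+ Y
end

X = [0.5, 2.0]; Z = [0.3, 1.5]; Y = [0.0, 1.0, 2.0]
@assert tt1(X, Z, Y) ≈ tt1_explicit(X, Z, Y)
```

In the tt.(Ref(X), Ref(Z), Y) version, by contrast, the indexing happens inside tt on every element, which is why exposing the body of tt with @inline matters.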

Try marking tt with @inline to expose its contents to the compiler, using tuples instead of arrays for X and Z, and finally using broadcasted assignment .=:

julia> function test3(X,Z, Y)
           ret = Array{Float64,2}(undef,length(Y), 1000)
           for k = 1:1000
               ret[:,k] .= tt.(Ref(X),Ref(Z),Y)
           end
           return ret
       end
test3 (generic function with 1 method)

julia> @inline function tt(X, Z, Y)
           exp(X[1]*2 +  X[1] ^X[2] + Z[1]^Z[2]) + Y
       end
tt (generic function with 1 method)

julia> @btime test3((2.0, 5.0), (5.0, 10.0), $Y);
  56.652 ms (2 allocations: 76.29 MiB)

This is faster for me than test1:

julia> @btime test([2.0, 5.0],[5.0, 10.0],Y); #2.137 s (2004 allocations: 152.66 MiB)
  1.093 s (4004 allocations: 152.66 MiB)

julia> @btime test1([2.0, 5.0],[5.0, 10.0],Y); #28.084 ms (2004 allocations: 152.66 MiB)
  66.859 ms (2004 allocations: 152.63 MiB)

Of course, it should also be possible to lift the exp part out of the loop, but the compiler doesn’t do that here.
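One way to get that by hand, if the compiler will not do it: everything inside exp is independent of Y, so it can be computed once per call and then added to Y. A minimal sketch, where scalar_part and test_hoisted are hypothetical names not used elsewhere in this thread, and the inputs are chosen small so that exp stays finite:

```julia
# Hypothetical helper: the Y-independent term, computed once.
scalar_part(X, Z) = exp(X[1]*2 + X[1]^X[2] + Z[1]^Z[2])

function test_hoisted(X, Z, Y)
    ret = Array{Float64,2}(undef, length(Y), 1000)
    c = scalar_part(X, Z)        # evaluated once, outside both loops
    for k = 1:1000
        ret[:, k] .= c .+ Y      # fused, in-place broadcast into column k
    end
    return ret
end

X = (0.5, 2.0); Z = (0.3, 1.5); Y = rand(100)
ret = test_hoisted(X, Z, Y)
@assert ret[:, 1] ≈ scalar_part(X, Z) .+ Y
```

This works the same way when X and Z are structs, as long as the Y-independent part can be factored into a function like scalar_part.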
