Confusion about performance when using the broadcasting macro @. vs explicit . operators

Hi everyone,

I am confused about the performance of the following code when using the @. macro as opposed to manually adding . to each operator. Below is an MWE:

function get_f(θ, ϕ)
    α0 = 1.5
    θ1 = π/4
    ϕ1 = π
    σ = 0.2

    return α0 .* exp.((-(θ .- θ1) .^ 2 .- (ϕ .- ϕ1) .^ 2) ./ (2 .* σ .^ 2))
end

function get_f1(θ, ϕ)
    α0 = 1.5
    θ1 = π/4
    ϕ1 = π
    σ = 0.2

    return @. α0 * exp((-(θ - θ1)^2 - (ϕ - ϕ1)^2) / (2 * σ^2))
end

nθ = nϕ = 100

θ = range(0, π, nθ)
ϕ = range(0, 2π, nϕ)

julia> @btime get_f($θ, $ϕ')
  69.291 μs (8 allocations: 79.95 KiB)

julia> @btime get_f1($θ, $ϕ')
  113.778 μs (8 allocations: 78.22 KiB)

My understanding is that the two functions should perform exactly the same, but the second one is almost 2x slower even though the number of allocations is the same. Is the @. macro broadcasting something extra that I have omitted in the first function? And why would the extra broadcasting slow down performance?

Thanks for the help.

nθ and nϕ are missing, and that'll make a difference in the benchmark. Regardless, you didn't dot the unary - call, so it materialized -(θ .- θ1) .^ 2 as an input for the outer fused broadcast. I'm not exactly sure why, because CPUs are strange beasts, but it's plausible that computing an element of a StepRangeLen took more time than loading an element of a Vector.
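
As a quick check, @macroexpand shows what @. produces (the printed form varies a bit across Julia versions):

julia> @macroexpand @. α0 * exp((-(θ - θ1)^2 - (ϕ - ϕ1)^2) / (2 * σ^2))
# expands to a fully dotted expression, equivalent to
#   α0 .* exp.((.-(θ .- θ1) .^ 2 .- (ϕ .- ϕ1) .^ 2) ./ (2 .* σ .^ 2))
# note the unary - becomes a dotted .- call, so the negation fuses into the loop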

I edited my post to include nθ and nϕ, although I think this shouldn't affect the results.

What does indeed affect them is dotting the unary - call, as you said. I am still perplexed as to why this makes such a difference, though, and what the optimal way of evaluating such expressions would be, given that manually adding so many dots is cumbersome, ugly and error-prone, while using the @. macro can lead to worse performance.
Here is the new version with the unary - dotted in .-(θ .- θ1) .^ 2, which confirms that this was the difference between get_f and get_f1.

function get_f2(θ, ϕ)
    α0 = 1.5
    θ1 = π/4
    ϕ1 = π
    σ = 0.2

    return α0 .* exp.((.-(θ .- θ1) .^ 2 .- (ϕ .- ϕ1) .^ 2) ./ (2 .* σ .^ 2))
end

julia> @btime get_f2($θ, $ϕ')
  114.791 μs (10 allocations: 78.34 KiB)

Note that full broadcasting isn't always the fastest option. With full dots, it computes 2*N^2 squared differences and N^2 exps. It can actually get away with N times fewer of each, because exp(-(Δθ² + Δϕ²)/(2σ²)) = exp(-Δθ²/(2σ²)) * exp(-Δϕ²/(2σ²)), so the 2D Gaussian factors into two 1D pieces; the only thing it needs N^2 of is multiplications. Your first version (with the missing .) partially realized this, which is why it was faster. Here's a version that goes further, spending a little extra memory to save a lot of computation:

function get_f3(θ, ϕ)
    α0 = 1.5
    θ1 = π/4
    ϕ1 = π
    σ = 0.2

    expΔθ²σ² = exp.(abs2.(θ .- θ1) ./ (-2 * σ ^ 2))
    expΔϕ²σ² = exp.(abs2.(ϕ .- ϕ1) ./ (-2 * σ ^ 2))

    return α0 .* expΔθ²σ² .* expΔϕ²σ² # could factor α0 into one of the other terms to save a tiny bit more
end

julia> @btime get_f($θ, $ϕ');
  31.066 μs (7 allocations: 80.02 KiB)

julia> @btime get_f1($θ, $ϕ');
  52.665 μs (3 allocations: 78.20 KiB)

julia> @btime get_f3($θ, $ϕ');
  4.783 μs (7 allocations: 80.03 KiB)

The only reliable effect of broadcast loop fusion is saving intermediate materializations, which usually allocate by falling back to Array. As you can see in your own benchmarks (I'm not sure what version of Julia you're on, because the numbers differ for me), the fully fused loops of get_f1 and get_f2 indeed allocated less memory than the interrupted loop in get_f.
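
As a quick illustration with a throwaway vector x, fusion saves exactly that intermediate Array and nothing more:

x = rand(100);
exp.(x .+ 1);            # fully fused: allocates only the final result
exp.(identity(x .+ 1));  # identity blocks fusion, so x .+ 1 materializes too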

Saving intermediate materializations or allocations is not guaranteed to save time, because of the operations being iterated (mikmoore demonstrates this perfectly) or the data structures involved. I could make a lazy vector structure that maliciously pauses execution (Libc.systemsleep) for a day per getindex call, so materializing or just collecting it into a Vector before a multidimensional broadcast loop would save a lot of time despite the extra allocation (see the sketch after the benchmarks below). Your use of StepRangeLen inputs is a far less extreme example of that, and it's not strictly slower than all Arrays:

julia> @btime get_f1($θ, $(ϕ')); # StepRangeLen inputs
  88.000 μs (3 allocations: 78.20 KiB)

julia> @btime get_f1($(collect(θ)), $(ϕ')); # eager Vector input saves time
  56.500 μs (3 allocations: 78.20 KiB)

julia> @btime get_f1($θ, $(collect(ϕ'))); # eager Matrix input takes longer
  233.500 μs (3 allocations: 78.20 KiB)

julia> @btime get_f1($(collect(θ)), $(collect(ϕ'))); # both eager, in between
  65.300 μs (3 allocations: 78.20 KiB)
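
To make the lazy-structure point concrete, here's a minimal sketch of that malicious vector; SlowVec is a made-up type, and a millisecond pause stands in for the day:

struct SlowVec <: AbstractVector{Float64}
    data::Vector{Float64}
end
Base.size(v::SlowVec) = size(v.data)
function Base.getindex(v::SlowVec, i::Int)
    Libc.systemsleep(0.001)  # stand-in for an expensive lazy element computation
    v.data[i]
end

v = SlowVec(rand(100)); w = rand(100);
v .+ w';           # fused loop: 100*100 = 10_000 slow getindex calls on v
collect(v) .+ w';  # materialize v first: only 100 slow calls, one extra allocation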

The broadcast loop isn't doing linear indexing, but benchmarking it shows how much longer computing an element of a StepRangeLen takes than loading one from a Vector, even at their fastest:

julia> @btime $(range(0.0, π, 8))[$3] # StepRangeLen indexing
  9.400 ns (0 allocations: 0 bytes)
0.8975979010256552

julia> @btime $(collect(range(0.0, π, 8)))[$3] # Vector indexing
  2.000 ns (0 allocations: 0 bytes)
0.8975979010256552

Thank you both for the detailed and helpful answers. I don’t fully grasp what is happening yet, but I have gotten some intuition.

The two main takeaways for me are that, by fusing all the operations, (1) I was redundantly applying the scalar operations to every element of the 2D matrices, and (2) I was operating directly on StepRangeLens instead of on intermediate allocated arrays. Both of these hurt performance because of the increased number and cost of operations, even though the number of allocations was lower. This makes sense.

I would like to add a general comment, though. In most discussions of Julia performance, broadcasting operations and using ranges instead of arrays are among the most common suggestions. This example made me realise that this kind of advice can lead to worse performance if followed carelessly, and that the general notion that "reducing allocations improves performance" is not as universal as I thought it was.


Indeed, ranges are often slower to access than an Array (at least when the arrays are small-ish and accessed in a predictable order). There are ways to make ranges somewhat faster, however.

A big part of the cost of ranges is that they usually use TwicePrecision, which gives extra accuracy (and is critical to making things like 0:0.1:1 behave how you'd hope – try collect(StepRangeLen(0.0, 0.1, 11)) for comparison). But one can give this up to make things a bit faster.
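
For instance, the accuracy difference shows up immediately:

julia> (0:0.1:1)[4] # TwicePrecision: the Float64 closest to 3/10
0.3

julia> StepRangeLen(0.0, 0.1, 11)[4] # plain Float64 step arithmetic drifts
0.30000000000000004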

julia> x = range(0, π, 100); y = StepRangeLen(first(x), step(x), length(x)); z = collect(x);

julia> typeof(x)
StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}

julia> typeof(y) # no TwicePrecision
StepRangeLen{Float64, Float64, Float64, Int64}

julia> @btime sum(a for a in $x) # use generator to block other optimizations
  134.046 ns (0 allocations: 0 bytes)
157.07963267948963

julia> @btime sum(a for a in $y) # use generator to block other optimizations
  45.063 ns (0 allocations: 0 bytes)
157.07963267948963

julia> @btime sum(a for a in $z) # use generator to block other optimizations
  19.002 ns (0 allocations: 0 bytes)
157.07963267948963

julia> @btime sum($z) # optimized Array method using SIMD
  5.862 ns (0 allocations: 0 bytes)
157.07963267948966

julia> @btime sum(collect($x)) # the worst of both worlds
  146.022 ns (2 allocations: 928 bytes)
157.07963267948966

julia> @btime sum(collect($y)) # the worst of both worlds
  78.811 ns (2 allocations: 928 bytes)
157.07963267948966

Without TwicePrecision, range indexing is considerably faster (the indexed values are not exactly the same, although the sums happen to coincide here). Also note that collecting a range and then iterating the resulting Array once will virtually always be slower than simply iterating the range. To see a speed benefit, you'd need to collect it once and then iterate it many times.
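
A sketch of that pattern (many_sums is a made-up example, just to show hoisting the collect):

function many_sums(x, ntrials)
    xs = collect(x)                  # pay the allocation once, up front
    total = 0.0
    for _ in 1:ntrials
        total += sum(a for a in xs)  # fast Vector iteration every time
    end
    return total
end

many_sums(range(0, π, 100), 1000)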

The reason ranges are usually recommended is that they are memory-free and that iterating them is usually either 1) much less expensive than other things you’re doing or 2) only done a few times.


Well put. Broadcasting is ultimately a hot loop, and while customized broadcasting for some data structures* or loop-invariant code motion can automatically reduce the operations in that loop, we still have to do some optimization ourselves.

*An example that handles broadcasting over AbstractRanges eagerly instead of putting the work in the loop:

julia> begin
           fcount::Int128 = 0 # track how many times f is called
           function f(a::Number, b::Number)
               global fcount += 1
               a+b
           end
           function Broadcast.broadcasted(::Broadcast.DefaultArrayStyle{1}, ::typeof(f), r1::AbstractRange, r2::AbstractRange)
               global fcount += 1
               r1+r2
           end
       end

julia> fcount=0; f.(1:5, f.(1:5, collect(1:5))); fcount # fused to f(x, f(y, z)) in loop
10

julia> fcount=0; f.(f.(1:5, 1:5), collect(1:5)); fcount # automatically unfused
6

julia> fcount=0; f.(identity(f.(1:5, 1:5)), collect(1:5)); fcount # manually unfused
6