I came across the following example when I needed something that beats the performance of LinearAlgebra’s built-in mul!(M, x, y'). Is there any known reason for this huge difference in performance? Tested on 11.6.
using BenchmarkTools
function outer1!(output, x, y)
    for j in eachindex(y)
        @. (view(output, :, j)) = x * y[j]
    end
    return output
end

function outer2!(output, x, y)
    for j in eachindex(y)
        v = view(output, :, j)
        @. (v) = x * y[j]
    end
    return output
end
x = rand(1000); y = rand(1000); M = rand(1000,1000);
outer1!(M, x, y) ≈ outer2!(M, x, y)
#true
@btime outer1!(M, x, y);
# ~10 ns
@btime outer2!(M, x, y);
# ~200 μs
Nah, I found the issue. The first implementation doesn’t do what I hoped it does. The bogus true comparison is because I used the same output buffer for both calls instead of separate copies.
Yeah it basically ran faster because it didn’t do anything:
julia> outer1!(copy(M), x, y) ≈ M
true
julia> M ≈ outer2!(copy(M), x, y)
false
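For completeness, here is a quick sketch (my own, with the hypothetical name outer_fixed!) showing that writing the broadcast assignment with explicit dots, instead of @., behaves as intended and matches mul!:

```julia
using LinearAlgebra

# Hypothetical corrected kernel: explicit .= broadcasts into the view
# instead of accidentally defining a method named `view`.
function outer_fixed!(output, x, y)
    for j in eachindex(y)
        view(output, :, j) .= x .* y[j]
    end
    return output
end

x = rand(1000); y = rand(1000)
M1 = zeros(1000, 1000); M2 = zeros(1000, 1000)
outer_fixed!(M1, x, y)
mul!(M2, x, y')   # LinearAlgebra's outer product for reference
M1 ≈ M2           # the two results agree
```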
Looks like a @. failure: instead of broadcasting the assignment, it turns it into a method definition expression. This example of overlapping syntax causing a silent error probably counts as a WAT, so I suggest adding a wat tag for future discoverability.
julia> @macroexpand @. (view(output, :, j)) = x * y[j]
:(view(output, :, j) = begin
          #= REPL[51]:1 =#
          (*).(x, y[j])
      end)
julia> let
           view(output, :, j) = begin
               #= REPL[51]:1 =#
               (*).(x, y[j])
           end
       end
(::var"#view#3") (generic function with 1 method)
julia> @macroexpand @. (v) = x * y[j]
:(v .= (*).(x, y[j]))
julia> @macroexpand (view(output, :, j)) .= x .* y[j] # manual fix
:(view(output, :, j) .= x .* y[j])
Side remark about the macro expansion: I was always under the vague impression that @. is the only way to fuse broadcasts, which otherwise remain separate. Yet it seems to expand (in your second-to-last example) to what I had thought was not fused. So is the snippet v .= x .* y fused, or does it allocate a temporary array for the multiplication?
Sequential dots fuse, and @. is actually rarer than manually writing dots. It is sometimes faster to not fuse everything into 1 loop despite the extra allocations, and @. can’t make that call:
An explicit 2-level loop would let you hoist the longid call into a temporary variable in the outer loop, saving the time in the 1st example and the allocation in the 2nd example. Broadcasting is limited to calling the fused kernel in an apparent innermost loop, but it also automatically handles indexing order and expands singleton dimensions, which may be easier to maintain and applies to more inputs.
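For the outer-product case from this thread, the explicit 2-level loop looks something like this (a sketch, with the hypothetical name outer_loops!): the outer loop hoists the per-column scalar, and the inner loop writes in column-major order.

```julia
# Explicit two-level loop for output[i, j] = x[i] * y[j].
function outer_loops!(output, x, y)
    @inbounds for j in eachindex(y)
        yj = y[j]                 # hoisted once per column
        for i in eachindex(x)
            output[i, j] = x[i] * yj
        end
    end
    return output
end
```

Unlike broadcasting, this version is tied to one indexing order and one input shape, which is the maintenance trade-off mentioned above.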
You can check what an expression does if it gets a bit complicated:
julia> Meta.@lower v .= x .* y # fused before 1 materialize
:($(Expr(:thunk, CodeInfo(
@ none within `top-level scope`
1 ─ %1 = v
│ %2 = Base.broadcasted(*, x, y)
│ %3 = Base.materialize!(%1, %2)
└── return %3
))))
julia> Meta.@lower v .= identity(x .* y) # 2 separate materialize
:($(Expr(:thunk, CodeInfo(
@ none within `top-level scope`
1 ─ %1 = v
│ %2 = identity
│ %3 = Base.broadcasted(*, x, y)
│ %4 = Base.materialize(%3)
│ %5 = (%2)(%4)
│ %6 = Base.broadcasted(Base.identity, %5)
│ %7 = Base.materialize!(%1, %6)
└── return %7
))))
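A rough way to see the difference at runtime (my sketch, with the hypothetical names fused! and unfused!) is to compare allocations: the fused form writes into v directly, while wrapping the product in identity materializes a temporary array first.

```julia
# Fused: one materialize! into v, no temporary.
fused!(v, x, y) = (v .= x .* y)
# Unfused: identity forces materialize of x .* y into a temporary first.
unfused!(v, x, y) = (v .= identity(x .* y))

v = zeros(1000); x = rand(1000); y = rand(1000)
fused!(v, x, y); unfused!(v, x, y)   # warm up so compilation isn't counted
a1 = @allocated fused!(v, x, y)
a2 = @allocated unfused!(v, x, y)
a1 < a2   # the unfused version pays for the temporary array
```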