Semantically equivalent expressions with wildly different performance

I came across the following example when I needed something that beats the performance of LinearAlgebra’s built-in mul!(M, x, y'). Any known reason for this huge difference in performance? Tested on Julia 1.11.6.

using BenchmarkTools
function outer1!(output, x, y)
    for j in eachindex(y)
        @. (view(output, :, j)) = x *  y[j]
    end
    return output
end

function outer2!(output, x, y)
    for j in eachindex(y)
        v = view(output, :, j)
        @. (v) = x *  y[j]
    end
    return output
end

x = rand(1000); y = rand(1000); M = rand(1000,1000);
outer1!(M, x, y) ≈   outer2!(M, x, y) 
#true

@btime outer1!(M, x, y);
# ~10 ns
@btime outer2!(M, x, y);
# ~200 μs
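
For reference, the mul! baseline mentioned above can be timed the same way (a sketch, timings omitted):

using LinearAlgebra
@btime mul!($M, $x, $(y'));  # built-in outer product, overwrites M with x * y'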

If I understand this correctly (and I might be wrong about this as about so much else):

outer1! (fast): When @. sees the inline view(output, :, j), it can perform much more aggressive optimization, possibly:

  • Eliminating the view allocation entirely
  • Directly generating optimized assembly for the memory access pattern
  • Better vectorization since the compiler sees the full expression structure

outer2! (slow): Pre-computing the view actually prevents these optimizations:

  • The view object must be created and stored
  • The @. macro works on a pre-existing view, limiting optimization opportunities
  • The compiler can’t see through the abstraction as effectively

Nah, I found the issue. The first implementation doesn’t do what I hoped it would. The bogus true comparison is because I used the same output buffer… instead:

x = rand(1000); y = rand(1000); M1 = zeros(1000,1000); M2 = copy(M1);
outer1!(M1, x, y) ≈   outer2!(M2, x, y) # false

Whoopsy!

Yeah it basically ran faster because it didn’t do anything:

julia> outer1!(copy(M), x, y) ≈ M
true

julia> M ≈ outer2!(copy(M), x, y)
false

Looks like a @. failure: the macro turns the assignment into a method definition expression. This example of overlapping syntax causing a silent error probably counts as a WAT, so I suggest adding a wat tag for future discoverability.

julia> @macroexpand @. (view(output, :, j)) = x *  y[j]
:(view(output, :, j) = begin
          #= REPL[51]:1 =#
          (*).(x, y[j])
      end)

julia> let
       view(output, :, j) = begin
               #= REPL[51]:1 =#
               (*).(x, y[j])
           end
       end
(::var"#view#3") (generic function with 1 method)

julia> @macroexpand @. (v) = x *  y[j]
:(v .= (*).(x, y[j]))

julia> @macroexpand (view(output, :, j)) .= x .*  y[j] # manual fix
:(view(output, :, j) .= x .* y[j])
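
Spelled out, the repaired outer1! is just the following (a sketch; outer2! was already fine, since @. on the bare variable v produces the intended v .= ...):

function outer1!(output, x, y)
    for j in eachindex(y)
        view(output, :, j) .= x .* y[j]  # write the dots manually; no @. around a call on the LHS
    end
    return output
end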

A fix after which:

julia> outer1!(copy(M), x, y) ≈ outer2!(copy(M), x, y)
true

julia> @btime outer1!($(copy(M)), $x, $y);
  248.700 μs (0 allocations: 0 bytes)

julia> @btime outer2!($(copy(M)), $x, $y);
  249.000 μs (0 allocations: 0 bytes)

Benchmarks that mutate their inputs are generally a bad idea, but it should be fine for the runtimes here because the work done does not depend on the values stored in the buffer (there are no conditionals).
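
If the buffer contents did matter, BenchmarkTools can rebuild them before every evaluation via the setup and evals keywords (out is just an illustrative name):

@btime outer1!(out, $x, $y) setup=(out = copy($M)) evals=1;  # fresh copy for each evaluation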

It’s not clear to me what the dotted view

view.(output, :, j)

ought to mean. Does it create a view for each element in output? But in that case, why doesn’t it fail with a bounds error for j > 1?

There isn’t a dotted view; @. leaves a call expression on the left-hand side of = undotted (and keeps the =) in order to support method definitions.

Huh? Since when does @. ignore the left-hand side? Sorry for the sidetrack, but I’ve never heard anything like that. I thought it dotted every call.

I find no mention of this in the docs.

It’s not documented, but it’s how it’s implemented.
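
The transformation lives in Base.Broadcast.__dot__, which the @. macro applies to its argument; you can see the intended use with any throwaway definition (f here is just a placeholder name):

@macroexpand @. f(x) = x + 1  # kept as a method definition whose body becomes (+).(x, 1), not turned into a broadcast assignment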

I made a wildly uneducated guess.

Great spot and thanks for the WAT tag suggestion.

Side remark about the macro expansion: I was always under the vague impression that @. is the only way to fuse broadcasts, which otherwise remain separate. Yet in your second-to-last example it expands to what I had thought was not fused. So is the snippet v .= x .* y fused, or does it allocate a temporary array for the multiplication?

Sequential dots fuse, and @. is actually used less often than writing the dots manually. It is sometimes faster not to fuse everything into one loop despite the extra allocations, and @. can’t make that call:

julia> longid(x) = (Libc.systemsleep(0.5); x) # expensive call
longid (generic function with 1 method)

julia> @time reshape(1:2, 1, 2) .* longid.(1:2) # fused, only allocates output
  2.044504 seconds (2 allocations: 112 bytes)
2×2 Matrix{Int64}:
 1  2
 2  4

julia> @time reshape(1:2, 1, 2) .* identity(longid.(1:2)) # not fused, also allocates intermediate
  1.012347 seconds (4 allocations: 192 bytes)
2×2 Matrix{Int64}:
 1  2
 2  4

An explicit 2-level loop would let you store the longid call in a temporary variable in the outer loop, saving the time in the 1st example and the allocation in the 2nd (a sketch follows below). Broadcasting is limited to calling the fused kernel in what is effectively the innermost loop, but in exchange it automatically handles indexing order and expands singleton dimensions, which can be easier to maintain and applies to more inputs.
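
For concreteness, here is a sketch of that explicit 2-level loop for the toy example above (outer_times is an illustrative name, not something from Base):

function outer_times(rowvec, xs)
    out = Matrix{Int}(undef, length(xs), length(rowvec))
    for i in eachindex(xs)
        t = longid(xs[i])              # expensive call hoisted: evaluated once per i
        for j in eachindex(rowvec)
            out[i, j] = rowvec[j] * t  # same values as reshape(rowvec, 1, :) .* longid.(xs)
        end
    end
    return out
end

outer_times(1:2, 1:2)  # ~1 s total (two longid calls) and only the output allocation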

You can check what an expression does if it gets a bit complicated:

julia> Meta.@lower v .= x .* y # fused into 1 materialize!
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─ %1 = v
│   %2 = Base.broadcasted(*, x, y)
│   %3 = Base.materialize!(%1, %2)
└──      return %3
))))

julia> Meta.@lower v .= identity(x .* y) # not fused: 2 separate materialize steps
:($(Expr(:thunk, CodeInfo(
    @ none within `top-level scope`
1 ─ %1 = v
│   %2 = identity
│   %3 = Base.broadcasted(*, x, y)
│   %4 = Base.materialize(%3)
│   %5 = (%2)(%4)
│   %6 = Base.broadcasted(Base.identity, %5)
│   %7 = Base.materialize!(%1, %6)
└──      return %7
))))