I came across the following example when I needed something that beats the performance of LinearAlgebra’s built-in mul!(M, x, y'). Is there any known reason for this huge difference in performance? Tested on 11.6.
using BenchmarkTools
function outer1!(output, x, y)
    for j in eachindex(y)
        @. (view(output, :, j)) = x * y[j]
    end
    return output
end

function outer2!(output, x, y)
    for j in eachindex(y)
        v = view(output, :, j)
        @. (v) = x * y[j]
    end
    return output
end
x = rand(1000); y = rand(1000); M = rand(1000,1000);
outer1!(M, x, y) ≈ outer2!(M, x, y)
#true
@btime outer1!(M, x, y);
# ~10 ns
@btime outer2!(M, x, y);
# ~200 μs
Nah, I found the issue. The first implementation doesn’t do what I hoped it does. The bogus true comparison is because I used the same output buffer for both calls instead of separate copies.
Yeah it basically ran faster because it didn’t do anything:
julia> outer1!(copy(M), x, y) ≈ M
true
julia> M ≈ outer2!(copy(M), x, y)
false
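For completeness, here is a quick sketch (my own, with the hypothetical name outer_fixed!) showing that writing the broadcast assignment with explicit dots, instead of @., behaves as intended and matches mul!:

```julia
using LinearAlgebra

# Hypothetical corrected kernel: explicit .= broadcasts into the view
# instead of accidentally defining a method named `view`.
function outer_fixed!(output, x, y)
    for j in eachindex(y)
        view(output, :, j) .= x .* y[j]
    end
    return output
end

x = rand(1000); y = rand(1000)
M1 = zeros(1000, 1000); M2 = zeros(1000, 1000)
outer_fixed!(M1, x, y)
mul!(M2, x, y')   # LinearAlgebra's outer product for reference
M1 ≈ M2           # the two results agree
```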
Looks like a @. failure: instead of broadcasting the assignment, it turns it into a method definition expression. This example of overlapping syntax causing a silent error probably counts as a WAT, so I suggest adding a wat tag for future discoverability.
julia> @macroexpand @. (view(output, :, j)) = x * y[j]
:(view(output, :, j) = begin
          #= REPL[51]:1 =#
          (*).(x, y[j])
      end)
julia> let
           view(output, :, j) = begin
               #= REPL[51]:1 =#
               (*).(x, y[j])
           end
       end
(::var"#view#3") (generic function with 1 method)
julia> @macroexpand @. (v) = x * y[j]
:(v .= (*).(x, y[j]))
julia> @macroexpand (view(output, :, j)) .= x .* y[j] # manual fix
:(view(output, :, j) .= x .* y[j])
Side remark about the macro expansion: I was always under the vague impression that @. is the only way to fuse broadcasts, which otherwise remain separate. Yet it seems to expand (in your second-to-last example) to what I had thought was not fused. So is the snippet v .= x .* y fused, or does it allocate a temporary array for the multiplication?
Sequential dots fuse, and @. is actually rarer than manually writing dots. It is sometimes faster to not fuse everything into 1 loop despite the extra allocations, and @. can’t make that call:
An explicit 2-level loop would let you hoist the longid call into a temporary variable in the outer loop, saving the time in the 1st example and the allocation in the 2nd example. Broadcasting is limited to calling the fused kernel in an apparent innermost loop, but it also automatically handles indexing order and expands singleton dimensions, which may be easier to maintain and applies to more inputs.
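For the outer-product case from this thread, the explicit 2-level loop looks something like this (a sketch, with the hypothetical name outer_loops!): the outer loop hoists the per-column scalar, and the inner loop writes in column-major order.

```julia
# Explicit two-level loop for output[i, j] = x[i] * y[j].
function outer_loops!(output, x, y)
    @inbounds for j in eachindex(y)
        yj = y[j]                 # hoisted once per column
        for i in eachindex(x)
            output[i, j] = x[i] * yj
        end
    end
    return output
end
```

Unlike broadcasting, this version is tied to one indexing order and one input shape, which is the maintenance trade-off mentioned above.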
You can check what an expression does if it gets a bit complicated:
julia> Meta.@lower v .= x .* y # fused before 1 materialize
:($(Expr(:thunk, CodeInfo(
@ none within `top-level scope`
1 ─ %1 = v
│ %2 = Base.broadcasted(*, x, y)
│ %3 = Base.materialize!(%1, %2)
└── return %3
))))
julia> Meta.@lower v .= identity(x .* y) # 2 separate materialize
:($(Expr(:thunk, CodeInfo(
@ none within `top-level scope`
1 ─ %1 = v
│ %2 = identity
│ %3 = Base.broadcasted(*, x, y)
│ %4 = Base.materialize(%3)
│ %5 = (%2)(%4)
│ %6 = Base.broadcasted(Base.identity, %5)
│ %7 = Base.materialize!(%1, %6)
└── return %7
))))
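A rough way to see the difference at runtime (my sketch, with the hypothetical names fused! and unfused!) is to compare allocations: the fused form writes into v directly, while wrapping the product in identity materializes a temporary array first.

```julia
# Fused: one materialize! into v, no temporary.
fused!(v, x, y) = (v .= x .* y)
# Unfused: identity forces materialize of x .* y into a temporary first.
unfused!(v, x, y) = (v .= identity(x .* y))

v = zeros(1000); x = rand(1000); y = rand(1000)
fused!(v, x, y); unfused!(v, x, y)   # warm up so compilation isn't counted
a1 = @allocated fused!(v, x, y)
a2 = @allocated unfused!(v, x, y)
a1 < a2   # the unfused version pays for the temporary array
```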