I haven’t read the entire thread, but here’s what I think is a more compact example demonstrating the slowdown:
using CUDA, NVTX

function main()
    A = CUDA.rand(151, 151, 151) .+ 1
    B = CUDA.rand(151, 151, 151) .+ 1
    C = CUDA.zeros(151, 151, 151) .+ 1

    # a single scalar function, broadcast over the arrays
    inner(a, b) = a^2 + b^2 + a * b + a / b

    NVTX.@range "fast" begin
        for iter = 1:1000
            C .= inner.(A, B)
        end
    end
    NVTX.@range "slow" begin
        for iter = 1:1000
            # the same arithmetic written as a fused dotted expression
            C .= A.^2 .+ B.^2 .+ A .* B .+ A ./ B
        end
    end
end
julia> CUDA.@profile main()
Profiler ran for 83.79 ms, capturing 226207 events.
NVTX ranges:
┌──────────┬────────────┬───────┬───────────┐
│ Time (%) │ Total time │ Calls │ Name      │
├──────────┼────────────┼───────┼───────────┤
│   43.50% │   36.45 ms │     1 │ Main.slow │
│   22.02% │   18.45 ms │     1 │ Main.fast │
└──────────┴────────────┴───────┴───────────┘
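For a quick wall-clock check outside the profiler, the two variants can also be timed directly. This is just a sketch, not part of the measurements above; the warm-up calls and the t_fast/t_slow names are illustrative. CUDA.@elapsed waits on GPU events, so the kernel time is included:

using CUDA

A = CUDA.rand(151, 151, 151) .+ 1
B = CUDA.rand(151, 151, 151) .+ 1
C = CUDA.zeros(151, 151, 151)
inner(a, b) = a^2 + b^2 + a * b + a / b

# warm up / compile both kernels before timing
CUDA.@elapsed C .= inner.(A, B)
CUDA.@elapsed C .= A.^2 .+ B.^2 .+ A .* B .+ A ./ B

t_fast = CUDA.@elapsed for _ in 1:1000
    C .= inner.(A, B)
end
t_slow = CUDA.@elapsed for _ in 1:1000
    C .= A.^2 .+ B.^2 .+ A .* B .+ A ./ B
end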
The source of the issue seems to be the code generated for the fused broadcast expression. I can reproduce the same slowdown on the CPU:
julia> A = rand(151,151,151) .+ 1;
julia> B = rand(151,151,151) .+ 1;
julia> C = zeros(151,151,151) .+ 1;
julia> using Chairmarks
julia> inner(A, B) = A^2 + B^2 + A * B + A / B
julia> @b C .= inner.(A, B)
922.458 μs (5 allocs: 144 bytes)
julia> @b C .= A.^2 .+ B.^2 .+ A .* B .+ A ./ B
3.257 ms (23 allocs: 864 bytes)
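One way to narrow this down further on the CPU (a sketch, under the assumption that the nested Broadcasted tree built by the dotted expression is the culprit rather than the arithmetic itself): broadcast a single anonymous function over the same expression and compare it against the fused dotted form, and inspect the lowered code of both.

using Chairmarks

# if this matches the inner.(A, B) timing, the extra cost comes from how the
# nested Broadcasted tree of the dotted expression is lowered, not from the math
@b C .= ((a, b) -> a^2 + b^2 + a * b + a / b).(A, B)

# compare the lowered code of the two forms
Meta.@lower C .= inner.(A, B)
Meta.@lower C .= A.^2 .+ B.^2 .+ A .* B .+ A ./ B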