I haven't read the entire thread, but here's what I think is a more compact example demonstrating the slowdown:
using CUDA, NVTX

function main()
    A = CUDA.rand(151,151,151) .+ 1
    B = CUDA.rand(151,151,151) .+ 1
    C = CUDA.zeros(151,151,151) .+ 1

    # "fast": the whole expression lives in a scalar helper that is broadcast once
    NVTX.@range "fast" begin
        for iter = 1:1000
            inner(A, B) = A^2 + B^2 + A * B + A / B
            C .= inner.(A, B)
        end
    end

    # "slow": the same arithmetic written as a dotted (fused) broadcast expression
    NVTX.@range "slow" begin
        for iter = 1:1000
            C .= A.^2 .+ B.^2 .+ A .* B .+ A ./ B
        end
    end
end
julia> CUDA.@profile main()
Profiler ran for 83.79 ms, capturing 226207 events.
NVTX ranges:
┌──────────┬────────────┬───────┬───────────┐
│ Time (%) │ Total time │ Calls │ Name      │
├──────────┼────────────┼───────┼───────────┤
│   43.50% │   36.45 ms │     1 │ Main.slow │
│   22.02% │   18.45 ms │     1 │ Main.fast │
└──────────┴────────────┴───────┴───────────┘
The source of the issue seems to be the code generated by broadcast, not anything GPU-specific. I can reproduce the same slowdown on the CPU:
julia> A = rand(151,151,151) .+ 1;
julia> B = rand(151,151,151) .+ 1;
julia> C = zeros(151,151,151) .+ 1;
julia> using Chairmarks
julia> inner(A, B) = A^2 + B^2 + A * B + A / B
julia> @b C .= inner.(A, B)
922.458 μs (5 allocs: 144 bytes)
julia> @b C .= A.^2 .+ B.^2 .+ A .* B .+ A ./ B
3.257 ms (23 allocs: 864 bytes)
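
If anyone wants to dig into what broadcast generates here, the following is a minimal sketch of a starting point, not something I measured above. It uses small CPU arrays (the 8×8×8 size and the names `C_fused`, `C_dotted` and `lowered` are my own choices for illustration), checks that both spellings compute the same values, and prints the lowered form of the dotted expression so the nested `broadcasted` calls can be inspected:

```julia
# Small CPU arrays; the size is arbitrary and only for illustration.
A = rand(8, 8, 8) .+ 1
B = rand(8, 8, 8) .+ 1

# Scalar helper, same arithmetic as in the example above.
inner(a, b) = a^2 + b^2 + a * b + a / b

# Variant 1: broadcast the scalar helper once.
C_fused = inner.(A, B)

# Variant 2: the same arithmetic as a dotted broadcast expression.
C_dotted = A.^2 .+ B.^2 .+ A .* B .+ A ./ B

# Both variants should agree up to floating-point rounding.
@assert C_fused ≈ C_dotted

# Show how the dotted expression is lowered; this does not evaluate it,
# it only prints the nested Base.broadcasted calls that get built.
lowered = Meta.@lower A.^2 .+ B.^2 .+ A .* B .+ A ./ B
println(lowered)
```

Comparing that lowered output against `Meta.@lower inner.(A, B)` may help narrow down where the two paths diverge, though I have not verified that this is where the extra time goes.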