GPU Julia vs GPU Matlab

I haven’t read the entire thread, but here’s what I think is a more compact example demonstrating the slowdown:

using CUDA, NVTX

function main()
    A = CUDA.rand(151,151,151) .+ 1
    B = CUDA.rand(151,151,151) .+ 1
    C = CUDA.zeros(151,151,151) .+ 1

    NVTX.@range "fast" begin
        for iter = 1:1000
            inner(A, B) = A^2 + B^2 + A * B + A / B
            C .= inner.(A, B)
        end
    end

    NVTX.@range "slow" begin
        for iter = 1:1000
            C .= A.^2 .+ B.^2 .+ A .* B .+ A ./ B
        end
    end
end

julia> CUDA.@profile main()
Profiler ran for 83.79 ms, capturing 226207 events.

NVTX ranges:
┌──────────┬────────────┬───────┬───────────┐
│ Time (%) │ Total time │ Calls │ Name      │
├──────────┼────────────┼───────┼───────────┤
│   43.50% │   36.45 ms │     1 │ Main.slow │
│   22.02% │   18.45 ms │     1 │ Main.fast │
└──────────┴────────────┴───────┴───────────┘

The source of the issue seems to be the code generated by broadcast. I can reproduce the same slowdown on the CPU:

julia> A = rand(151,151,151) .+ 1;
julia> B = rand(151,151,151) .+ 1;
julia> C = zeros(151,151,151) .+ 1;

julia> using Chairmarks

julia> inner(A, B) = A^2 + B^2 + A * B + A / B
julia> @b C .= inner.(A, B)
922.458 μs (5 allocs: 144 bytes)

julia> @b C .= A.^2 .+ B.^2 .+ A .* B .+ A ./ B
3.257 ms (23 allocs: 864 bytes)
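
For anyone who wants to sanity-check that the two variants really compute the same thing before comparing timings, here is a minimal CPU-only sketch. It assumes nothing beyond Base Julia; the smaller array size and the names `C1`, `C2`, `x`, `y` are my own choices for illustration, not part of the original reproducer.

```julia
# Scalar kernel: the same expression as above, written on scalars x and y.
inner(x, y) = x^2 + y^2 + x * y + x / y

# Smaller arrays than in the post so the check runs quickly (an assumption).
A = rand(32, 32, 32) .+ 1
B = rand(32, 32, 32) .+ 1
C1 = similar(A)
C2 = similar(A)

# Variant 1: broadcast a single scalar function over both arrays.
C1 .= inner.(A, B)

# Variant 2: the fully dotted expression, fused into one broadcast.
C2 .= A.^2 .+ B.^2 .+ A .* B .+ A ./ B

# Both variants should agree up to floating-point rounding.
same = C1 ≈ C2
println("results agree: ", same)
```

If the results agree, any timing gap between the two forms comes from how the broadcast is compiled rather than from different arithmetic, which is consistent with the measurements above.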

@maleadt Perhaps create an issue?
