I haven’t read the entire thread, but here’s what I think is a more compact example demonstrating the slowdown:
using CUDA, NVTX

function main()
    A = CUDA.rand(151, 151, 151) .+ 1
    B = CUDA.rand(151, 151, 151) .+ 1
    C = CUDA.zeros(151, 151, 151) .+ 1

    # a single scalar function, broadcast over the arrays
    inner(a, b) = a^2 + b^2 + a * b + a / b

    NVTX.@range "fast" begin
        for iter = 1:1000
            C .= inner.(A, B)
        end
    end
    NVTX.@range "slow" begin
        for iter = 1:1000
            # the same arithmetic written as a fused dotted expression
            C .= A.^2 .+ B.^2 .+ A .* B .+ A ./ B
        end
    end
end
julia> CUDA.@profile main()
Profiler ran for 83.79 ms, capturing 226207 events.
NVTX ranges:
┌──────────┬────────────┬───────┬───────────┐
│ Time (%) │ Total time │ Calls │ Name      │
├──────────┼────────────┼───────┼───────────┤
│   43.50% │   36.45 ms │     1 │ Main.slow │
│   22.02% │   18.45 ms │     1 │ Main.fast │
└──────────┴────────────┴───────┴───────────┘
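For a quick wall-clock check outside the profiler, the two variants can also be timed directly. This is just a sketch, not part of the measurements above; the warm-up calls and the t_fast/t_slow names are illustrative. CUDA.@elapsed waits on GPU events, so the kernel time is included:

using CUDA

A = CUDA.rand(151, 151, 151) .+ 1
B = CUDA.rand(151, 151, 151) .+ 1
C = CUDA.zeros(151, 151, 151)
inner(a, b) = a^2 + b^2 + a * b + a / b

# warm up / compile both kernels before timing
CUDA.@elapsed C .= inner.(A, B)
CUDA.@elapsed C .= A.^2 .+ B.^2 .+ A .* B .+ A ./ B

t_fast = CUDA.@elapsed for _ in 1:1000
    C .= inner.(A, B)
end
t_slow = CUDA.@elapsed for _ in 1:1000
    C .= A.^2 .+ B.^2 .+ A .* B .+ A ./ B
end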
The source of the issue seems to be the code generated for the fused broadcast expression. I can reproduce the same slowdown on the CPU:
julia> A = rand(151,151,151) .+ 1;
julia> B = rand(151,151,151) .+ 1;
julia> C = zeros(151,151,151) .+ 1;
julia> using Chairmarks
julia> inner(A, B) = A^2 + B^2 + A * B + A / B
julia> @b C .= inner.(A, B)
922.458 μs (5 allocs: 144 bytes)
julia> @b C .= A.^2 .+ B.^2 .+ A .* B .+ A ./ B
3.257 ms (23 allocs: 864 bytes)
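One way to narrow this down further on the CPU (a sketch, under the assumption that the nested Broadcasted tree built by the dotted expression is the culprit rather than the arithmetic itself): broadcast a single anonymous function over the same expression and compare it against the fused dotted form, and inspect the lowered code of both.

using Chairmarks

# if this matches the inner.(A, B) timing, the extra cost comes from how the
# nested Broadcasted tree of the dotted expression is lowered, not from the math
@b C .= ((a, b) -> a^2 + b^2 + a * b + a / b).(A, B)

# compare the lowered code of the two forms
Meta.@lower C .= inner.(A, B)
Meta.@lower C .= A.^2 .+ B.^2 .+ A .* B .+ A ./ B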