GPU Julia vs GPU Matlab

maleadt · November 18, 2024, 9:24am

I haven’t read the entire thread, but here’s what I think is a more compact example demonstrating the slowdown:

using CUDA, NVTX

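function main()
    # Adding 1 keeps every element at least 1, so A / B never divides by zero.
    A = CUDA.rand(151,151,151) .+ 1
    B = CUDA.rand(151,151,151) .+ 1
    C = CUDA.zeros(151,151,151) .+ 1

    # "fast": the whole expression is one scalar function, broadcast once.
    NVTX.@range "fast" begin
        for iter = 1:1000
            inner(A, B) = A^2 + B^2 + A * B + A / B
            C .= inner.(A, B)
        end
    end

    # "slow": the same expression written as a fully dotted broadcast.
    NVTX.@range "slow" begin
        for iter = 1:1000
            C .= A.^2 .+ B.^2 .+ A .* B .+ A ./ B
        end
    end
end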

julia> CUDA.@profile main()
Profiler ran for 83.79 ms, capturing 226207 events.

NVTX ranges:
┌──────────┬────────────┬───────┬───────────┐
│ Time (%) │ Total time │ Calls │ Name      │
├──────────┼────────────┼───────┼───────────┤
│   43.50% │   36.45 ms │     1 │ Main.slow │
│   22.02% │   18.45 ms │     1 │ Main.fast │
└──────────┴────────────┴───────┴───────────┘

The source of the issue seems to be the code generated by broadcast. I can reproduce the same slowdown on the CPU:

julia> A = rand(151,151,151) .+ 1;
julia> B = rand(151,151,151) .+ 1;
julia> C = zeros(151,151,151) .+ 1;

julia> using Chairmarks

julia> inner(A, B) = A^2 + B^2 + A * B + A / B
julia> @b C .= inner.(A, B)
922.458 μs (5 allocs: 144 bytes)

julia> @b C .= A.^2 .+ B.^2 .+ A .* B .+ A ./ B
3.257 ms (23 allocs: 864 bytes)
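
As a minimal sketch (not part of the original post), the two forms can be checked for equivalence on the CPU, and the lowered code of the dotted form can be printed to see what broadcast receives. The array size here is a small hypothetical one chosen for illustration, and `inner` is given lowercase scalar argument names so it does not shadow the arrays. On current Julia versions the printed lowering should show nested `Base.broadcasted` calls passed to `Base.materialize!`, but this only helps with inspection and does not by itself explain the timing gap.

```julia
# Small hypothetical inputs for an equivalence check (the post uses 151^3 arrays).
A = rand(8, 8, 8) .+ 1
B = rand(8, 8, 8) .+ 1

# Scalar kernel from the post, with scalar argument names.
inner(a, b) = a^2 + b^2 + a * b + a / b

# "fast" form: one broadcast over a single scalar function.
C_fast = inner.(A, B)

# "slow" form: the same expression written as a fully dotted broadcast.
C_slow = A.^2 .+ B.^2 .+ A .* B .+ A ./ B

# Both forms compute the same values, so only the generated code differs.
same = C_fast ≈ C_slow

# Print the lowered form of the dotted assignment for inspection.
println(Meta.@lower C_slow .= A.^2 .+ B.^2 .+ A .* B .+ A ./ B)
```

If `same` is `true`, wrapping the scalar expression in a function and broadcasting it once is a drop-in replacement for the dotted form, which is what the "fast" range above does.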

ufechner7 · November 18, 2024, 9:32am

@maleadt Perhaps create an issue?
