GPU Julia vs GPU Matlab

I did this in my main program, and indeed it improved performance. But still, the Matlab GPU performance was not reached.

Can you please share your recent Matlab and Julia code? I want to try it on my machine.

I think matrix multiplication can be excluded; in my tests the results were identical. But yes, there are some spurious small allocations. It's not a full copy though, just 160 bytes.

function lel()
    A = rand(51,51,51);
    B = rand(51,51,51);
    C = zeros(51,51,51);
    D = similar(C);
    E = similar(C);
    F = similar(C);
    function math1!(C, A, B)
        @. C = A^2 + B^2 + A * B 
        @. C += A / B - A * B - A / B + A * B + A / B - A * B - A / B 
        return C
    end 


    function math2!(D, A, B)
        @. D = A^2 + B^2 + A * B + A / B - A * B - A / B + A * B + A / B - A * B - A / B 
        return D
    end 

    @time math1!(C, A, B)
    @time math2!(D, A, B)


    @show C ≈ D 
    return nothing
end

julia> lel()
  0.119883 seconds
  0.090662 seconds (8 allocations: 160 bytes)
C ≈ D = true
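As an aside, a single `@time` call can pick up compilation and closure overhead, which may explain a handful of tiny allocations; `@btime` from BenchmarkTools with `$`-interpolated arguments reports steady-state numbers instead. A minimal sketch, assuming the same formula moved to a top-level function:

```julia
using BenchmarkTools

A = rand(51, 51, 51); B = rand(51, 51, 51)
D = zeros(51, 51, 51)

# Same fused broadcast formula as math2! above, defined at top level
math2!(D, A, B) = (@. D = A^2 + B^2 + A * B + A / B - A * B - A / B + A * B + A / B - A * B - A / B; D)

# $-interpolation avoids timing global-variable access
@btime math2!($D, $A, $B)
```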

Dear roflmaostc,
Sure. The Julia code revised by danielwe is as follows:

using CUDA, BenchmarkTools

A = CUDA.rand(151,151,151);
B = CUDA.rand(151,151,151);
C = CUDA.zeros(151,151,151);
D = similar(C);
E = similar(C);
F = similar(C);
function math1!(C, A, B)
    @. C = A^2 + B^2 + A * B + A / B - A * B - A / B + A * B + A / B - A * B - A / B
    return C
end
function math2!(D, C)
    @. D = C^2 + C^2 + C * C + C / C - C * C - C / C + C * C + C / C - C * C - C / C
    return D
end
function math3!(E, D)
    @. E = D^2 + D^2 + D * D + D / D - D * D - D / D + D * D + D / D - D * D - D / D
    return E
end
@btime for iter = 1:1000
    math1!($C, $A, $B)
    math2!($D, $C)
    math3!($E, $D)
end
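For reference, CUDA kernels launch asynchronously, so without a synchronization point `@btime` here mostly measures launch overhead rather than GPU execution; the revised version later in this thread adds one. A sketch of the synchronized timing, assuming the same `math1!`/`math2!`/`math3!` definitions as above:

```julia
using CUDA, BenchmarkTools

# CUDA.@sync makes the timing cover GPU execution, not just kernel launches
@btime CUDA.@sync for iter = 1:1000
    math1!($C, $A, $B)
    math2!($D, $C)
    math3!($E, $D)
end
```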

The Matlab code is as follows:

A = (single(rand(151,151,151,'gpuArray')));
B = (single(rand(151,151,151,'gpuArray')));
C = (single(zeros(151,151,151,'gpuArray')));
D = (single(zeros(151,151,151,'gpuArray')));
E = (single(zeros(151,151,151,'gpuArray')));
tic
for iter = 1:1000
    % disp(iter)
    C = A.^2 + B.^2 + A .* B + A ./ B - A .* B - A ./ B + A .* B + A ./ B - A .* B - A ./ B;
    D = C.^2 + C.^2 + C .* C + C ./ C - C .* C - C ./ C + C .* C + C ./ C - C .* C - C ./ C;
     E= D.^2 + D.^2 + D .* D + D ./ D - D .* D - D ./ D + D .* D + D ./ D - D .* D - D ./ D;
end
toc

That code is still 64-bit in Julia.


CUDA.rand returns a Float32 eltype, so it should be good.


Thanks. On my RTX 3060 with Matlab R2023a, Julia 1.10.6, CUDA.jl 5.5.2, and CUDA runtime 12.6, I observe ~0.58 s for Matlab and around 1.5 s for Julia.

I also removed all divisions by zero (random matrices with entries > 0), but this did not make any difference. I also checked whether Matlab might eliminate the trivial A .* B - A .* B operations, but it seems like it doesn't.

One minor mistake was not waiting for the device in Matlab.

With that I get Matlab=0.85s and Julia=1.4s

A = (single(rand(151,151,151,'gpuArray'))) + 1;
B = (single(rand(151,151,151,'gpuArray'))) + 1;
C = (single(zeros(151,151,151,'gpuArray'))) + 1;
D = (single(zeros(151,151,151,'gpuArray')));
E = (single(zeros(151,151,151,'gpuArray')));


dev = gpuDevice()
timer = tic()

for iter = 1:1000
    % disp(iter)
    C = A.^2 + B.^2 + A .* B + A ./ B - A .* B - A ./ B + A .* B + A ./ B - A .* B - A ./ B;
    D = C.^2 + C.^2 + C .* C + C ./ C - C .* C - C ./ C + C .* C + C ./ C - C .* C - C ./ C;
    E = D.^2 + D.^2 + D .* D + D ./ D - D .* D - D ./ D + D .* D + D ./ D - D .* D - D ./ D;
end
sum(E(:))
wait(dev)
toc(timer)

Julia:

using CUDA, BenchmarkTools

function f()
    A = CUDA.rand(151,151,151) .+ 1;
    B = CUDA.rand(151,151,151) .+ 1;
    C = CUDA.zeros(151,151,151) .+ 1;
    D = similar(C);
    E = similar(C);
    F = similar(C);
    @btime CUDA.@sync begin
        for iter = 1:1000
            @fastmath @. $C = $A^2 + $B^2 + $A * $B + $A / $B - $A * $B - $A / $B + $A * $B + $A / $B - $A * $B - $A / $B
            @fastmath @. $D = $C^2 + $C^2 + $C * $C + $C / $C - $C * $C - $C / $C + $C * $C + $C / $C - $C * $C - $C / $C
            @fastmath @. $E = $D^2 + $D^2 + $D * $D + $D / $D - $D * $D - $D / $D + $D * $D + $D / $D - $D * $D - $D / $D
        end
        sum($E)
    end
end

f()

 

Thank you for this test! It is interesting to note that I got the same performance as your RTX 3060 in Google Colab with a Tesla T4 GPU.
1.523 s (432000 allocations: 204.62 MiB)
Moreover, the Tesla T4 allocates less than the GTX 960.
In my humble opinion, it is really strange that Julia, which is developed for high-performance scientific computing, cannot at least match Matlab's speed here. One can say that you can write a CUDA kernel in Julia, but then again, you can run a CUDA kernel in Matlab too.
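On that note, here is a rough sketch of what such a hand-written CUDA.jl kernel could look like. Note that each `@.` statement already fuses into a single kernel, so the main potential win is fusing all three formulas into one launch and keeping the intermediates in registers; `fused3!` and the launch configuration below are illustrative, not from this thread.

```julia
using CUDA

# Hypothetical kernel fusing the three broadcast formulas into one launch.
# c, d stay in registers; only the final stores touch global memory.
function fused3!(E, D, C, A, B)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(E)
        @inbounds begin
            a = A[i]; b = B[i]
            c = a^2 + b^2 + a*b + a/b - a*b - a/b + a*b + a/b - a*b - a/b
            d = c^2 + c^2 + c*c + c/c - c*c - c/c + c*c + c/c - c*c - c/c
            C[i] = c
            D[i] = d
            E[i] = d^2 + d^2 + d*d + d/d - d*d - d/d + d*d + d/d - d*d - d/d
        end
    end
    return nothing
end

A = CUDA.rand(151, 151, 151)
B = CUDA.rand(151, 151, 151)
C, D, E = similar(A), similar(A), similar(A)

threads = 256
blocks = cld(length(E), threads)
CUDA.@sync @cuda threads=threads blocks=blocks fused3!(E, D, C, A, B)
```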

Feel free to create an issue on GitHub, including a reproducible example.

Currently, there are 24 open issues with the label "Performance".


I agree with you, but on the other hand it took me 2 minutes to find an example where Julia outperforms Matlab by a factor of 5 :slight_smile:

Matlab took 9.729 ms whereas Julia took 1.933 ms.

A = (single(rand(1024,1024,30,'gpuArray')));
B = (single(rand(1024,1024,30,'gpuArray')));

tic
dev = gpuDevice()
C = sqrt(A) .* exp(B) .* exp(1j .* B .* A);
wait(dev)
toc
using CUDA, BenchmarkTools

function f(A, B)
    return sqrt.(A) .* exp.(B) .* exp.(1im .* B .* A)
end

function f()
    A = CUDA.rand(1024, 1024, 30)
    B = CUDA.rand(1024, 1024, 30)
    @btime CUDA.@sync f($A, $B)
    return 0
end

f()

From my playing around, it seems Julia struggles a bit with the loops. Maybe there is some unnecessary synchronization happening at each loop iteration?


@roflmaostc Can you check what happens to the Matlab performance if you wait for the device inside the loop? i.e.

for iter = 1:1000
    % disp(iter)
    C = A.^2 + B.^2 + A .* B + A ./ B - A .* B - A ./ B + A .* B + A ./ B - A .* B - A ./ B;
    D = C.^2 + C.^2 + C .* C + C ./ C - C .* C - C ./ C + C .* C + C ./ C - C .* C - C ./ C;
    E = D.^2 + D.^2 + D .* D + D ./ D - D .* D - D ./ D + D .* D + D ./ D - D .* D - D ./ D;
    wait(dev)
end

or maybe even

for iter = 1:1000
    % disp(iter)
    C = A.^2 + B.^2 + A .* B + A ./ B - A .* B - A ./ B + A .* B + A ./ B - A .* B - A ./ B;
    wait(dev)
    D = C.^2 + C.^2 + C .* C + C ./ C - C .* C - C ./ C + C .* C + C ./ C - C .* C - C ./ C;
    wait(dev)
    E = D.^2 + D.^2 + D .* D + D ./ D - D .* D - D ./ D + D .* D + D ./ D - D .* D - D ./ D;
    wait(dev)
end

?

In Julia, these broadcast operations wait after each operation, so I'm curious if that's the difference.


0.942s

1.361s


Yeah, that seems to more or less account for the difference then.


Is there a way to fix it?
Is the synchronization needed? I guess not, because the statements are still worked through sequentially?

Maybe even some sort of CUDA.@no_sync to avoid synchronization in a block?

I got the same result when comparing short-formula calculations in Julia and Matlab: Julia GPU was significantly faster than Matlab GPU. That's why I rewrote my Matlab code in Julia. But my Matlab code contains many vectorized operations like in the current example, and it appears that this style of code is a bottleneck for Julia.


You could consider sharing another code snippet, with more code, in a new thread.

Maybe there are some low-hanging fruits, as in your initial post. Translating from Matlab to Julia sometimes involves subtle changes to speed it up.


Hm, no, apparently broadcast calls in Julia are not synchronizing, so this confuses me even more. Apparently only memory copies synchronize here.
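One way to check this from the Julia side is CUDA.jl's integrated profiler, `CUDA.@profile` (available in CUDA.jl 5.x; output details vary by version). Repeated synchronization calls between kernel launches in the host-side trace would indicate implicit per-statement syncs:

```julia
using CUDA

A = CUDA.rand(151, 151, 151)
B = CUDA.rand(151, 151, 151)
C = similar(A)

# Host-side API trace: look for synchronization entries between the launches
CUDA.@profile for iter = 1:10
    @. C = A^2 + B^2 + A * B
end
```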

For now, I cannot share it, since this code is part of my research work. But the basic idea is the same as in the current example. I have around 180 lines of vectorized operations (average-length formulas). I tried combining some formulas, reducing the number of lines by more than a factor of 3 and producing very long formulas like in the present example. This worsened the performance. Then I tried rewriting the code using short formulas, which doubled the number of lines. The performance was the same as in the case with average-length formulas.

I think splitting the long expressions into separate statements might reduce allocations, but the runtime was the same, even for our small examples here, right?
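A rough mental model, assuming CUDA.jl's broadcast fusion: line breaks inside a single `@.` statement change nothing, but each separate `@.` assignment compiles to its own fused kernel, so splitting one formula into two statements means two launches and an extra pass over the output array:

```julia
using CUDA

A = CUDA.rand(151, 151, 151) .+ 1
B = CUDA.rand(151, 151, 151) .+ 1
C = similar(A)

# One statement: one fused kernel, however many lines it spans
@. C = A^2 + B^2 + A * B +
       A / B - A * B

# Two statements: two kernel launches, C is written twice
@. C = A^2 + B^2 + A * B
@. C += A / B - A * B
```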

Yes, but I didn't know the effect of that when the code runs on the GPU.
