GPU Julia vs GPU Matlab

I did this in my main program, and indeed it improved performance. But still, the Matlab GPU performance was not reached.

Can you please share your recent Matlab and Julia code? I want to try it on my machine.

I think matrix multiplication can be excluded; in my tests the results were identical. But yes, there are some spurious small allocations. It's not a full copy though, just 160 bytes.

function lel()
    A = rand(51,51,51);
    B = rand(51,51,51);
    C = zeros(51,51,51);
    D = similar(C);
    E = similar(C);
    F = similar(C);
    function math1!(C, A, B)
        @. C = A^2 + B^2 + A * B 
        @. C += A / B - A * B - A / B + A * B + A / B - A * B - A / B 
        return C
    end 


    function math2!(D, A, B)
        @. D = A^2 + B^2 + A * B + A / B - A * B - A / B + A * B + A / B - A * B - A / B 
        return D
    end 

    @time math1!(C, A, B)
    @time math2!(D, A, B)


    @show C ≈ D 
    return nothing
end

julia> lel()
  0.119883 seconds
  0.090662 seconds (8 allocations: 160 bytes)
C ≈ D = true
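As an aside, a single `@time` call can pick up compilation and closure overhead, which may explain a handful of tiny allocations; `@btime` from BenchmarkTools with `$`-interpolated arguments reports steady-state numbers instead. A minimal sketch, assuming the same formula moved to a top-level function:

```julia
using BenchmarkTools

A = rand(51, 51, 51); B = rand(51, 51, 51)
D = zeros(51, 51, 51)

# Same fused broadcast formula as math2! above, defined at top level
math2!(D, A, B) = (@. D = A^2 + B^2 + A * B + A / B - A * B - A / B + A * B + A / B - A * B - A / B; D)

# $-interpolation avoids timing global-variable access
@btime math2!($D, $A, $B)
```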

Dear roflmaostc,
Sure. The Julia code revised by danielwe is as follows:

using CUDA, BenchmarkTools

A = CUDA.rand(151,151,151);
B = CUDA.rand(151,151,151);
C = CUDA.zeros(151,151,151);
D = similar(C);
E = similar(C);
F = similar(C);
function math1!(C, A, B)
    @. C = A^2 + B^2 + A * B + A / B - A * B - A / B + A * B + A / B - A * B - A / B
    return C
end
function math2!(D, C)
    @. D = C^2 + C^2 + C * C + C / C - C * C - C / C + C * C + C / C - C * C - C / C
    return D
end
function math3!(E, D)
    @. E = D^2 + D^2 + D * D + D / D - D * D - D / D + D * D + D / D - D * D - D / D
    return E
end
@btime for iter = 1:1000
    math1!($C, $A, $B)
    math2!($D, $C)
    math3!($E, $D)
end
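For reference, CUDA kernels launch asynchronously, so without a synchronization point `@btime` here mostly measures launch overhead rather than GPU execution; the revised version later in this thread adds one. A sketch of the synchronized timing, assuming the same `math1!`/`math2!`/`math3!` definitions as above:

```julia
using CUDA, BenchmarkTools

# CUDA.@sync makes the timing cover GPU execution, not just kernel launches
@btime CUDA.@sync for iter = 1:1000
    math1!($C, $A, $B)
    math2!($D, $C)
    math3!($E, $D)
end
```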

The Matlab code is as follows:

A = (single(rand(151,151,151,'gpuArray')));
B = (single(rand(151,151,151,'gpuArray')));
C = (single(zeros(151,151,151,'gpuArray')));
D = (single(zeros(151,151,151,'gpuArray')));
E = (single(zeros(151,151,151,'gpuArray')));
tic
for iter = 1:1000
    % disp(iter)
    C = A.^2 + B.^2 + A .* B + A ./ B - A .* B - A ./ B + A .* B + A ./ B - A .* B - A ./ B;
    D = C.^2 + C.^2 + C .* C + C ./ C - C .* C - C ./ C + C .* C + C ./ C - C .* C - C ./ C;
     E= D.^2 + D.^2 + D .* D + D ./ D - D .* D - D ./ D + D .* D + D ./ D - D .* D - D ./ D;
end
toc

That code is still 64-bit in Julia.


CUDA.rand returns a Float32 eltype, so it should be good.


Thanks. On my RTX 3060 with Matlab R2023a, Julia 1.10.6, CUDA.jl 5.5.2, and CUDA runtime 12.6, I observe ~0.58 s for Matlab and around 1.5 s for Julia.

I also removed all divisions by zero (random matrices with entries > 0), but this did not make any difference. I also checked whether Matlab might eliminate the trivial A .* B - A .* B operations, but it seems like it doesn't.

One minor mistake was not waiting for the device in Matlab.

With that I get Matlab=0.85s and Julia=1.4s

A = (single(rand(151,151,151,'gpuArray'))) + 1;
B = (single(rand(151,151,151,'gpuArray'))) + 1;
C = (single(zeros(151,151,151,'gpuArray'))) + 1;
D = (single(zeros(151,151,151,'gpuArray')));
E = (single(zeros(151,151,151,'gpuArray')));


dev = gpuDevice()
timer = tic()

for iter = 1:1000
    % disp(iter)
    C = A.^2 + B.^2 + A .* B + A ./ B - A .* B - A ./ B + A .* B + A ./ B - A .* B - A ./ B;
    D = C.^2 + C.^2 + C .* C + C ./ C - C .* C - C ./ C + C .* C + C ./ C - C .* C - C ./ C;
    E = D.^2 + D.^2 + D .* D + D ./ D - D .* D - D ./ D + D .* D + D ./ D - D .* D - D ./ D;
end
sum(E(:))
wait(dev)
toc(timer)

Julia:

using CUDA, BenchmarkTools

function f()
    A = CUDA.rand(151,151,151) .+ 1;
    B = CUDA.rand(151,151,151) .+ 1;
    C = CUDA.zeros(151,151,151) .+ 1;
    D = similar(C);
    E = similar(C);
    F = similar(C);
    @btime CUDA.@sync begin
        for iter = 1:1000
            @fastmath @. $C = $A^2 + $B^2 + $A * $B + $A / $B - $A * $B - $A / $B + $A * $B + $A / $B - $A * $B - $A / $B
            @fastmath @. $D = $C^2 + $C^2 + $C * $C + $C / $C - $C * $C - $C / $C + $C * $C + $C / $C - $C * $C - $C / $C
            @fastmath @. $E = $D^2 + $D^2 + $D * $D + $D / $D - $D * $D - $D / $D + $D * $D + $D / $D - $D * $D - $D / $D
        end
        sum($E)
    end
end

f()

 

Thank you for this test! It is interesting to note that I got the same performance as your RTX 3060 in Google Colab with a Tesla T4 GPU.
1.523 s (432000 allocations: 204.62 MiB)
Moreover, the Tesla T4 allocates less than the GTX 960.
In my humble opinion, it is really strange that Julia, which is developed for high-performance scientific computing, cannot at least match Matlab's speed here. One can say that you can write a CUDA kernel in Julia, but then again, you can run a CUDA kernel in Matlab too.
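On that note, here is a rough sketch of what such a hand-written CUDA.jl kernel could look like. Note that each `@.` statement already fuses into a single kernel, so the main potential win is fusing all three formulas into one launch and keeping the intermediates in registers; `fused3!` and the launch configuration below are illustrative, not from this thread.

```julia
using CUDA

# Hypothetical kernel fusing the three broadcast formulas into one launch.
# c, d stay in registers; only the final stores touch global memory.
function fused3!(E, D, C, A, B)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(E)
        @inbounds begin
            a = A[i]; b = B[i]
            c = a^2 + b^2 + a*b + a/b - a*b - a/b + a*b + a/b - a*b - a/b
            d = c^2 + c^2 + c*c + c/c - c*c - c/c + c*c + c/c - c*c - c/c
            C[i] = c
            D[i] = d
            E[i] = d^2 + d^2 + d*d + d/d - d*d - d/d + d*d + d/d - d*d - d/d
        end
    end
    return nothing
end

A = CUDA.rand(151, 151, 151)
B = CUDA.rand(151, 151, 151)
C, D, E = similar(A), similar(A), similar(A)

threads = 256
blocks = cld(length(E), threads)
CUDA.@sync @cuda threads=threads blocks=blocks fused3!(E, D, C, A, B)
```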

Feel free to create an issue on GitHub, including a reproducible example.

Currently, there are 24 open issues with the label "Performance".


I agree with you, but on the other hand it took me 2 minutes to find an example where Julia outperforms Matlab by a factor of 5 :slight_smile:

Matlab took 9.729 ms whereas Julia took 1.933 ms.

A = (single(rand(1024,1024,30,'gpuArray')));
B = (single(rand(1024,1024,30,'gpuArray')));

tic
dev = gpuDevice()
C = sqrt(A) .* exp(B) .* exp(1j .* B .* A);
wait(dev)
toc
using CUDA, BenchmarkTools

function f(A, B)
    return sqrt.(A) .* exp.(B) .* exp.(1im .* B .* A)
end

function f()
    A = CUDA.rand(1024, 1024, 30)
    B = CUDA.rand(1024, 1024, 30)
    @btime CUDA.@sync f($A, $B)
    return 0
end

f()

From my playing around, it seems Julia struggles a bit with the loops. Maybe there is some unnecessary synchronization happening at each loop iteration?


@roflmaostc Can you check what happens to the Matlab performance if you wait for the device inside the loop? i.e.

for iter = 1:1000
    % disp(iter)
    C = A.^2 + B.^2 + A .* B + A ./ B - A .* B - A ./ B + A .* B + A ./ B - A .* B - A ./ B;
    D = C.^2 + C.^2 + C .* C + C ./ C - C .* C - C ./ C + C .* C + C ./ C - C .* C - C ./ C;
    E = D.^2 + D.^2 + D .* D + D ./ D - D .* D - D ./ D + D .* D + D ./ D - D .* D - D ./ D;
    wait(dev)
end

or maybe even

for iter = 1:1000
    % disp(iter)
    C = A.^2 + B.^2 + A .* B + A ./ B - A .* B - A ./ B + A .* B + A ./ B - A .* B - A ./ B;
    wait(dev)
    D = C.^2 + C.^2 + C .* C + C ./ C - C .* C - C ./ C + C .* C + C ./ C - C .* C - C ./ C;
    wait(dev)
    E = D.^2 + D.^2 + D .* D + D ./ D - D .* D - D ./ D + D .* D + D ./ D - D .* D - D ./ D;
    wait(dev)
end

?

In Julia, these broadcast operations wait after each operation, so I'm curious if that's the difference.


0.942s

1.361s


Yeah, that seems to more or less account for the difference then.


Is there a way to fix it?
Is the synchronization needed? I guess not, because the statements are still worked through sequentially?

Maybe even some sort of CUDA.@no_sync to avoid synchronization in a block?

I got the same result when comparing short-formula calculations in Julia and Matlab: Julia GPU was significantly faster than Matlab GPU. That's why I rewrote my Matlab code in Julia. But my Matlab code contains many vectorized operations like in the current example, and it appears that this style of code is a bottleneck for Julia.


You could consider sharing another code snippet, with more code, in a new thread.

Maybe there are some low-hanging fruits, as in your initial post. Translating from Matlab to Julia sometimes involves subtle changes to speed it up.


Hm, no, apparently broadcast calls in Julia are not synchronizing, so this confuses me even more. Apparently only memory copies synchronize here.
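One way to check this from the Julia side is CUDA.jl's integrated profiler, `CUDA.@profile` (available in CUDA.jl 5.x; output details vary by version). Repeated synchronization calls between kernel launches in the host-side trace would indicate implicit per-statement syncs:

```julia
using CUDA

A = CUDA.rand(151, 151, 151)
B = CUDA.rand(151, 151, 151)
C = similar(A)

# Host-side API trace: look for synchronization entries between the launches
CUDA.@profile for iter = 1:10
    @. C = A^2 + B^2 + A * B
end
```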

For now, I cannot share it, since this code is part of my research work. But the basic idea is the same as in the current example. I have around 180 lines of vectorized operations (average-length formulas). I tried combining some formulas, reducing the number of lines by more than a factor of 3 and producing very long formulas like in the present example. This worsened the performance. Then I tried rewriting the code using short formulas, which doubled the number of lines. The performance was the same as in the case with average-length formulas.

I think splitting the long expressions into separate statements might reduce allocations, but the runtime was the same, even for our small examples here, right?
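A rough mental model, assuming CUDA.jl's broadcast fusion: line breaks inside a single `@.` statement change nothing, but each separate `@.` assignment compiles to its own fused kernel, so splitting one formula into two statements means two launches and an extra pass over the output array:

```julia
using CUDA

A = CUDA.rand(151, 151, 151) .+ 1
B = CUDA.rand(151, 151, 151) .+ 1
C = similar(A)

# One statement: one fused kernel, however many lines it spans
@. C = A^2 + B^2 + A * B +
       A / B - A * B

# Two statements: two kernel launches, C is written twice
@. C = A^2 + B^2 + A * B
@. C += A / B - A * B
```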

Yes, but I didn't know the effect of that when the code runs on the GPU.
