I did this in my main program, and indeed it improved performance. But it still did not reach the Matlab GPU performance.
Can you please share your most recent Matlab and Julia code? I want to try it on my machine.
I think matrix multiplication is excluded; in my tests the results were identical. But yes, there are some spurious small allocations. It's not a full copy, though, just 160 bytes.
function lel()
    A = rand(51, 51, 51)
    B = rand(51, 51, 51)   # note: rand(A) would sample a single element from A, not build a same-size array
    C = zeros(51, 51, 51)
    D = similar(C)
    E = similar(C)
    F = similar(C)
    function math1!(C, A, B)
        @. C = A^2 + B^2 + A * B
        @. C += A / B - A * B - A / B + A * B + A / B - A * B - A / B
        return C
    end
    function math2!(D, A, B)
        @. D = A^2 + B^2 + A * B + A / B - A * B - A / B + A * B + A / B - A * B - A / B
        return D
    end
    @time math1!(C, A, B)
    @time math2!(D, A, B)
    @show C ≈ D
    return nothing
end
julia> lel()
0.119883 seconds
0.090662 seconds (8 allocations: 160 bytes)
C ≈ D = true
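As an aside, `@time` also measures compilation on the first call. A minimal sketch using BenchmarkTools (assuming `math1!` and `math2!` are moved to top level) gives steadier numbers:

using BenchmarkTools

A = rand(51, 51, 51)
B = rand(51, 51, 51)
C = zeros(51, 51, 51)

# $-interpolation avoids benchmarking access to untyped globals.
@btime math1!($C, $A, $B)
@btime math2!($C, $A, $B)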
Dear roflmaostc,
Sure. The Julia code revised by danielwe is as follows:
A = CUDA.rand(151,151,151);
B = CUDA.rand(151,151,151);
C = CUDA.zeros(151,151,151);
D = similar(C);
E = similar(C);
F = similar(C);
function math1!(C, A, B)
    @. C = A^2 + B^2 + A * B + A / B - A * B - A / B + A * B + A / B - A * B - A / B
    return C
end
function math2!(D, C)
    @. D = C^2 + C^2 + C * C + C / C - C * C - C / C + C * C + C / C - C * C - C / C
    return D
end
function math3!(E, D)
    @. E = D^2 + D^2 + D * D + D / D - D * D - D / D + D * D + D / D - D * D - D / D
    return E
end
@btime for iter = 1:1000
    math1!($C, $A, $B)
    math2!($D, $C)
    math3!($E, $D)
end
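Note that this `@btime` loop contains no `CUDA.@sync`, so it may mostly measure kernel launch overhead rather than completed GPU work; the revised version further down wraps the loop in `CUDA.@sync`. As a sketch:

@btime CUDA.@sync for iter = 1:1000
    math1!($C, $A, $B)
    math2!($D, $C)
    math3!($E, $D)
end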
The Matlab code is as follows:
A = (single(rand(151,151,151,'gpuArray')));
B = (single(rand(151,151,151,'gpuArray')));
C = (single(zeros(151,151,151,'gpuArray')));
D = (single(zeros(151,151,151,'gpuArray')));
E = (single(zeros(151,151,151,'gpuArray')));
tic
for iter = 1:1000
    % disp(iter)
    C = A.^2 + B.^2 + A .* B + A ./ B - A .* B - A ./ B + A .* B + A ./ B - A .* B - A ./ B;
    D = C.^2 + C.^2 + C .* C + C ./ C - C .* C - C ./ C + C .* C + C ./ C - C .* C - C ./ C;
    E = D.^2 + D.^2 + D .* D + D ./ D - D .* D - D ./ D + D .* D + D ./ D - D .* D - D ./ D;
end
toc
That Julia code is still 64-bit.
This returns a Float32 eltype, so it should be good.
Thanks. On my RTX 3060, with Matlab R2023a, Julia 1.10.6, CUDA.jl 5.5.2, and CUDA runtime 12.6, I observe ~0.58 s for Matlab and around 1.5 s for Julia.
I also removed all divisions by zero (random matrices with entries > 0), but this did not make any difference. I also checked whether Matlab might eliminate the trivial A .* B - A .* B operations, but it seems like it doesn't.
One minor mistake is not waiting for the device in Matlab. With that fixed, I get Matlab = 0.85 s and Julia = 1.4 s.
A = (single(rand(151,151,151,'gpuArray'))) + 1;
B = (single(rand(151,151,151,'gpuArray'))) + 1;
C = (single(zeros(151,151,151,'gpuArray'))) + 1;
D = (single(zeros(151,151,151,'gpuArray')));
E = (single(zeros(151,151,151,'gpuArray')));
dev = gpuDevice()
timer = tic()
for iter = 1:1000
    % disp(iter)
    C = A.^2 + B.^2 + A .* B + A ./ B - A .* B - A ./ B + A .* B + A ./ B - A .* B - A ./ B;
    D = C.^2 + C.^2 + C .* C + C ./ C - C .* C - C ./ C + C .* C + C ./ C - C .* C - C ./ C;
    E = D.^2 + D.^2 + D .* D + D ./ D - D .* D - D ./ D + D .* D + D ./ D - D .* D - D ./ D;
end
sum(E(:))
wait(dev)
toc(timer)
Julia:
using CUDA, BenchmarkTools
function f()
    A = CUDA.rand(151, 151, 151) .+ 1
    B = CUDA.rand(151, 151, 151) .+ 1
    C = CUDA.zeros(151, 151, 151) .+ 1
    D = similar(C)
    E = similar(C)
    F = similar(C)
    @btime CUDA.@sync begin
        for iter = 1:1000
            @fastmath @. $C = $A^2 + $B^2 + $A * $B + $A / $B - $A * $B - $A / $B + $A * $B + $A / $B - $A * $B - $A / $B
            @fastmath @. $D = $C^2 + $C^2 + $C * $C + $C / $C - $C * $C - $C / $C + $C * $C + $C / $C - $C * $C - $C / $C
            @fastmath @. $E = $D^2 + $D^2 + $D * $D + $D / $D - $D * $D - $D / $D + $D * $D + $D / $D - $D * $D - $D / $D
        end
        sum($E)
    end
end
f()
f()
Thank you for this test! It is interesting that I got the same performance as your RTX 3060 in Google Colab with a Tesla T4 GPU.
1.523 s (432000 allocations: 204.62 MiB)
Moreover, the Tesla T4 allocates less than the GTX 960.
In my humble opinion, it is really strange that Julia, which is developed for high-performance scientific computing, cannot at least reach Matlab's speed. One could say that you can write a custom CUDA kernel in Julia, but then again, you can run a CUDA kernel in Matlab too.
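For reference, here is a minimal sketch (not from the original posts; the kernel and the helper `fused!` are illustrative) of how the three dependent elementwise formulas could be fused into a single hand-written CUDA.jl kernel, so that each iteration issues one launch instead of three:

using CUDA

# Each thread computes C, D, and E for one element, so the three
# dependent broadcasts collapse into a single kernel launch.
function fused_kernel!(E, D, C, A, B)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(E)
        @inbounds begin
            a, b = A[i], B[i]
            c = a^2 + b^2 + a * b + a / b - a * b - a / b + a * b + a / b - a * b - a / b
            d = c^2 + c^2 + c * c + c / c - c * c - c / c + c * c + c / c - c * c - c / c
            C[i] = c
            D[i] = d
            E[i] = d^2 + d^2 + d * d + d / d - d * d - d / d + d * d + d / d - d * d - d / d
        end
    end
    return nothing
end

function fused!(E, D, C, A, B)
    threads = 256
    blocks = cld(length(E), threads)
    @cuda threads=threads blocks=blocks fused_kernel!(E, D, C, A, B)
    return E
end

Calling fused!(E, D, C, A, B) inside the timing loop then replaces the three broadcast statements per iteration.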
Feel free to create an issue, including a reproducible example, on GitHub. Currently, there are 24 open issues with the label "Performance".
I agree with you, but on the other hand it took me two minutes to find an example where Julia outperforms Matlab by a factor of 5: Matlab took 9.729 ms whereas Julia took 1.933 ms.
A = (single(rand(1024,1024,30,'gpuArray')));
B = (single(rand(1024,1024,30,'gpuArray')));
dev = gpuDevice();
tic
C = sqrt(A) .* exp(B) .* exp(1j .* B .* A);
wait(dev)
toc
using CUDA, BenchmarkTools
function f(A, B)
    return sqrt.(A) .* exp.(B) .* exp.(1im .* B .* A)
end
function f()
    A = CUDA.rand(1024, 1024, 30)
    B = CUDA.rand(1024, 1024, 30)
    @btime CUDA.@sync f($A, $B)
    return 0
end
f()
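As a follow-up sketch (the in-place `f!` and the preallocated complex output are not from the original post): reusing a preallocated buffer would also avoid allocating a new ComplexF32 array on every call.

using CUDA, BenchmarkTools

# In-place variant: write into a preallocated complex buffer.
function f!(C, A, B)
    @. C = sqrt(A) * exp(B) * exp(1im * B * A)
    return C
end

A = CUDA.rand(1024, 1024, 30)
B = CUDA.rand(1024, 1024, 30)
C = CUDA.zeros(ComplexF32, 1024, 1024, 30)
@btime CUDA.@sync f!($C, $A, $B)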
From my experimenting, it seems Julia struggles a bit with the loops. Maybe there is some unnecessary synchronization happening at each iteration?
@roflmaostc Can you check what happens to the Matlab performance if you wait for the device inside the loop? i.e.
for iter = 1:1000
    % disp(iter)
    C = A.^2 + B.^2 + A .* B + A ./ B - A .* B - A ./ B + A .* B + A ./ B - A .* B - A ./ B;
    D = C.^2 + C.^2 + C .* C + C ./ C - C .* C - C ./ C + C .* C + C ./ C - C .* C - C ./ C;
    E = D.^2 + D.^2 + D .* D + D ./ D - D .* D - D ./ D + D .* D + D ./ D - D .* D - D ./ D;
    wait(dev)
end
or maybe even
for iter = 1:1000
    % disp(iter)
    C = A.^2 + B.^2 + A .* B + A ./ B - A .* B - A ./ B + A .* B + A ./ B - A .* B - A ./ B;
    wait(dev)
    D = C.^2 + C.^2 + C .* C + C ./ C - C .* C - C ./ C + C .* C + C ./ C - C .* C - C ./ C;
    wait(dev)
    E = D.^2 + D.^2 + D .* D + D ./ D - D .* D - D ./ D + D .* D + D ./ D - D .* D - D ./ D;
    wait(dev)
end
?
In Julia, these broadcast operations wait after each operation, so I'm curious whether that's the difference.
0.942 s
1.361 s
Yeah, that seems to more or less account for the difference then.
Is there a way to fix it?
Is the synchronization needed? I guess not, since the statements are still executed sequentially on the same stream?
Maybe we even need some sort of CUDA.@no_sync to avoid synchronization within a block?
I got the same result when comparing calculations with short formulas in Julia and Matlab: Julia GPU was significantly faster than Matlab GPU. That's why I rewrote my Matlab code in Julia. But my Matlab code contains many vectorized operations like those in the current example, and it appears that this style of code is a bottleneck for Julia.
You could consider sharing another code snippet from your codebase in a new thread. Maybe there are some low-hanging fruits, as in your initial post. Translating from Matlab to Julia sometimes involves subtle changes to speed things up.
Hm, no, apparently broadcast calls in Julia are not synchronizing, so this confuses me even more. Apparently only memory copies synchronize here.
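One way to check (a sketch, assuming the integrated profiler in recent CUDA.jl) is to run a couple of broadcasts under CUDA.@profile, which lists host time per API call and would expose any per-statement synchronization:

using CUDA

A = CUDA.rand(151, 151, 151)
B = CUDA.rand(151, 151, 151)
C = similar(A)

# The profiler trace shows kernel launches, copies, and any synchronization calls.
CUDA.@profile begin
    @. C = A^2 + B^2 + A * B
    @. C = C^2 + C * C
end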
For now, I cannot do that, since this code is part of my research work. But the basic idea is the same as in the current example. I have around 180 lines of vectorized operations (formulas of average length). I tried combining some formulas, reducing the number of lines by more than a factor of 3 and producing very long formulas like in the present example; that worsened the performance. Then I tried rewriting the code using short formulas, doubling the number of lines; the performance was the same as with the average-length formulas.
I think splitting the long formulas over several lines might reduce allocations, but the runtime stayed the same, even for our small examples here, right?
Yes, but I didn't know what effect that has when the code runs on the GPU.