GPU Julia vs GPU Matlab

Hello,
I have tested a simple math example in Julia using GPU computing on a GTX 960. The code is as follows:

using CUDA, BenchmarkTools

A = CUDA.rand(151,151,151);
B = similar(A);
C = CUDA.zeros(151,151,151);
D = similar(C);
E = similar(C);
F = similar(C);
@. function math1(A,B)
    C = A.^2 .+ B.^2 .+ A .* B .+ A ./ B .- A .* B .- A ./ B .+ A .* B .+ A ./ B .- A .* B .- A ./ B;
    return C
end
@. function math2(C)
    D = C.^2 .+ C.^2 .+ C .* C .+ C ./ C .- C .* C .- C ./ C .+ C .* C .+ C ./ C .- C .* C .- C ./ C;
    return D
end
@. function math3(D)
    E= D.^2 .+ D.^2 .+ D .* D .+ D ./ D .- D .* D .- D ./ D .+ D .* D .+ D ./ D .- D .* D .- D ./ D;
    return E
end
@btime for iter = 1:1000
    C = math1(A,B);
    D = math2(C);
    E = math3(D)
end
10.619 s (1671000 allocations: 26.09 MiB)

And I ran the same code in MATLAB (2023b) and got around 2.7 seconds.
A similar performance problem was discussed in “Why Julia is much slower than MATLAB on GPU computing?”
So I am curious: is this the expected performance behavior? Can Julia be faster than MATLAB without writing a GPU kernel manually?

1 Like

Here’s a more idiomatic Julia version:

A = CUDA.rand(151,151,151);
B = similar(A);
C = CUDA.zeros(151,151,151);
D = similar(C);
E = similar(C);
F = similar(C);
function math1!(C, A, B)
    @. C = A^2 + B^2 + A * B + A / B - A * B - A / B + A * B + A / B - A * B - A / B
    return C
end
function math2!(D, C)
    @. D = C^2 + C^2 + C * C + C / C - C * C - C / C + C * C + C / C - C * C - C / C
    return D
end
function math3!(E, D)
    @. E = D^2 + D^2 + D * D + D / D - D * D - D / D + D * D + D / D - D * D - D / D
    return E
end
@btime for iter = 1:1000
    math1!($C, $A, $B)
    math2!($D, $C)
    math3!($E, $D)
end

What time do you get with this code?

Key points:

  • If you want to mutate your preallocated arrays C, D, E, you need to pass them to the functions as arguments and apply a mutating operation to them. C = ... does not mutate; it creates a new variable C. However, @. C = ... mutates because it translates to C .= ..., which is the syntax for in-place broadcasting (see the sketch after this list).
  • The macro @. goes in front of an expression to insert dots (i.e., fused broadcasting) at every subexpression. When you use it, you don’t have to insert any dots manually. (I’ve never seen it placed in front of the entire method definition before, not sure what it would do there.)
  • When benchmarking with @btime you should interpolate global variables with $, otherwise the expression can’t be type inferred and compiled to efficient code.
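
For illustration, here's a minimal CPU sketch (tiny throwaway arrays, no GPU needed) of the difference between rebinding a name and mutating in place, and of what @. does:

using BenchmarkTools

A, B = rand(Float32, 4), rand(Float32, 4)
C = zeros(Float32, 4)

Cnew = A .+ B    # broadcasts into a freshly allocated array; C is untouched
C .= A .+ B      # in-place broadcast: fills the preallocated C without allocating a new array

# @. simply inserts the dots for you; the next two lines are equivalent:
@. C = A^2 + B^2
C .= A .^ 2 .+ B .^ 2

# @macroexpand shows what a macro turns an expression into:
@macroexpand @. C = A^2 + B^2    # a fully dotted, fused assignment, equivalent to C .= A .^ 2 .+ B .^ 2

# When benchmarking code that reads non-constant globals, interpolate them with $:
@btime $C .= $A .^ 2 .+ $B .^ 2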

Minor point: you don’t need to terminate lines with a semicolon in Julia.

Let us know how the updated code performs!

3 Likes

Dear Daniel,
Many thanks for your response. I tried a very similar code modification before and it gave me the same result. The code as revised by you gives this output:
10.512 s (1698000 allocations: 26.64 MiB)
As I understand it, Julia doesn't like long formulas. Previously, I ran another performance test with no more than 3 operations (a sum and two squarings). In that case, Julia significantly outperformed MATLAB.

1 Like

Is it possible that dot fusion gives up after too many operations?
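
One way to probe that on the CPU (just a sketch; the GPU code path may of course behave differently) is to compare allocations for a short and a long fused broadcast into a preallocated destination:

using BenchmarkTools

x = rand(Float32, 151, 151, 151); y = rand(Float32, 151, 151, 151); z = similar(x);

@btime @. $z = $x + $y;    # short fused expression
@btime @. $z = $x^2 + $y^2 + $x*$y + $x/$y - $x*$y - $x/$y + $x*$y + $x/$y - $x*$y - $x/$y;    # long fused expression

# If fusion holds up, both should report zero (or near-zero) allocations.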

1 Like

Try putting the whole computation in a function to which you pass A, B, C, D, and E.
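
Something like the following, for example (a sketch reusing the math1!/math2!/math3! definitions above; run_all! is just a made-up name):

function run_all!(C, D, E, A, B, iters)
    # everything the loop touches is a function argument,
    # so no non-constant globals appear inside the hot loop
    for _ in 1:iters
        math1!(C, A, B)
        math2!(D, C)
        math3!(E, D)
    end
    return E
end

@btime run_all!($C, $D, $E, $A, $B, 1000)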

Dear rveltz,
I have tried that and it gave the same result.

You start with zeros and then get NaNs, right? Not sure how the GPU handles that.

@rveltz might be onto something. B is never initialized; similar returns uninitialized memory. What did you intend to fill B with before your computations?
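
For reference, two ways to get B filled with random numbers (a sketch; as far as I know CUDA.rand! fills an existing CuArray in place):

B = CUDA.rand(Float32, 151, 151, 151)    # allocate and fill in one step

# or keep the preallocated buffer and fill it in place:
B = similar(A)
CUDA.rand!(B)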

1 Like

B = CUDA.rand(151, 151, 151) helped a little bit on my end, but not much. Perhaps there’s a limit to loop fusion as @gdalle suggests? Broadcast support for GPU arrays, including CuArrays, is implemented using KernelAbstractions, which I’ve never used. Here’s the relevant method: GPUArrays.jl/src/host/broadcast.jl at 77d7d5104cd1deea1a22343ff6fad2b7d65b3c28 · JuliaGPU/GPUArrays.jl · GitHub

1 Like

You are right. It should be
B = CUDA.rand(151,151,151)
Then there are no more NaN values. However, it only slightly affected the performance. The output is as follows:
9.379 s (1671000 allocations: 26.09 MiB)

Thanks for the link!

You’re using Float64 on a pretty old GPU, which is really slow for double precision. To give the GPU a chance, you should switch to Float32.

2 Likes

I am wondering if MATLAB is doing some Float32 conversions…

Is the code non-allocating when run on the CPU?
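
For example, a minimal CPU check could look like this (sketch, using plain Arrays of the same size and the mutating functions from above):

A = rand(Float32, 151, 151, 151); B = rand(Float32, 151, 151, 151);
C = zeros(Float32, 151, 151, 151);

@btime math1!($C, $A, $B)    # ideally this reports 0 allocations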

1 Like

Dear sdanisch,
I used Float32 with both Julia and MATLAB.
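
A quick way to verify the element type on the Julia side (as far as I know, CUDA.rand defaults to Float32 unless a type is given explicitly):

julia> eltype(CUDA.rand(2, 2))
Float32

julia> eltype(CUDA.rand(Float64, 2, 2))
Float64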

Dear tamasgal,
Before the computations, I created the gpuArrays in MATLAB with single precision (Float32).

Dear Imiq,
My version of the code, run on the CPU, gives this output:
45.033 s (15000 allocations: 656.25 KiB)
The code as revised by danielwe gives this result:
47.197 s (33000 allocations: 1.01 MiB)

Then there is an issue with broadcasting fusion. Maybe the * symbols are still performing matrix multiplication? I never fully trust @. because I’m not sure how it works
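
A quick sanity check (sketch): plain * is not even defined for a pair of 3-D arrays, so if the broadcasted version ran at all, those * calls must have been element-wise:

x = rand(Float32, 2, 2, 2); y = rand(Float32, 2, 2, 2);

x * y     # MethodError: no * method for two 3-D arrays (matrix multiplication doesn't apply)
x .* y    # element-wise product with the same shape as x and y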

1 Like

Without @., I get the same results when performing element-wise operations:
9.458 s (1671000 allocations: 26.09 MiB)

Then I would probably try to split that long line into a few smaller lines, to see what happens.

Edit: indeed, if you break up that long line, it becomes non-allocating on the CPU:

A = rand(51,51,51);
B = rand(51,51,51);
C = zeros(51,51,51);
D = similar(C);
E = similar(C);
F = similar(C);
function math1!(C, A, B)
    @. C = A^2 + B^2 + A * B 
    @. C += A / B - A * B - A / B + A * B + A / B - A * B - A / B
    return C
end

before:

julia> @time math1!(C,A,B);
  0.000845 seconds (11 allocations: 352 bytes)

after:

julia> @time math1!(C,A,B);
  0.000966 seconds

(this was with smaller arrays, just for testing)

4 Likes