GPU Julia vs GPU Matlab

Hello,
I have tested a simple math example in Julia using GPU computing on a GTX 960. The code is as follows:

using CUDA, BenchmarkTools

A = CUDA.rand(151,151,151);
B = similar(A);
C = CUDA.zeros(151,151,151);
D = similar(C);
E = similar(C);
F = similar(C);
@. function math1(A,B)
    C = A.^2 .+ B.^2 .+ A .* B .+ A ./ B .- A .* B .- A ./ B .+ A .* B .+ A ./ B .- A .* B .- A ./ B;
    return C
end
@. function math2(C)
    D = C.^2 .+ C.^2 .+ C .* C .+ C ./ C .- C .* C .- C ./ C .+ C .* C .+ C ./ C .- C .* C .- C ./ C;
    return D
end
@. function math3(D)
    E= D.^2 .+ D.^2 .+ D .* D .+ D ./ D .- D .* D .- D ./ D .+ D .* D .+ D ./ D .- D .* D .- D ./ D;
    return E
end
@btime for iter = 1:1000
    C = math1(A,B);
    D = math2(C);
    E = math3(D)
end
10.619 s (1671000 allocations: 26.09 MiB)

And I ran the same code in MATLAB (2023b) and got around 2.7 seconds.
A similar performance problem was discussed in “Why Julia is much slower than MATLAB on GPU computing?”
So I am curious: is this the expected performance behavior? Can Julia be faster than MATLAB without writing a GPU kernel manually?

1 Like

Here’s a more idiomatic Julia version:

A = CUDA.rand(151,151,151);
B = similar(A);
C = CUDA.zeros(151,151,151);
D = similar(C);
E = similar(C);
F = similar(C);
function math1!(C, A, B)
    @. C = A^2 + B^2 + A * B + A / B - A * B - A / B + A * B + A / B - A * B - A / B
    return C
end
function math2!(D, C)
    @. D = C^2 + C^2 + C * C + C / C - C * C - C / C + C * C + C / C - C * C - C / C
    return D
end
function math3!(E, D)
    @. E = D^2 + D^2 + D * D + D / D - D * D - D / D + D * D + D / D - D * D - D / D
    return E
end
@btime for iter = 1:1000
    math1!($C, $A, $B)
    math2!($D, $C)
    math3!($E, $D)
end

What time do you get with this code?

Key points:

  • If you want to mutate your preallocated arrays C, D, E, you need to pass them to the functions as arguments and apply a mutating operation to them. C = ... does not mutate; it creates a new variable C. However, @. C = ... mutates because it translates to C .= ..., which is the syntax for in-place broadcasting (see the sketch after this list).
  • The macro @. goes in front of an expression to insert dots (i.e., fused broadcasting) at every subexpression. When you use it, you don’t have to insert any dots manually. (I’ve never seen it placed in front of the entire method definition before, not sure what it would do there.)
  • When benchmarking with @btime you should interpolate global variables with $, otherwise the expression can’t be type inferred and compiled to efficient code.
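
For illustration, here's a minimal CPU sketch (tiny throwaway arrays, no GPU needed) of the difference between rebinding a name and mutating in place, and of what @. does:

using BenchmarkTools

A, B = rand(Float32, 4), rand(Float32, 4)
C = zeros(Float32, 4)

Cnew = A .+ B    # broadcasts into a freshly allocated array; C is untouched
C .= A .+ B      # in-place broadcast: fills the preallocated C without allocating a new array

# @. simply inserts the dots for you; the next two lines are equivalent:
@. C = A^2 + B^2
C .= A .^ 2 .+ B .^ 2

# @macroexpand shows what a macro turns an expression into:
@macroexpand @. C = A^2 + B^2    # a fully dotted, fused assignment, equivalent to C .= A .^ 2 .+ B .^ 2

# When benchmarking code that reads non-constant globals, interpolate them with $:
@btime $C .= $A .^ 2 .+ $B .^ 2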

Minor point: you don’t need to terminate lines with a semicolon in Julia.

Let us know how the updated code performs!

3 Likes

Dear Daniel,
Many thanks for your response. I tried a very similar code modification before and it gave me the same result. The code as revised by you gives this output:
10.512 s (1698000 allocations: 26.64 MiB)
As I understand it, Julia doesn't like long formulas. Previously, I ran another performance test with no more than 3 operations (a sum and two squarings). In that case, Julia significantly outperformed MATLAB.

1 Like

Is it possible that dot fusion gives up after too many operations?
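
One way to probe that on the CPU (just a sketch; the GPU code path may of course behave differently) is to compare allocations for a short and a long fused broadcast into a preallocated destination:

using BenchmarkTools

x = rand(Float32, 151, 151, 151); y = rand(Float32, 151, 151, 151); z = similar(x);

@btime @. $z = $x + $y;    # short fused expression
@btime @. $z = $x^2 + $y^2 + $x*$y + $x/$y - $x*$y - $x/$y + $x*$y + $x/$y - $x*$y - $x/$y;    # long fused expression

# If fusion holds up, both should report zero (or near-zero) allocations.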

1 Like

Try putting the whole computation in a function to which you pass A, B, C, D, and E.
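
Something like the following, for example (a sketch reusing the math1!/math2!/math3! definitions above; run_all! is just a made-up name):

function run_all!(C, D, E, A, B, iters)
    # everything the loop touches is a function argument,
    # so no non-constant globals appear inside the hot loop
    for _ in 1:iters
        math1!(C, A, B)
        math2!(D, C)
        math3!(E, D)
    end
    return E
end

@btime run_all!($C, $D, $E, $A, $B, 1000)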

Dear rveltz,
I have tried that and it gave the same result.

You start with zeros and then get NaNs, right? Not sure how the GPU handles that.

@rveltz might be onto something. B is never initialized; similar returns uninitialized memory. What did you intend to fill B with before your computations?
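
For reference, two ways to get B filled with random numbers (a sketch; as far as I know CUDA.rand! fills an existing CuArray in place):

B = CUDA.rand(Float32, 151, 151, 151)    # allocate and fill in one step

# or keep the preallocated buffer and fill it in place:
B = similar(A)
CUDA.rand!(B)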

1 Like

B = CUDA.rand(151, 151, 151) helped a little bit on my end, but not much. Perhaps there’s a limit to loop fusion as @gdalle suggests? Broadcast support for GPU arrays, including CuArrays, is implemented using KernelAbstractions, which I’ve never used. Here’s the relevant method: GPUArrays.jl/src/host/broadcast.jl at 77d7d5104cd1deea1a22343ff6fad2b7d65b3c28 · JuliaGPU/GPUArrays.jl · GitHub

1 Like

You are right. It should be
B = CUDA.rand(151,151,151)
Then there are no more NaN values. However, it only slightly affected the performance. The output is as follows:
9.379 s (1671000 allocations: 26.09 MiB)

Thanks for the link!

You’re using Float64 on a pretty old GPU, which is really slow for double precision. To give the GPU a chance, you should switch to Float32.

2 Likes

I am wondering if MATLAB is doing some Float32 conversions…

Is the code non-allocating when run on the CPU?
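
For example, a minimal CPU check could look like this (sketch, using plain Arrays of the same size and the mutating functions from above):

A = rand(Float32, 151, 151, 151); B = rand(Float32, 151, 151, 151);
C = zeros(Float32, 151, 151, 151);

@btime math1!($C, $A, $B)    # ideally this reports 0 allocations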

1 Like

Dear sdanisch,
I used Float32 with both Julia and MATLAB.
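
A quick way to verify the element type on the Julia side (as far as I know, CUDA.rand defaults to Float32 unless a type is given explicitly):

julia> eltype(CUDA.rand(2, 2))
Float32

julia> eltype(CUDA.rand(Float64, 2, 2))
Float64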

Dear tamasgal,
Before the computations, I created the gpuArrays in MATLAB with single precision (Float32).

Dear Imiq,
My version of the code, run on the CPU, gives this output:
45.033 s (15000 allocations: 656.25 KiB)
The code as revised by danielwe gives this result:
47.197 s (33000 allocations: 1.01 MiB)

Then there is an issue with broadcasting fusion. Maybe the * symbols are still performing matrix multiplication? I never fully trust @. because I’m not sure how it works
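
A quick sanity check (sketch): plain * is not even defined for a pair of 3-D arrays, so if the broadcasted version ran at all, those * calls must have been element-wise:

x = rand(Float32, 2, 2, 2); y = rand(Float32, 2, 2, 2);

x * y     # MethodError: no * method for two 3-D arrays (matrix multiplication doesn't apply)
x .* y    # element-wise product with the same shape as x and y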

1 Like

Without @., I get the same results when performing element-wise operations:
9.458 s (1671000 allocations: 26.09 MiB)

Then I would probably try to split that long line into a few smaller lines, to see what happens.

Edit: indeed, if you break up that long line, it becomes non-allocating on the CPU:

A = rand(51,51,51);
B = rand(51,51,51);
C = zeros(51,51,51);
D = similar(C);
E = similar(C);
F = similar(C);
function math1!(C, A, B)
    @. C = A^2 + B^2 + A * B 
    @. C += A / B - A * B - A / B + A * B + A / B - A * B - A / B
    return C
end

before:

julia> @time math1!(C,A,B);
  0.000845 seconds (11 allocations: 352 bytes)

after:

julia> @time math1!(C,A,B);
  0.000966 seconds

(this was with smaller arrays, just for testing)

4 Likes