Hello,
I have tested a simple math example in Julia using GPU computing on a GTX 960. The code is as follows:
A = CUDA.rand(151,151,151);
B = similar(A);
C = CUDA.zeros(151,151,151);
D = similar(C);
E = similar(C);
F = similar(C);
@. function math1(A,B)
    C = A.^2 .+ B.^2 .+ A .* B .+ A ./ B .- A .* B .- A ./ B .+ A .* B .+ A ./ B .- A .* B .- A ./ B;
    return C
end
@. function math2(C)
    D = C.^2 .+ C.^2 .+ C .* C .+ C ./ C .- C .* C .- C ./ C .+ C .* C .+ C ./ C .- C .* C .- C ./ C;
    return D
end
@. function math3(D)
    E = D.^2 .+ D.^2 .+ D .* D .+ D ./ D .- D .* D .- D ./ D .+ D .* D .+ D ./ D .- D .* D .- D ./ D;
    return E
end
@btime for iter = 1:1000
    C = math1(A,B);
    D = math2(C);
    E = math3(D)
end
10.619 s (1671000 allocations: 26.09 MiB)
I ran the same code in MATLAB (2023b) and got around 2.7 seconds.
A similar performance problem was discussed in "Why Julia is much slower than MATLAB on GPU computing?"
So I am curious: is this expected performance behavior? Can Julia be faster than MATLAB without writing a GPU kernel manually?
A = CUDA.rand(151,151,151);
B = similar(A);
C = CUDA.zeros(151,151,151);
D = similar(C);
E = similar(C);
F = similar(C);
function math1!(C, A, B)
    @. C = A^2 + B^2 + A * B + A / B - A * B - A / B + A * B + A / B - A * B - A / B
    return C
end
function math2!(D, C)
    @. D = C^2 + C^2 + C * C + C / C - C * C - C / C + C * C + C / C - C * C - C / C
    return D
end
function math3!(E, D)
    @. E = D^2 + D^2 + D * D + D / D - D * D - D / D + D * D + D / D - D * D - D / D
    return E
end
@btime for iter = 1:1000
    math1!($C, $A, $B)
    math2!($D, $C)
    math3!($E, $D)
end
What time do you get with this code?
Key points (there's a small runnable sketch after this list):
If you want to mutate your preallocated arrays C, D, E, you need to pass them to the functions as arguments, and apply a mutating operation to them. C = ... does not mutate, it creates a new variable C. However, @. C = ... mutates because it translates to C .= ..., which is syntax for in-place broadcasting.
The macro @. goes in front of an expression to insert dots (i.e., fused broadcasting) at every subexpression. When you use it, you don’t have to insert any dots manually. (I’ve never seen it placed in front of the entire method definition before, not sure what it would do there.)
When benchmarking with @btime you should interpolate global variables with $, otherwise the expression can’t be type inferred and compiled to efficient code.
Minor point: you don’t need to terminate lines with a semicolon in Julia.
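To make these points concrete, here is a minimal CPU sketch (my own toy example; the helper names rebind/mutate! and the array sizes are arbitrary):
using BenchmarkTools
A = rand(10, 10); B = rand(10, 10); C = zeros(10, 10)
rebind(C, A, B) = (C = A .+ B; C)    # rebinds the local name C; the caller's array is untouched
mutate!(C, A, B) = (C .= A .+ B; C)  # in-place broadcast; overwrites the caller's array
@macroexpand @. C = A^2 + B^2        # returns :(C .= A .^ 2 .+ B .^ 2): fully dotted and in-place
@btime mutate!($C, $A, $B);          # $ interpolates the globals so @btime sees concrete types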
Dear Daniel,
Many thanks for your response. I tried a very similar code modification before and it gave the same result. The code as revised by you produces this output: 10.512 s (1698000 allocations: 26.64 MiB)
As I understand it, Julia doesn't like long formulas. Previously, I ran another performance test that used no more than 3 operations (a sum and two squares), and in that case Julia significantly outperformed MATLAB.
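For reference, that earlier test was roughly of this shape (a reconstruction for illustration, not the exact code):
using CUDA
A = CUDA.rand(151,151,151); B = CUDA.rand(151,151,151); C = CUDA.zeros(151,151,151)
@. C = A^2 + B^2    # a sum and two squares: just a few fused operations per element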
@rveltz might be onto something. B is never initialized; similar returns uninitialized memory. What did you intend to fill B with before your computations?
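For illustration (my sketch, not code from the original post), the uninitialized B can poison everything downstream:
using CUDA
A = CUDA.rand(151,151,151)
B = similar(A)        # uninitialized: whatever bytes happened to be in GPU memory
any(isnan, A ./ B)    # garbage denominators can yield NaN (or Inf) in the results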
You are right. It should be B = CUDA.rand(151,151,151)
Then there are no more NaN values. However, it only slightly affected the performance. The output is as follows: 9.379 s (1671000 allocations: 26.09 MiB)
Dear Imiq,
My version of the code, run on the CPU, gives this output: 45.033 s (15000 allocations: 656.25 KiB)
The code revised by danielwe gives another result: 47.197 s (33000 allocations: 1.01 MiB)
Then there is an issue with broadcasting fusion. Maybe the * symbols are still performing matrix multiplication? I never fully trust @. because I'm not sure exactly how it works.
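For what it's worth, expanding the macro suggests the * gets dotted too (my quick check):
@macroexpand @. C = A * B    # returns :(C .= A .* B), i.e. elementwise, not matmul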
Then I would probably try to split that long line into a few smaller lines, to see what happens.
Edit: effectively, if you break that long line, it becomes non-allocating on the CPU:
A = rand(51,51,51);
B = rand(51,51,51);
C = zeros(51,51,51);
D = similar(C);
E = similar(C);
F = similar(C);
function math1!(C, A, B)
    @. C = A^2 + B^2 + A * B
    @. C += A / B - A * B - A / B + A * B + A / B - A * B - A / B
    return C
end
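Presumably math2! and math3! can be split the same way; a sketch along the same lines (untested on GPU):
function math2!(D, C)
    @. D = C^2 + C^2 + C * C
    @. D += C / C - C * C - C / C + C * C + C / C - C * C - C / C
    return D
end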