FWIW it seems like all the in-place stuff doesn’t actually give any performance advantage here. On my machine (RTX 3070),
using CUDA, BenchmarkTools

function math(A, B)
    math1(A, B) = A^2 + B^2 + A * B + A / B - A * B - A / B + A * B + A / B - A * B - A / B
    math2(C) = C^2 + C^2 + C * C + C / C - C * C - C / C + C * C + C / C - C * C - C / C
    math3(D) = D^2 + D^2 + D * D + D / D - D * D - D / D + D * D + D / D - D * D - D / D
    # out-of-place: each broadcast allocates a fresh CuArray
    C = math1.(A, B)
    D = math2.(C)
    E = math3.(D)
    (C, D, E)
end

function g()
    A = CUDA.rand(151, 151, 151) .+ 1
    B = CUDA.rand(151, 151, 151) .+ 1
    @btime let A = $A, B = $B, E
        for iter = 1:1000
            (C, D, E) = math(A, B)
        end
    end
end
is just as fast as the one doing in-place mutation (175.470 ms).
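For context, the in-place version I compared against was along these lines. This is only a rough sketch; the `math!` name and the preallocated buffers `C`, `D`, `E` are my own naming, not necessarily exactly what was posted upthread:

```julia
using CUDA

# Sketch of an in-place variant: the preallocated buffers C, D, E are reused,
# so each broadcast writes into existing GPU memory instead of allocating new
# CuArrays. Names (math!, C, D, E) are illustrative.
function math!(C, D, E, A, B)
    C .= A .^ 2 .+ B .^ 2 .+ A .* B .+ A ./ B .- A .* B .- A ./ B .+ A .* B .+ A ./ B .- A .* B .- A ./ B
    D .= C .^ 2 .+ C .^ 2 .+ C .* C .+ C ./ C .- C .* C .- C ./ C .+ C .* C .+ C ./ C .- C .* C .- C ./ C
    E .= D .^ 2 .+ D .^ 2 .+ D .* D .+ D ./ D .- D .* D .- D ./ D .+ D .* D .+ D ./ D .- D .* D .- D ./ D
    return nothing
end

A = CUDA.rand(151, 151, 151) .+ 1
B = CUDA.rand(151, 151, 151) .+ 1
C, D, E = similar(A), similar(A), similar(A)
math!(C, D, E, A, B)   # mutates the buffers; no new allocations per call
```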
@Alex90 are you sure the Julia version you benchmarked isn't including the sum? I know the 3070 is a lot faster than your 960, but I'd be surprised if it was 10x faster.
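For what it's worth, CUDA kernel launches are asynchronous, so whether the final sum sits inside the timed region changes what `@btime` actually measures. A minimal sketch of how I'd force the GPU work to finish before the timer stops, reusing the `math` function from above (the exact shape of your benchmark is just my guess):

```julia
using CUDA, BenchmarkTools

A = CUDA.rand(151, 151, 151) .+ 1
B = CUDA.rand(151, 151, 151) .+ 1

# Including the reduction pulls a scalar back to the CPU, so the whole
# pipeline has to finish inside the timed region ...
@btime sum(last(math($A, $B)))

# ... as does CUDA.@sync, which blocks until the device is idle.
@btime CUDA.@sync math($A, $B)
```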