GPU Julia vs GPU Matlab

FWIW it seems like all the in-place stuff doesn’t actually give any performance advantage here. On my machine (RTX 3070),

using CUDA, BenchmarkTools

function math(A, B)
    # Scalar kernels; each broadcast below fuses one of them into a single GPU kernel
    math1(A, B) = A^2 + B^2 + A * B + A / B - A * B - A / B + A * B + A / B - A * B - A / B
    math2(C) = C^2 + C^2 + C * C + C / C - C * C - C / C + C * C + C / C - C * C - C / C
    math3(D) = D^2 + D^2 + D * D + D / D - D * D - D / D + D * D + D / D - D * D - D / D

    # Each broadcast allocates a fresh output CuArray
    C = math1.(A, B)
    D = math2.(C)
    E = math3.(D)
    (C, D, E)
end

function g()
    A = CUDA.rand(151, 151, 151) .+ 1
    B = CUDA.rand(151, 151, 151) .+ 1
    @btime let A = $A, B = $B, E
        for iter = 1:1000
            (C, D, E) = math(A, B)
        end
    end
end

is just as fast as the one doing in-place mutation (175.470 ms).
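For anyone following along, the in-place version I'm comparing against looks roughly like this. It's only a sketch with my own naming (math! and g_inplace aren't from the thread), i.e. the obvious .= rewrite of the code above with preallocated buffers:

# In-place variant for comparison: write into preallocated CuArrays via broadcast assignment
function math!(C, D, E, A, B)
    # Same scalar kernels as in math above
    math1(a, b) = a^2 + b^2 + a * b + a / b - a * b - a / b + a * b + a / b - a * b - a / b
    math2(c) = c^2 + c^2 + c * c + c / c - c * c - c / c + c * c + c / c - c * c - c / c
    math3(d) = d^2 + d^2 + d * d + d / d - d * d - d / d + d * d + d / d - d * d - d / d

    C .= math1.(A, B)   # .= writes into the existing CuArray instead of allocating
    D .= math2.(C)
    E .= math3.(D)
    nothing
end

function g_inplace()
    A = CUDA.rand(151, 151, 151) .+ 1
    B = CUDA.rand(151, 151, 151) .+ 1
    C, D, E = similar(A), similar(A), similar(A)   # allocate once, reuse every iteration
    @btime let A = $A, B = $B, C = $C, D = $D, E = $E
        for iter = 1:1000
            math!(C, D, E, A, B)
        end
    end
end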

@Alex90 are you sure the Julia version you benchmarked isn't including the sum? I know the 3070 is a lot faster than your 960, but I'd be surprised if it were 10x faster.
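One quick way to check would be timing the kernels with and without a reduction. This is just a guess at what "the sum" refers to (a sum over E), and CUDA.@sync is there so the timer waits for the GPU to finish:

A = CUDA.rand(151, 151, 151) .+ 1
B = CUDA.rand(151, 151, 151) .+ 1

@btime CUDA.@sync math($A, $B)            # kernels only
@btime CUDA.@sync sum(math($A, $B)[3])    # kernels plus a reduction over E

If the second number is close to what you reported, the discrepancy is probably the reduction rather than the GPUs.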