GPU Julia vs GPU Matlab

There does seem to be some problem with inference on these very complicated broadcasting expressions. On the CPU, I get a factor-of-3 speedup & zero allocations by broadcasting one inner function.

I suppose these expressions are made up, but if your real ones have anything like this degree of repetition, removing it really helps. In all, a factor of 20:

julia> @btime for iter in 1:1
           math1!($C, $A, $B)  # original functions, on Arrays of original size
           math2!($D, $C)
           math3!($E, $D)
       end
  69.521 ms (30 allocations: 864 bytes)

julia> function math1i!(C, A, B)
           inner(A,B) = A^2 + B^2 + A * B + A / B - A * B - A / B + A * B + A / B - A * B - A / B
           C .= inner.(A, B)
           return C
       end
# and the others similarly
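Spelled out, the analogous definitions (matching the math2! / math3! bodies quoted in full later in the thread) would be:

julia> function math2i!(D, C)
           inner(C) = C^2 + C^2 + C * C + C / C - C * C - C / C + C * C + C / C - C * C - C / C
           D .= inner.(C)
           return D
       end

julia> function math3i!(E, D)
           inner(D) = D^2 + D^2 + D * D + D / D - D * D - D / D + D * D + D / D - D * D - D / D
           E .= inner.(D)
           return E
       end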

julia> @btime for iter in 1:1  # same loop, now with math1i! etc.
...
  19.776 ms (0 allocations: 0 bytes)

julia> using CommonSubexpressions

julia> function math1c!(C, A, B)
           inner(A,B) = @cse A^2 + B^2 + A * B + A / B - A * B - A / B + A * B + A / B - A * B - A / B
           C .= inner.(A, B)
           return C
       end

julia> @btime for iter in 1:1
...
  3.322 ms (0 allocations: 0 bytes)
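For readers without the package: a hand-written equivalent of what @cse achieves here (my sketch, not benchmarked) hoists the repeated products and quotients into locals, so each is computed once per element instead of four times:

julia> function math1m!(C, A, B)
           function inner(A, B)
               ab = A * B  # computed once instead of four times
               ad = A / B  # likewise
               A^2 + B^2 + ab + ad - ab - ad + ab + ad - ab - ad
           end
           C .= inner.(A, B)
           return C
       end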

Trying those on a GPU, the effect of avoiding a very complicated Broadcasted object is strong, but the effect of CSE is weak. (Extra arithmetic is roughly free while you are making a single pass over the same memory.)

julia> @btime CUDA.@sync for iter in 1:1000
           math1!($C, $A, $B)  # from above
           math2!($D, $C)
           math3!($E, $D)
       end
  697.650 ms (1698001 allocations: 26.64 MiB)

julia> @btime CUDA.@sync for iter in 1:1000
           math1i!($C, $A, $B)  # simpler broadcast
           math2i!($D, $C)
           math3i!($E, $D)
       end
  186.972 ms (156001 allocations: 2.56 MiB)

julia> @btime CUDA.@sync for iter in 1:1000
           math1c!($C, $A, $B)  # with CSE
           math2c!($D, $C)
           math3c!($E, $D)
       end
  179.948 ms (156001 allocations: 2.56 MiB)

julia> CUDA.device()
CuDevice(0): Tesla V100-PCIE-16GB

I think “short formulas” here means more separate broadcasts. That will need more memory (or more pre-allocated containers) and making multiple passes is usually slower.
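A toy example of that cost (my sketch, not one of the benchmarks above): even with a pre-allocated scratch array, splitting one fused broadcast in two means an extra pass over the data:

julia> tmp = similar(C);  # pre-allocated scratch

julia> @. tmp = A * B;    # pass 1 over the data

julia> @. C = tmp + A / B;  # pass 2, re-reads tmp

julia> @. C = A * B + A / B;  # fused: one pass, no scratch needed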

However, your “long formulas” appear to be long enough to cause inference to give up, or something? The fact that they do not run with zero allocations on the CPU is a bad sign. This is a separate problem, and it can probably be solved by asking less of the broadcasting machinery.
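A quick check for that (my suggestion): time each in-place function on its own; a well-inferred broadcast into a pre-allocated array reports 0 allocations, and anything nonzero per call means inference has given up somewhere.

julia> @btime math1!($C, $A, $B);   # long version: expect nonzero allocations here

julia> @btime math1i!($C, $A, $B);  # inner-function version: expect 0 allocations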

Edit, I see line-breaking was discussed a bit above too:

Splitting in two avoids the inference/allocation problem, but doesn’t add much speed, as we need to go over the data twice. (I didn’t try on GPU though, CPU details below.)

julia> function math1lmiq!(C, A, B)  # two-pass function, suggested by @lmiq above
           @. C = A^2 + B^2 + A * B 
           # correction, this needs to be += (and .+= has its own performance headaches)
           @. C += A / B - A * B - A / B + A * B + A / B - A * B - A / B
           return C
       end
math1lmiq! (generic function with 1 method)

julia> function math2lmiq!(D, C)  # same idea
           @. D = C^2 + C^2 + C * C + C / C - C * C - C / C
           @. D += C * C + C / C - C * C - C / C
           return D
       end
math2lmiq! (generic function with 1 method)

julia> function math3lmiq!(E, D)  # same idea
           @. E = D^2 + D^2 + D * D + D / D - D * D - D / D  
           @. E += D * D + D / D - D * D - D / D
           return E
       end
math3lmiq! (generic function with 1 method)

julia> @btime for iter = 1:1
           math1lmiq!($C, $A, $B)
           math2lmiq!($D, $C)
           math3lmiq!($E, $D)
       end
  55.559 ms (0 allocations: 0 bytes)
4 Likes

Fancy, this gives 364 ms for me, compared to 1.5 s before:

 using CUDA, BenchmarkTools
 
 function math1!(C, A, B)
     inner(A, B) = A^2 + B^2 + A * B + A / B - A * B - A / B + A * B + A / B - A * B - A / B
     C .= inner.(A, B) 
     return C
 end
 
 function math2!(D, C)
     inner(C) = C^2 + C^2 + C * C + C / C - C * C - C / C + C * C + C / C - C * C - C / C
     D .= inner.(C) 
     return D
 end
 
 function math3!(E, D)
     inner(D) = D^2 + D^2 + D * D + D / D - D * D - D / D + D * D + D / D - D * D - D / D
     E .= inner.(D)   
     return E
 end
 
 function f() 
     A = CUDA.rand(151,151,151) .+ 1;
     B = CUDA.rand(151,151,151) .+ 1;
     C = CUDA.zeros(151,151,151) .+ 1;
     D = similar(C);
     E = similar(C);
     F = similar(C);
     @btime CUDA.@sync begin for iter = 1:1000
             math1!($C, $A, $B)
             math2!($D, $C)
             math3!($E, $D)
         end 
         sum($E) 
     end         
 end
 
 f()
6 Likes

My hunch is that dot fusion breaks down when there are too many dots, similar to how overly long tuples can break type inference.
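To make that concrete (my illustration): each dotted operation wraps one more layer around the lazy Broadcasted object that fusion builds before materialize! runs it, so a formula with dozens of dots hands inference a deeply nested type:

julia> A = rand(3); B = rand(3);

julia> bc = Broadcast.broadcasted(+, Broadcast.broadcasted(*, A, B), Broadcast.broadcasted(/, A, B));  # what A .* B .+ A ./ B builds

julia> typeof(bc)  # a Broadcasted nested three deep; the long formulas above nest tens of layers deep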

4 Likes

Frankly, I am astonished by Matlab. For example, this expression:

C = A.^2 + B.^2 + A .* B + A ./ B - A .* B - A ./ B + A .* B + A ./ B - A .* B - A ./ B;

Is it able to make a kernel for that? Does this kernel fuse all the operations, or does it run many successive kernels? We have metaprogramming to extract the function; how is Matlab able to do that?

If it does not, then a lot of temporaries need to be created, and that will kill performance on the GPU.
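Spelled out, un-fused evaluation of just the first terms would look like this (a sketch of what eager, op-at-a-time evaluation implies, not a claim about what Matlab actually does):

t1 = A .^ 2    # new temporary + one kernel launch
t2 = B .^ 2    # another temporary + another launch
t3 = t1 .+ t2  # another temporary + another launch
# ...and so on, one temporary and one launch per remaining op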

1 Like

Dear mcabbott,
Many thanks for this clear explanation!

I got a similar speedup:
9.324 s (1671000 allocations: 26.09 MiB)
vs
1.978 s (156116 allocations: 2.57 MiB)
Like I said before, Julia is hard :face_with_diagonal_mouth:

2 Likes

But still, Julia is slower than Matlab. Let’s make the comparison a bit fairer and slightly optimize the Matlab code too. The revised version is as follows:

A = (single(rand(151,151,151,'gpuArray')));
B = (single(rand(151,151,151,'gpuArray')));
C = (single(zeros(151,151,151,'gpuArray')));
D = (single(zeros(151,151,151,'gpuArray')));
E = (single(zeros(151,151,151,'gpuArray')));
tic
for iter = 1:1000
    [C, D, E] = arrayfun(@math, A, B);
end
toc
function [C, D, E] = math(A,B)
    C = A.^2 + B.^2 + A .* B + A ./ B - A .* B - A ./ B + A .* B + A ./ B - A .* B - A ./ B;
    D = C.^2 + C.^2 + C .* C + C ./ C - C .* C - C ./ C + C .* C + C ./ C - C .* C - C ./ C;
    E = D.^2 + D.^2 + D .* D + D ./ D - D .* D - D ./ D + D .* D + D ./ D - D .* D - D ./ D;
end

Now it takes 0.98 s in Matlab vs 1.978 s in Julia on my GTX 960.

2 Likes

Which function are you measuring in Julia now?

What do these lines do? (Referring to the arrayfun call and the function [C, D, E] = math(A, B) line.)

FWIW it seems like all the in-place stuff doesn’t actually give any performance advantage here. On my machine (RTX 3070),

function math(A, B)
    math1(A, B) = A^2 + B^2 + A * B + A / B - A * B - A / B + A * B + A / B - A * B - A / B
    math2(C) = C^2 + C^2 + C * C + C / C - C * C - C / C + C * C + C / C - C * C - C / C
    math3(D) = D^2 + D^2 + D * D + D / D - D * D - D / D + D * D + D / D - D * D - D / D
    
    C = math1.(A, B)
    D = math2.(C)
    E = math3.(D)
    (C,D,E)
end

function g() 
    A = CUDA.rand(151,151,151) .+ 1;
    B = CUDA.rand(151,151,151) .+ 1;
    @btime let A=$A, B=$B, E
        for iter = 1:1000
            (C,D,E) = math(A, B)
        end
    end         
end

is just as fast as the one doing in-place mutation (175.470 ms).

@Alex90 are you sure the Julia version you benchmarked isn’t including the sum? I know the 3070 is a lot faster than your 960, but I’d be surprised if it was 10x faster.

The first line says that the call arrayfun(@math, A, B) returns the three outputs [C, D, E], and the second line says that the local function math(A, B) is defined with those return arguments (@math is a handle to it).

It does the same thing as the code I wrote in the comment above this one.

2 Likes

I ran roflmaostc’s code (the f() benchmark quoted in full above).

My Julia version is 1.11.1, and I run the code in an IJulia notebook. I executed your code and got the following:
4.412 s (171004 allocations: 3.20 MiB)
I think it is not surprising that an RTX 3070 significantly outperforms a GTX 960.

It is like Mason explained. One note is the arrayfun function: it speeds up the computations by working element-wise, processing each element of the input arrays. Actually, I don’t fully understand how it works at the low level, but in my main program it speeds up the computations by a factor of 1.6. Without arrayfun, the code would be similar to the Julia version:
[C, D, E] = math(A,B);
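For comparison, the closest Julia analogue of that arrayfun call (my sketch, with a hypothetical stand-in function) is broadcasting a scalar function that returns a tuple, so one fused kernel produces all three outputs:

fun(a, b) = (a + b, a * b, a / b)  # hypothetical scalar stand-in for math
CDE = fun.(A, B)                   # one fused kernel; CDE is an array of 3-tuples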

1 Like

Note that in the code you’re running there, you’re calling sum(E) at the end.

I doubt that makes a big difference here since it’s outside the for loop.

Interesting that you see such a big perf difference where I don’t. I wonder if we’re using a driver version on the Julia side that doesn’t support your GPU very well, whereas Matlab chose a better driver for your GPU?

It’s an old enough GPU that I wouldn’t be surprised if this was just a support problem. I’d love to try benchmarking the Matlab code on my machine, but I’m certainly not paying Mathworks a dime.

When using CUDA.versioninfo() in Julia, I got this output:

CUDA runtime 12.6, artifact installation
CUDA driver 12.2
NVIDIA driver 536.23.0

CUDA libraries:
- CUBLAS: 12.6.3
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+536.23

Julia packages:
- CUDA: 5.5.2
- CUDA_Driver_jll: 0.10.3+0
- CUDA_Runtime_jll: 0.15.3+0

Toolchain:
- Julia: 1.11.1
- LLVM: 16.0.6

1 device:
  0: NVIDIA GeForce GTX 960 (sm_52, 2.149 GiB / 4.000 GiB available)

When using gpuDevice() in Matlab, the output is as follows:

                      Name: 'NVIDIA GeForce GTX 960'
                     Index: 1
         ComputeCapability: '5.2'
            SupportsDouble: 1
     GraphicsDriverVersion: '536.23'
               DriverModel: 'WDDM'
            ToolkitVersion: 11.8000
        MaxThreadsPerBlock: 1024
          MaxShmemPerBlock: 49152 (49.15 KB)
        MaxThreadBlockSize: [1×3 double]
               MaxGridSize: [1×3 double]
                 SIMDWidth: 32
               TotalMemory: 4294836224 (4.29 GB)
           AvailableMemory: 3449489819 (3.45 GB)
               CachePolicy: 'balanced'
       MultiprocessorCount: 8
              ClockRateKHz: 1266000
               ComputeMode: 'Default'
      GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 1
          CanMapHostMemory: 1
           DeviceSupported: 1
           DeviceAvailable: 1
            DeviceSelected: 1

Hm, interesting. I don’t know enough about GPU stuff to debug this further, unfortunately. Quite curious to know what’s going on, though!

My guess here is that both slow paths perform the whole broadcast to write C before starting the one to write D, etc., while the fast path fuses them to write all three arrays at once. The dots in the math function are then misleading, since it is acting on scalars.

My GPU isn’t having a good day, but this variant of Mason’s code above might work? See e.g. julia#22129 for more on this “broadcast to multiple arrays” idea, and StructArrays.jl (I think discussed here recently too, but I can’t find the thread).

julia> function math_sa(A, B)
           math1(A, B) = A^2 + B^2 + A * B + A / B - A * B - A / B + A * B + A / B - A * B - A / B
           math2(C) = C^2 + C^2 + C * C + C / C - C * C - C / C + C * C + C / C - C * C - C / C
           math3(D) = D^2 + D^2 + D * D + D / D - D * D - D / D + D * D + D / D - D * D - D / D
           function fun(A, B)  # also acts on scalars not arrays
             C = math1(A, B)
             D = math2(C)
             E = math3(D)
             (C,D,E)
           end
          # StructArrays.components(StructArray(Iterators.map(fun, A, B))) 
          bc = Broadcast.instantiate(Broadcast.broadcasted(fun, A, B))  # lazy
          StructArrays.components(StructArray(bc))  # writes to 3 arrays at once
       end
2 Likes

StructArrays.jl unfortunately does not play well with CUDA yet. Someone would need to write some specialized methods for it.
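One workaround sketch (mine, untested on GPU) that avoids StructArrays entirely: broadcast a tuple-returning scalar function into a plain array of tuples, then peel the components off with getindex broadcasts:

julia> fun(a, b) = (a + b, a * b, a / b);  # hypothetical tuple-returning scalar function

julia> CDEs = fun.(A, B);  # one fused kernel, array of 3-tuples

julia> C = getindex.(CDEs, 1);  # extra cheap passes, but no recomputation

julia> D = getindex.(CDEs, 2);

julia> E = getindex.(CDEs, 3);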

1 Like

I do think GPU arrays are meant to work, but can be fragile. After fiddling a bit… note that some approaches above may produce Core.Box as a result of re-using the name A for different things… here’s a version which runs:

julia> function math_sa_3(As, Bs)
         function inner(AB::NamedTuple)
           A = AB.A  # A isa Float32
           B = AB.B
           C = A^2 + B^2 + A * B + A / B - A * B - A / B + A * B + A / B - A * B - A / B
           D = C^2 + C^2 + C * C + C / C - C * C - C / C + C * C + C / C - C * C - C / C
           E = D^2 + D^2 + D * D + D / D - D * D - D / D + D * D + D / D - D * D - D / D
           (C,D,E)
         end
         ABs = StructArray(A=As, B=Bs)  # just wrapping, not allocation
         CDEs = inner.(ABs)  # makes a new StructArray{Tuple}
         StructArrays.components(CDEs)  # just un-wrapping
       end
math_sa_3 (generic function with 1 method)

julia> let
           A = CUDA.rand(151,151,151) .+ 1;
           B = CUDA.rand(151,151,151) .+ 1;
           @btime let A=$A, B=$B, E
               for iter = 1:1000
                   (C,D,E) = math(A, B)  # from Mason, https://discourse.julialang.org/t/gpu-julia-vs-gpu-matlab/122663/50
               end
           end
           @btime let A=$A, B=$B, E
               for iter = 1:1000
                   (C,D,E) = math_sa_3(A, B)
               end
           end
       end
  123.622 ms (171000 allocations: 3.20 MiB)
  54.815 ms (108000 allocations: 3.43 MiB)

Haven’t thought much about whether this benchmark is honest or not. Including sum(E) (which synchronises, I think) after every C,D,E = math... call slows both down to around 330 ms, but perhaps that’s just the cost of 1000 syncs. Maybe a factor-of-3 improvement sounds about right for combining 3 memory-bound loops into one?
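One way to tighten the comparison (untested sketch, reusing the CUDA.@sync pattern from earlier in the thread) is to synchronise once per timed run rather than after every iteration, so the timing covers kernel work rather than 1000 round-trips:

julia> @btime CUDA.@sync for iter in 1:1000
           C, D, E = math_sa_3($A, $B)
       end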

Code from here with math1!(C, A, B) etc. takes 187.309 ms for me.

3 Likes