Why is Julia much slower than MATLAB on GPU computing?

In my original study, I needed to update the variable.

Back to this minimal example: I tried using V1 = V0 instead of copying, but I still get performance similar to before.

Only takes 0.5s on my RTX6000, which is faster than your RTX3090 but not 20x.

You should provide additional information for people to be able to help you, e.g., the CUDA.jl version (by showing CUDA.versioninfo()), running under CUDA.@time and CUDA.@profile to provide some minimal timing information, etc. Also try to use CUDA.jl#master. Example output here:

julia> CUDA.@time main()
  0.529043 seconds (20.34 k CPU allocations: 513.009 MiB, 0.79% gc time) (502 GPU allocations: 50.537 GiB, 63.05% memmgmt time)

julia> CUDA.@profile main()
Profiler ran for 536.75 ms, capturing 8100 events.

Host-side activity: calling CUDA APIs took 330.32 ms (61.54% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬─────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                    │
├──────────┼────────────┼───────┼──────────────────────────────────────┼─────────────────────────┤
│   61.54% │  330.32 ms │     2 │ 165.16 ms ± 233.57 (   0.0 ‥ 330.32) │ cuStreamSynchronize     │
│    8.75% │   46.97 ms │     1 │                                      │ cuMemcpyHtoDAsync       │
│    0.30% │    1.61 ms │   505 │   3.18 µs ± 19.6   (  0.95 ‥ 395.54) │ cuMemAllocFromPoolAsync │
│    0.30% │    1.59 ms │   501 │   3.17 µs ± 1.24   (  2.38 ‥ 20.03)  │ cuLaunchKernel          │
│    0.10% │  537.63 µs │   460 │   1.17 µs ± 0.47   (  0.72 ‥ 9.06)   │ cuMemFreeAsync          │
│    0.01% │   43.39 µs │     3 │  14.46 µs ± 4.69   (  9.06 ‥ 17.4)   │ cuMemGetInfo            │
│    0.00% │   13.11 µs │     2 │   6.56 µs ± 3.54   (  4.05 ‥ 9.06)   │ cuCtxSynchronize        │
│    0.00% │    1.43 µs │     6 │ 238.42 ns ± 150.79 (   0.0 ‥ 476.84) │ cuMemPoolGetAttribute   │
│    0.00% │  715.26 ns │     9 │  79.47 ns ± 119.21 (   0.0 ‥ 238.42) │ cuDriverGetVersion      │
└──────────┴────────────┴───────┴──────────────────────────────────────┴─────────────────────────┘

Device-side activity: GPU was busy for 405.64 ms (75.57% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                  │ Name                                                                                                                                          ⋯
├──────────┼────────────┼───────┼────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│   43.87% │  235.48 ms │   100 │   2.35 ms ± 0.05   (  2.25 ‥ 2.44) │ _Z22partial_mapreduce_grid8identity9reductionI6islessE5TupleI7Float645Int64E16CartesianIndicesILi2ES2_I5OneToIS4_ES6_IS4_EEES5_ILi2ES2_IS6_IS ⋯
│   24.17% │  129.71 ms │    99 │   1.31 ms ± 0.01   (  1.29 ‥ 1.32) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float64Li2ELi1EE11BroadcastedI12CuArrayStyleILi2EE5TupleI5OneToI5Int64ES5_IS6_EE1_S4_I8 ⋯
│    7.19% │    38.6 ms │     1 │                                    │ [copy pageable to device memory]                                                                                                              ⋯
│    0.24% │    1.31 ms │     1 │                                    │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float64Li2ELi1EE11BroadcastedI12CuArrayStyleILi2EE5TupleI5OneToI5Int64ES5_IS6_EE1_S4_I8 ⋯
│    0.04% │  193.36 µs │   100 │   1.93 µs ± 0.18   (  1.67 ‥ 2.38) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float64Li2ELi1EE11BroadcastedI12CuArrayStyleILi2EE5TupleI5OneToI5Int64ES5_IS6_EE3_31S4_ ⋯
│    0.04% │  191.21 µs │   100 │   1.91 µs ± 0.17   (  1.67 ‥ 2.15) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI14CartesianIndexILi2EELi2ELi1EE11BroadcastedI12CuArrayStyleILi2EE5TupleI5OneToI5Int64ES5 ⋯
│    0.03% │  142.34 µs │    99 │   1.44 µs ± 0.18   (  1.19 ‥ 1.67) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float64Li2ELi1EE11BroadcastedI12CuArrayStyleILi2EE5TupleI5OneToI5Int64ES5_IS6_EE1_S4_IS ⋯
│    0.00% │    1.43 µs │     1 │                                    │ _Z2_615CuKernelContext13CuDeviceArrayI7Float64Li1ELi1EES1_                                                                                    ⋯
│    0.00% │    1.43 µs │     1 │                                    │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float64Li1ELi1EE11BroadcastedI12CuArrayStyleILi1EE5TupleI5OneToI5Int64EE2_9I1_ES4_IS1_8 ⋯
└──────────┴────────────┴───────┴────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Regarding the memory usage: Julia, being a GC’d language, will always consume more memory, but the large difference is likely caused by CUDA.jl using a memory pool (which means freed objects are not visible as freed memory; use CUDA.memory_status() if you want to differentiate between used and cached memory).
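A minimal sketch of collecting the diagnostics mentioned above in one session. It assumes a working CUDA.jl installation (5.1 or later for the integrated profiler) and a functional GPU; `workload` is a hypothetical stand-in for the function being benchmarked, not code from this thread.

```julia
using CUDA

# Hypothetical stand-in for the function under test.
function workload(N)
    x = CUDA.rand(Float64, N, N)
    V = CUDA.ones(Float64, N)
    return findmax(x .+ 0.5 .* V', dims=2)
end

if CUDA.functional()
    CUDA.versioninfo()                     # driver, toolkit, and package versions
    workload(16)                           # warm-up so compilation is not timed
    CUDA.@time workload(1024)              # wall time plus CPU and GPU allocation counts
    display(CUDA.@profile workload(1024))  # host-side API calls and device-side kernels
    CUDA.memory_status()                   # memory in use versus memory held by the pool
    GC.gc(); CUDA.reclaim()                # free dead arrays, then release cached pool memory
    CUDA.memory_status()
end
```

Comparing the two `CUDA.memory_status()` reports should show how much of the apparent usage was only cached by the pool.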

Have you checked that the Matlab example uses Float64 as well? Usually, consumer GPUs are much slower for F64 than F32.

Try removing the @show?

I doubt I can help with the performance, but I have a couple questions about the translation:

  1. idx = zeros(N, 1); in MATLAB would be idx = CartesianIndex(ntuple(i->0, N)) in Julia, not idx = (). But should it be that in either language? What is idx in [V0, idx] = max(x + a * V1', [], 2);? The docs don’t quite cover that usage, only the "linear" option which gives an integer scalar.
  2. V1 = V0; in MATLAB looks like it should be V1=V0 in Julia, not V1=copy(V0).
  3. Have you benchmarked it without disp/@show? Printing in tight loops tends to obscure the number crunching in benchmarks quite a bit.
  4. Skimming the docs, ones(n,1,"gpuArray") is suggested. Does gpuArray have a ones member method to do the same thing with gpuArray.ones(N, 1)? Just checking.

Oh and try to use a scratch array to store the intermediate result:

julia> function main(N)
           x = CuArray(DGP(N))
           V0 = CUDA.ones(Float64, N); idx = ()
           a = 0.5
           max_iter = 100
           iter = 0
           tmp = x .+ a * V0'
           while iter < max_iter
               V1 = V0
               tmp .= x .+ a * V1'
               V0, idx = findmax(tmp, dims=2)
               iter += 1
           end
           return V0, idx, iter
       end

That should get rid of most of the memory management time.
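To see why the scratch array helps, here is a CPU-only sketch that measures allocations with and without it. The function names `step_alloc` and `step_inplace!` are made up for illustration, and it uses plain `Array`s so it runs without a GPU; the assumption is that the same pattern carries over to `CuArray`s, where each avoided allocation is a GPU allocation.

```julia
# Allocating version: builds a fresh N×N array on every call.
function step_alloc(x, V, a)
    tmp = x .+ a * V'
    return findmax(tmp, dims=2)
end

# In-place version: overwrites a preallocated scratch array.
function step_inplace!(tmp, x, V, a)
    tmp .= x .+ a * V'
    return findmax(tmp, dims=2)
end

N = 256
x = reshape(collect(range(0, 1, length=N^2)), N, N)
V = ones(N)
tmp = similar(x)

# Warm up both functions so compilation is not counted.
step_alloc(x, V, 0.5)
step_inplace!(tmp, x, V, 0.5)

bytes_alloc   = @allocated step_alloc(x, V, 0.5)
bytes_inplace = @allocated step_inplace!(tmp, x, V, 0.5)
println("allocating: ", bytes_alloc, " bytes; in-place: ", bytes_inplace, " bytes")
```

The in-place version still allocates the small `a * V'` row and the `findmax` outputs, but not the N×N intermediate.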

At first glance, a * V1' and findmax should still allocate. Maybe use V0 and some sort of index array as intermediates in mul! and findmax!? findmax! seems to only support linear indexing, though. Scratch that: pass in a ::Vector{CartesianIndex{2}}; a ::Vector{Int} doesn’t even work.
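A CPU-only sketch of that idea using Base's `findmax!` with preallocated output buffers, plus a fused `a .* V'` so the scaled row is never materialized. This shows the behavior on plain `Array`s; whether CUDA.jl provides an equally efficient `findmax!` for `CuArray`s is not established here and would need to be checked.

```julia
N = 4
x = reshape(collect(range(0, 1, length=N^2)), N, N)
V = ones(N, 1)                 # same N×1 shape that findmax(...; dims=2) returns

tmp  = similar(x)                               # N×N scratch array
vals = similar(x, N, 1)                         # buffer for the row maxima
inds = Matrix{CartesianIndex{2}}(undef, N, 1)   # index buffer; eltype matches keys(tmp)

tmp .= x .+ 0.5 .* V'          # fully fused broadcast, written into tmp
findmax!(vals, inds, tmp)      # fills vals and inds in place

# Reference result from the allocating call, for comparison.
ref_vals, ref_inds = findmax(tmp, dims=2)
println(vals == ref_vals && inds == ref_inds)
```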

Interesting, I will give it a try.

The CUDA.jl Performance tips manual also mentions that using Int32 can be faster than using Int64: Performance Tips · CUDA.jl

Somewhat complicated by double (our Float64) being MATLAB’s default numeric type, even for written integers.

Yes, it’s double. In my study, I use a while loop that stops on an error criterion of 1e-10, so I need Float64 rather than Float32 for the higher precision.
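A quick check of why a 1e-10 tolerance rules out Float32: the spacing between adjacent Float32 values near 1.0 is about 1.2e-7, so a change of 1e-10 cannot be represented. The variable names below are only for illustration.

```julia
tol = 1e-10

println(eps(Float32))   # spacing of Float32 values near 1.0
println(eps(Float64))   # spacing of Float64 values near 1.0

# Adding the tolerance to 1 is lost entirely in Float32 but not in Float64.
lost_in_f32 = 1.0f0 + Float32(tol) == 1.0f0
lost_in_f64 = 1.0 + tol == 1.0
println("lost in Float32: ", lost_in_f32, ", lost in Float64: ", lost_in_f64)
```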

I still get a similar speed.

Moreover, I find that around the 75th iteration the program pauses to clean up memory, and then it runs the remaining iterations.

  1. Actually, in my original study, I used a while loop and stopped it using an error criterion. To simplify, I revised it as a for loop in the MWE.
    In MATLAB, idx = zeros(N, 1); can be deleted, and idx is still printed after the loop finishes. In contrast, Julia needs the variable declared outside the loop; otherwise, it cannot be printed afterwards.
    For max in MATLAB and findmax in Julia, I use them to print out “what” and “where” is the largest value in dims=2 at the same time.

  2. Yes, in my original study, I needed to use copy to ensure V1 and V0 separately exist at the same time, and then find the error to check whether to stop the while loop. For the MWE, copy seems useless.

  3. I tried running a version without any show or println for the iteration info. The Julia program still runs at a similar speed.

  4. ones(n,1,"gpuArray") and gpuArray.ones(n, 1) are the same in MATLAB 2023a.
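On point 2, a small CPU sketch of when `copy` matters. `V1 = V0` only binds a second name to the same array. That is harmless when `V0` is later rebound to a new array (as `findmax` does in the MWE), but it breaks an error check if `V0` is updated in place. The `err_*` names are illustrative only.

```julia
V0 = ones(3)
V1 = V0                      # no copy: both names refer to the same array
same_object = V1 === V0

V0 = 2 .* V0                 # rebinds V0 to a new array; V1 keeps the old values
err_rebind = maximum(abs.(V0 .- V1))

V1 = V0                      # alias again
V0 .*= 2                     # in-place update is visible through V1 as well
err_alias = maximum(abs.(V0 .- V1))

V1 = copy(V0)                # independent snapshot
V0 .*= 2
err_copy = maximum(abs.(V0 .- V1))

println((same_object, err_rebind, err_alias, err_copy))
```

With in-place updates, the aliased version reports an error of zero and a convergence loop would stop immediately, which is the case where `copy` is needed.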

You can write local idx outside the loop rather than initializing it to a value of a different type, which makes the variable type-unstable.
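A CPU-only sketch of the difference, with made-up function names. In the first version `idx` starts as an empty tuple and later holds a matrix of `CartesianIndex`, so its type changes; in the second it is declared with `local` and only ever holds the `findmax` result.

```julia
# idx begins as Tuple{} and is later rebound to a Matrix{CartesianIndex{2}}.
function last_index_unstable(A, n)
    idx = ()
    for _ in 1:n
        _, idx = findmax(A, dims=2)
    end
    return idx
end

# idx is declared without a value, so it only ever has one type.
# Note: if n == 0, idx is never assigned and returning it throws UndefVarError.
function last_index_local(A, n)
    local idx
    for _ in 1:n
        _, idx = findmax(A, dims=2)
    end
    return idx
end

A = [1.0 3.0; 4.0 2.0]
println(last_index_local(A, 3))
```

Running `@code_warntype` on each function is one way to inspect how the type of `idx` is inferred.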

Thanks. I now get a speed similar to MATLAB's. In addition, the memory usage is far lower than in the original Julia MWE and is similar to that of the MATLAB MWE.

For what it’s worth, your code runs in 0.87 seconds for me on my RTX 3070 without me making any modifications.

Here’s my version info for julia:

julia> versioninfo()
Julia Version 1.10.0-rc1
Commit 5aaa948543 (2023-11-03 07:44 UTC)

Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 12 × AMD Ryzen 5 5600X 6-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
  Threads: 12 on 12 virtual cores
Environment:
  JULIA_NUM_THREADS = 6

and for CUDA:

julia> CUDA.versioninfo()
CUDA runtime 12.3, artifact installation
CUDA driver 12.2
NVIDIA driver 535.129.3

CUDA libraries: 
- CUBLAS: 12.3.2
- CURAND: 10.3.4
- CUFFT: 11.0.11
- CUSOLVER: 11.5.3
- CUSPARSE: 12.1.3
- CUPTI: 21.0.0
- NVML: 12.0.0+535.129.3

Julia packages: 
- CUDA: 5.1.0
- CUDA_Driver_jll: 0.7.0+0
- CUDA_Runtime_jll: 0.10.0+1

Toolchain:
- Julia: 1.10.0-rc1
- LLVM: 15.0.7

1 device:
  0: NVIDIA GeForce RTX 3070 (sm_86, 39.125 MiB / 8.000 GiB available)

Testing @maxfreu’s version, it runs in 0.749 seconds for me on my machine. So faster, but only by 15%.

@DennisFang are you including compile time or something in your results? A common pitfall people experience in julia is writing everything into a script and then re-running that script, i.e. julia myscript.jl. If you do that, you’ll re-trigger compilation on every run.

On my machine, the first call to main takes around 8 seconds, which is almost all compile time, but all calls after that take less than a second.

For example, here’s a fresh session:

julia> using CUDA

julia> function DGP(N)
           x = range(0, 1, N^2)
           return reshape(x, (N, N))
       end
DGP (generic function with 1 method)

julia> function main(N)
                  x = CuArray(DGP(N))
                  V0 = CUDA.ones(Float64, N); idx = ()
                  a = 0.5
                  max_iter = 100
                  iter = 0
                  tmp = x .+ a * V0'
                  while iter < max_iter
                      V1 = V0
                      tmp .= x .+ a * V1'
                      V0, idx = findmax(tmp, dims=2)
                      iter += 1
                  end
                  return V0, idx, iter
              end
main (generic function with 1 method)
julia> CUDA.@time CUDA.@sync V, idx, iter = main(5)
  6.399324 seconds (12.34 M CPU allocations: 858.341 MiB, 2.14% gc time) (404 GPU allocations: 23.906 KiB, 0.13% memmgmt time)
([1.8333333333333335; 1.875; … ; 1.9583333333333335; 2.0;;], CartesianIndex{2}[CartesianIndex(1, 5); CartesianIndex(2, 5); … ; CartesianIndex(4, 5); CartesianIndex(5, 5);;], 100)

julia> CUDA.@time CUDA.@sync V, idx, iter = main(5)
  0.002766 seconds (15.98 k CPU allocations: 899.406 KiB) (404 GPU allocations: 23.906 KiB, 11.85% memmgmt time)
([1.8333333333333335; 1.875; … ; 1.9583333333333335; 2.0;;], CartesianIndex{2}[CartesianIndex(1, 5); CartesianIndex(2, 5); … ; CartesianIndex(4, 5); CartesianIndex(5, 5);;], 100)

julia> CUDA.@time CUDA.@sync V, idx, iter = main(2^13)
  0.755437 seconds (57.94 k CPU allocations: 515.690 MiB, 1.44% gc time) (404 GPU allocations: 1.037 GiB, 0.28% memmgmt time)
([1.9998779445868424; 1.9998779594880038; … ; 1.9999999850988386; 2.0;;], CartesianIndex{2}[CartesianIndex(1, 8192); CartesianIndex(2, 8192); … ; CartesianIndex(8191, 8192); CartesianIndex(8192, 8192);;], 100)
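The warm-up pattern from the session above can also be put in a script, so that the timed call never includes compilation. This is a CPU-only sketch with a hypothetical `work` function; for GPU code the timed expression would additionally need `CUDA.@sync`, as in the session above, because kernels launch asynchronously.

```julia
# Hypothetical workload: row-wise maximum of an n×n grid.
function work(n)
    A = reshape(collect(range(0, 1, length=n^2)), n, n)
    return findmax(A, dims=2)
end

t_first  = @elapsed work(5)   # first call in the process: includes JIT compilation
t_second = @elapsed work(5)   # same method and argument types: already compiled
println("first call: ", t_first, " s; second call: ", t_second, " s")
```

Both calls must happen in the same Julia process; starting a new `julia myscript.jl` process each time pays the compilation cost again.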

I know Julia needs to spend some time compiling on the first run, so I ran the same function several times to check.

Interesting, I wonder why it’s so slow on your machine then, especially compared to my much slower GPU.

Can you show us your CUDA.versioninfo()?

Looking at your script, you only have the CUDA.@time line once. When you say run it several times, are you 1) running that line repeatedly in the same process, 2) includeing the whole script repeatedly in the same process, 3) running different processes, or 4) something else? Just for clarification.
