Why is Julia much slower than MATLAB for GPU computing?

Hello,

I use a local computer with an NVIDIA RTX 3090 GPU running Windows 10 to run models in different programming languages and compare their performance. When comparing Julia (v1.9.3) and MATLAB (R2023a), I find that MATLAB seems to manage memory much better than Julia on the GPU, which allows it to run faster.

To make the program easy for everyone to run and test, I provide a minimal example below. In this example, GPU computing in MATLAB is over 31 times faster than in Julia. Additionally, MATLAB’s GPU memory usage (6.4 GB) is far lower than Julia’s (> 32 GB). I am surprised by these large differences in performance. I would like to know whether Julia’s memory management is inefficient, or whether the Julia program could be revised to improve performance.

Here is the minimal example in Julia:

using CUDA

function DGP(N)
    x = range(0, 1, N^2)
    return reshape(x, (N, N))
end

function main(N)
    x = CuArray(DGP(N))
    V0 = CUDA.ones(Float64, N); idx = ()
    a = 0.5
    max_iter = 100
    iter = 0
    while iter < max_iter
        V1 = copy(V0)
        V0, idx = findmax(x .+ a * V1', dims=2)
        @show iter += 1
    end
    return V0, idx, iter
end

CUDA.@time CUDA.@sync V, idx, iter = main(2^13)

The following is the MATLAB version:

tic
[V0, idx, iter] = main(2^13);
toc	

function [res] = DGP(N)
   x = linspace(0, 1, N^2);
   res = reshape(x, N, N);
end

function [V0, idx, iter] = main(N)
   x = gpuArray(DGP(N));
   V0 = gpuArray.ones(N, 1); idx = zeros(N, 1);
   a = 0.5;
   max_iter = 100; 
   iter = 0;
   while iter < max_iter
       disp(['Start the iteration: ', num2str(iter+1)])
       V1 = V0;
       [V0, idx] = max(x + a * V1', [], 2);
       iter = iter + 1;
   end    
end

For the N = 2^13 case, the Julia runtime is 17.5817 seconds, and the MATLAB runtime is 0.5653 seconds.

Here are Windows Task Manager screenshots taken while running the program in Julia and MATLAB, respectively.


The screenshots show that Julia uses much more memory for this problem, which leads to extra time spent on memory management. MATLAB, meanwhile, uses less memory and stays far below the memory limit, so it does not spend any time cleaning up memory.

1 Like

In the Julia case, why are you making a copy on every single iteration of the loop?

In my original study, I needed to update the variable.

Back to this minimal example: I tried using V1 = V0 instead of copying, but I still get similar performance to before.

It only takes 0.5 s on my RTX 6000, which is faster than your RTX 3090, but not by 20x.

You should provide additional information so people can help you, e.g. the CUDA.jl version (from CUDA.versioninfo()), and some minimal timing information from running under CUDA.@time and CUDA.@profile. Also try CUDA.jl#master. Example output here:

julia> CUDA.@time main()
  0.529043 seconds (20.34 k CPU allocations: 513.009 MiB, 0.79% gc time) (502 GPU allocations: 50.537 GiB, 63.05% memmgmt time)

julia> CUDA.@profile main()
Profiler ran for 536.75 ms, capturing 8100 events.

Host-side activity: calling CUDA APIs took 330.32 ms (61.54% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬─────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                    │
├──────────┼────────────┼───────┼──────────────────────────────────────┼─────────────────────────┤
│   61.54% │  330.32 ms │     2 │ 165.16 ms ± 233.57 (   0.0 ‥ 330.32) │ cuStreamSynchronize     │
│    8.75% │   46.97 ms │     1 │                                      │ cuMemcpyHtoDAsync       │
│    0.30% │    1.61 ms │   505 │   3.18 µs ± 19.6   (  0.95 ‥ 395.54) │ cuMemAllocFromPoolAsync │
│    0.30% │    1.59 ms │   501 │   3.17 µs ± 1.24   (  2.38 ‥ 20.03)  │ cuLaunchKernel          │
│    0.10% │  537.63 µs │   460 │   1.17 µs ± 0.47   (  0.72 ‥ 9.06)   │ cuMemFreeAsync          │
│    0.01% │   43.39 µs │     3 │  14.46 µs ± 4.69   (  9.06 ‥ 17.4)   │ cuMemGetInfo            │
│    0.00% │   13.11 µs │     2 │   6.56 µs ± 3.54   (  4.05 ‥ 9.06)   │ cuCtxSynchronize        │
│    0.00% │    1.43 µs │     6 │ 238.42 ns ± 150.79 (   0.0 ‥ 476.84) │ cuMemPoolGetAttribute   │
│    0.00% │  715.26 ns │     9 │  79.47 ns ± 119.21 (   0.0 ‥ 238.42) │ cuDriverGetVersion      │
└──────────┴────────────┴───────┴──────────────────────────────────────┴─────────────────────────┘

Device-side activity: GPU was busy for 405.64 ms (75.57% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                  │ Name                                                                                                                                          ⋯
├──────────┼────────────┼───────┼────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│   43.87% │  235.48 ms │   100 │   2.35 ms ± 0.05   (  2.25 ‥ 2.44) │ _Z22partial_mapreduce_grid8identity9reductionI6islessE5TupleI7Float645Int64E16CartesianIndicesILi2ES2_I5OneToIS4_ES6_IS4_EEES5_ILi2ES2_IS6_IS ⋯
│   24.17% │  129.71 ms │    99 │   1.31 ms ± 0.01   (  1.29 ‥ 1.32) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float64Li2ELi1EE11BroadcastedI12CuArrayStyleILi2EE5TupleI5OneToI5Int64ES5_IS6_EE1_S4_I8 ⋯
│    7.19% │    38.6 ms │     1 │                                    │ [copy pageable to device memory]                                                                                                              ⋯
│    0.24% │    1.31 ms │     1 │                                    │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float64Li2ELi1EE11BroadcastedI12CuArrayStyleILi2EE5TupleI5OneToI5Int64ES5_IS6_EE1_S4_I8 ⋯
│    0.04% │  193.36 µs │   100 │   1.93 µs ± 0.18   (  1.67 ‥ 2.38) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float64Li2ELi1EE11BroadcastedI12CuArrayStyleILi2EE5TupleI5OneToI5Int64ES5_IS6_EE3_31S4_ ⋯
│    0.04% │  191.21 µs │   100 │   1.91 µs ± 0.17   (  1.67 ‥ 2.15) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI14CartesianIndexILi2EELi2ELi1EE11BroadcastedI12CuArrayStyleILi2EE5TupleI5OneToI5Int64ES5 ⋯
│    0.03% │  142.34 µs │    99 │   1.44 µs ± 0.18   (  1.19 ‥ 1.67) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float64Li2ELi1EE11BroadcastedI12CuArrayStyleILi2EE5TupleI5OneToI5Int64ES5_IS6_EE1_S4_IS ⋯
│    0.00% │    1.43 µs │     1 │                                    │ _Z2_615CuKernelContext13CuDeviceArrayI7Float64Li1ELi1EES1_                                                                                    ⋯
│    0.00% │    1.43 µs │     1 │                                    │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float64Li1ELi1EE11BroadcastedI12CuArrayStyleILi1EE5TupleI5OneToI5Int64EE2_9I1_ES4_IS1_8 ⋯
└──────────┴────────────┴───────┴────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Regarding the memory usage: Julia, being a GC’d language, will always consume more memory, but the large difference here is likely caused by CUDA.jl using a memory pool (freed objects are then not visible as freed memory; use CUDA.memory_status() if you want to distinguish memory that is actually in use from memory that is merely cached).
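For example (just a sketch; the exact report depends on your CUDA.jl version, and CUDA.reclaim() is only needed if you want to hand cached memory back to the driver):

julia> CUDA.memory_status()   # reports device memory actually in use vs. cached by the pool

julia> CUDA.reclaim()         # optionally releases cached pool memory back to the driver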

2 Likes

Have you checked that the MATLAB example uses Float64 as well? Consumer GPUs are usually much slower at Float64 than at Float32.
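If you want to compare, a Float32 variant of the Julia MWE only needs the element types changed, e.g. (untested sketch):

x  = CuArray{Float32}(DGP(N))   # convert the grid to Float32 on the device
V0 = CUDA.ones(Float32, N)      # match the element type of x
a  = 0.5f0                      # Float32 literal so the broadcast stays in Float32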

Try removing the @show?

I doubt I can help with the performance, but I have a couple of questions about the translation:

  1. idx = zeros(N, 1); in MATLAB would be idx = CartesianIndex(ntuple(i->0, N)) in Julia, not idx = (). But should it be that in either language? What is idx in [V0, idx] = max(x + a * V1', [], 2);? The docs don’t quite cover that usage, only the "linear" option, which gives an integer scalar (see the small findmax illustration after this list).
  2. V1 = V0; in MATLAB looks like it should be V1=V0 in Julia, not V1=copy(V0).
  3. Have you benchmarked it without disp/@show? Printing in tight loops tends to obscure the number crunching in benchmarks quite a bit.
  4. Skimming the docs, ones(n,1,"gpuArray") is what’s suggested. Does gpuArray have a ones member method, so that gpuArray.ones(N, 1) does the same thing? Just checking.
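For reference on point 1, this is what findmax with dims=2 returns on an ordinary CPU array, and it is what idx ends up holding in the Julia MWE (small illustration, values chosen arbitrarily):

A = [1 3; 4 2]
vals, inds = findmax(A, dims=2)
# vals == [3; 4;;]   -- a 2×1 matrix of the row maxima
# inds is a 2×1 Matrix{CartesianIndex{2}}: CartesianIndex(1, 2) and CartesianIndex(2, 1)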
2 Likes

Oh, and try using a scratch array to store the intermediate result:

julia> function main(N)
           x = CuArray(DGP(N))
           V0 = CUDA.ones(Float64, N); idx = ()
           a = 0.5
           max_iter = 100
           iter = 0
           tmp = x .+ a * V0'
           while iter < max_iter
               V1 = V0
               tmp .= x .+ a * V1'
               V0, idx = findmax(tmp, dims=2)
               iter += 1
           end
           return V0, idx, iter
       end

That should get rid of most of the memory management time.

4 Likes

At first glance, a * V1' and findmax should still allocate; maybe reuse V0 and some sort of index-array intermediate with mul! and findmax!? findmax! seems to only support linear indexing, though. Scratch that: pass in a ::Vector{CartesianIndex{2}}; a ::Vector{Int} doesn’t even work.

Interesting, I will give it a try.

The CUDA.jl Performance tips manual also mentions that using Int32 can be faster than using Int64: Performance Tips Β· CUDA.jl

Somewhat complicated by double (our Float64) being MATLAB’s default numeric type, even for integer literals.

Yes, it’s double. In my study, I use a while loop that stops on an error criterion of 1e-10, so I need Float64 rather than Float32 for the higher precision.

I still get a similar speed.

Moreover, I find that around the 75th iteration, the program pauses to clean up memory and only then runs the remaining iterations.

  1. Actually, in my original study I used a while loop and stopped it with an error criterion. To simplify the MWE, I replaced that with a fixed iteration count (max_iter).
    In MATLAB, idx = zeros(N, 1); can be deleted; idx is still available after the loop finishes. In Julia, by contrast, the variable needs to be declared outside the loop; otherwise it cannot be returned.
    I use max in MATLAB and findmax in Julia to get both the value (“what”) and the location (“where”) of the largest element along dims=2 at the same time.

  2. Yes, in my original study I needed copy to ensure that V1 and V0 exist separately at the same time; I then compute the error between them to check whether to stop the while loop (see the sketch after this list). For the MWE, the copy is indeed unnecessary.

  3. I tried running a version without any @show or println for the iteration info. The Julia program still runs at a similar speed.

  4. ones(n,1,"gpuArray") and gpuArray.ones(n, 1) are the same in MATLAB R2023a.
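For reference, here is a minimal sketch of the kind of stopping rule I mean in point 2 (not my original code; 1e-10 is just the tolerance mentioned above):

err = Inf
while err > 1e-10 && iter < max_iter
    V1 = copy(V0)                            # keep the previous iterate alive
    V0, idx = findmax(x .+ a * V1', dims=2)  # update the values and their locations
    err = maximum(abs.(V0 .- V1))            # the error check needs both V0 and V1
    iter += 1
end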

1 Like

You can write local idx outside the loop rather than initializing it to a type-unstable placeholder.
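That is, something along these lines (a sketch of the change applied to the MWE above):

function main(N)
    x = CuArray(DGP(N))
    V0 = CUDA.ones(Float64, N)
    local idx            # declared, but not bound to a placeholder of a different type
    a = 0.5
    max_iter = 100
    iter = 0
    while iter < max_iter
        V1 = V0
        V0, idx = findmax(x .+ a * V1', dims=2)
        iter += 1
    end
    return V0, idx, iter
end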

4 Likes

Thanks. I now get a similar speed to MATLAB. In addition, the memory usage is far lower than with the original Julia MWE and is similar to the MATLAB MWE.

2 Likes

For what it’s worth, your code runs in 0.87 seconds for me on my RTX 3070 without any modifications.

Here’s my version info for julia:

julia> versioninfo()
Julia Version 1.10.0-rc1
Commit 5aaa948543 (2023-11-03 07:44 UTC)

Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 12 Γ— AMD Ryzen 5 5600X 6-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
  Threads: 12 on 12 virtual cores
Environment:
  JULIA_NUM_THREADS = 6

and for CUDA:

julia> CUDA.versioninfo()
CUDA runtime 12.3, artifact installation
CUDA driver 12.2
NVIDIA driver 535.129.3

CUDA libraries: 
- CUBLAS: 12.3.2
- CURAND: 10.3.4
- CUFFT: 11.0.11
- CUSOLVER: 11.5.3
- CUSPARSE: 12.1.3
- CUPTI: 21.0.0
- NVML: 12.0.0+535.129.3

Julia packages: 
- CUDA: 5.1.0
- CUDA_Driver_jll: 0.7.0+0
- CUDA_Runtime_jll: 0.10.0+1

Toolchain:
- Julia: 1.10.0-rc1
- LLVM: 15.0.7

1 device:
  0: NVIDIA GeForce RTX 3070 (sm_86, 39.125 MiB / 8.000 GiB available)

Testing @maxfreu’s version, it runs in 0.749 seconds on my machine. So it is faster, but only by about 15%.

@DennisFang, are you including compile time in your results? A common pitfall in Julia is putting everything into a script and then re-running that script, i.e. julia myscript.jl. If you do that, you re-trigger compilation on every run.

On my machine, the first call to main takes around 8 seconds, which is almost all compile time, but all calls after that take less than a second.

For example, here’s a fresh session:

julia> using CUDA

julia> function DGP(N)
           x = range(0, 1, N^2)
           return reshape(x, (N, N))
       end
DGP (generic function with 1 method)

julia> function main(N)
                  x = CuArray(DGP(N))
                  V0 = CUDA.ones(Float64, N); idx = ()
                  a = 0.5
                  max_iter = 100
                  iter = 0
                  tmp = x .+ a * V0'
                  while iter < max_iter
                      V1 = V0
                      tmp .= x .+ a * V1'
                      V0, idx = findmax(tmp, dims=2)
                      iter += 1
                  end
                  return V0, idx, iter
              end
main (generic function with 1 method)
julia> CUDA.@time CUDA.@sync V, idx, iter = main(5)
  6.399324 seconds (12.34 M CPU allocations: 858.341 MiB, 2.14% gc time) (404 GPU allocations: 23.906 KiB, 0.13% memmgmt time)
([1.8333333333333335; 1.875; … ; 1.9583333333333335; 2.0;;], CartesianIndex{2}[CartesianIndex(1, 5); CartesianIndex(2, 5); … ; CartesianIndex(4, 5); CartesianIndex(5, 5);;], 100)

julia> CUDA.@time CUDA.@sync V, idx, iter = main(5)
  0.002766 seconds (15.98 k CPU allocations: 899.406 KiB) (404 GPU allocations: 23.906 KiB, 11.85% memmgmt time)
([1.8333333333333335; 1.875; … ; 1.9583333333333335; 2.0;;], CartesianIndex{2}[CartesianIndex(1, 5); CartesianIndex(2, 5); … ; CartesianIndex(4, 5); CartesianIndex(5, 5);;], 100)

julia> CUDA.@time CUDA.@sync V, idx, iter = main(2^13)
  0.755437 seconds (57.94 k CPU allocations: 515.690 MiB, 1.44% gc time) (404 GPU allocations: 1.037 GiB, 0.28% memmgmt time)
([1.9998779445868424; 1.9998779594880038; … ; 1.9999999850988386; 2.0;;], CartesianIndex{2}[CartesianIndex(1, 8192); CartesianIndex(2, 8192); … ; CartesianIndex(8191, 8192); CartesianIndex(8192, 8192);;], 100)

I know Julia needs to spend some time compiling on the first run, so I ran the same function several times to make sure of the timing.