# Why is Julia much slower than MATLAB for GPU computing?

Hello,

I use a local machine with an NVIDIA RTX 3090 GPU, running Windows 10, to run models in different programming languages and compare their performance. When comparing Julia (v1.9.3) and MATLAB (R2023a), I find that MATLAB seems to manage GPU memory much better than Julia, which lets it run faster.

To make the program easy for everyone to run and test, I provide a minimal example below. In this example, GPU computing in MATLAB is over 31 times faster than in Julia. Additionally, MATLAB's GPU memory usage (6.4 GB) is far lower than Julia's (> 32 GB). I am surprised by these significant differences in performance. I would like to know whether Julia's memory management is inefficient, or whether the Julia program could be revised to improve performance.

Here is a minimal example in Julia:

``````julia
using CUDA

function DGP(N)
    x = range(0, 1, N^2)
    return reshape(x, (N, N))
end

function main(N)
    x = CuArray(DGP(N))
    V0 = CUDA.ones(Float64, N); idx = ()
    a = 0.5
    max_iter = 100
    iter = 0
    while iter < max_iter
        V1 = copy(V0)
        V0, idx = findmax(x .+ a * V1', dims=2)
        @show iter += 1
    end
    return V0, idx, iter
end

CUDA.@time CUDA.@sync V, idx, iter = main(2^13)
``````

The following is the MATLAB version:

``````matlab
tic
[V0, idx, iter] = main(2^13);
toc

function [res] = DGP(N)
    x = linspace(0, 1, N^2);
    res = reshape(x, N, N);
end

function [V0, idx, iter] = main(N)
    x = gpuArray(DGP(N));
    V0 = gpuArray.ones(N, 1); idx = zeros(N, 1);
    a = 0.5;
    max_iter = 100;
    iter = 0;
    while iter < max_iter
        disp(['Start the iteration: ', num2str(iter+1)])
        V1 = V0;
        [V0, idx] = max(x + a * V1', [], 2);
        iter = iter + 1;
    end
end
``````

For the N = 2^13 case, the runtime of Julia is 17.5817 seconds, and that of MATLAB is 0.5653 seconds.

Here are Windows Task Manager screenshots when running the program in Julia and MATLAB, respectively.

The figures show that Julia uses much more memory for this problem, so it spends extra time freeing memory. MATLAB uses less memory, and its usage stays far below the memory limit, so it does not spend any time on memory cleanup.

1 Like

In the Julia case, why are you making a copy on every single iteration of the loop?

In my original study, I needed to update the variable.

Back to this minimal example: I tried using `V1 = V0` instead of copying, but I still get similar performance as before.

It only takes 0.5 s on my RTX 6000, which is faster than your RTX 3090, but not 20x faster.

You should provide additional information for people to be able to help you, e.g., the CUDA.jl version (by showing `CUDA.versioninfo()`), running under `CUDA.@time` and `CUDA.@profile` to provide some minimal timing information, etc. Also try to use CUDA.jl#master. Example output here:

``````
julia> CUDA.@time main()
0.529043 seconds (20.34 k CPU allocations: 513.009 MiB, 0.79% gc time) (502 GPU allocations: 50.537 GiB, 63.05% memmgmt time)

julia> CUDA.@profile main()
Profiler ran for 536.75 ms, capturing 8100 events.

Host-side activity: calling CUDA APIs took 330.32 ms (61.54% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬─────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                    │ Name                    │
├──────────┼────────────┼───────┼──────────────────────────────────────┼─────────────────────────┤
│   61.54% │  330.32 ms │     2 │ 165.16 ms ± 233.57 (   0.0 ‥ 330.32) │ cuStreamSynchronize     │
│    8.75% │   46.97 ms │     1 │                                      │ cuMemcpyHtoDAsync       │
│    0.30% │    1.61 ms │   505 │   3.18 µs ± 19.6   (  0.95 ‥ 395.54) │ cuMemAllocFromPoolAsync │
│    0.30% │    1.59 ms │   501 │   3.17 µs ± 1.24   (  2.38 ‥ 20.03)  │ cuLaunchKernel          │
│    0.10% │  537.63 µs │   460 │   1.17 µs ± 0.47   (  0.72 ‥ 9.06)   │ cuMemFreeAsync          │
│    0.01% │   43.39 µs │     3 │  14.46 µs ± 4.69   (  9.06 ‥ 17.4)   │ cuMemGetInfo            │
│    0.00% │   13.11 µs │     2 │   6.56 µs ± 3.54   (  4.05 ‥ 9.06)   │ cuCtxSynchronize        │
│    0.00% │    1.43 µs │     6 │ 238.42 ns ± 150.79 (   0.0 ‥ 476.84) │ cuMemPoolGetAttribute   │
│    0.00% │  715.26 ns │     9 │  79.47 ns ± 119.21 (   0.0 ‥ 238.42) │ cuDriverGetVersion      │
└──────────┴────────────┴───────┴──────────────────────────────────────┴─────────────────────────┘

Device-side activity: GPU was busy for 405.64 ms (75.57% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                  │ Name                                                                                                                                          ⋯
├──────────┼────────────┼───────┼────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│   43.87% │  235.48 ms │   100 │   2.35 ms ± 0.05   (  2.25 ‥ 2.44) │ _Z22partial_mapreduce_grid8identity9reductionI6islessE5TupleI7Float645Int64E16CartesianIndicesILi2ES2_I5OneToIS4_ES6_IS4_EEES5_ILi2ES2_IS6_IS ⋯
│   24.17% │  129.71 ms │    99 │   1.31 ms ± 0.01   (  1.29 ‥ 1.32) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float64Li2ELi1EE11BroadcastedI12CuArrayStyleILi2EE5TupleI5OneToI5Int64ES5_IS6_EE1_S4_I8 ⋯
│    7.19% │    38.6 ms │     1 │                                    │ [copy pageable to device memory]                                                                                                              ⋯
│    0.04% │  193.36 µs │   100 │   1.93 µs ± 0.18   (  1.67 ‥ 2.38) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float64Li2ELi1EE11BroadcastedI12CuArrayStyleILi2EE5TupleI5OneToI5Int64ES5_IS6_EE3_31S4_ ⋯
│    0.04% │  191.21 µs │   100 │   1.91 µs ± 0.17   (  1.67 ‥ 2.15) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI14CartesianIndexILi2EELi2ELi1EE11BroadcastedI12CuArrayStyleILi2EE5TupleI5OneToI5Int64ES5 ⋯
│    0.03% │  142.34 µs │    99 │   1.44 µs ± 0.18   (  1.19 ‥ 1.67) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float64Li2ELi1EE11BroadcastedI12CuArrayStyleILi2EE5TupleI5OneToI5Int64ES5_IS6_EE1_S4_IS ⋯
│    0.00% │    1.43 µs │     1 │                                    │ _Z2_615CuKernelContext13CuDeviceArrayI7Float64Li1ELi1EES1_                                                                                    ⋯
└──────────┴────────────┴───────┴────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
``````

Regarding the memory usage: Julia being a GC’d language will always consume more memory, but the large difference is likely caused by CUDA.jl using a memory pool (which causes freed objects not being visible as freed memory; use `CUDA.memory_status()` if you want to differentiate between used and cached memory).
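As a quick way to see the difference between used and cached memory, a sketch like this can help (the exact output depends on your device and CUDA.jl version, and it requires a CUDA-capable GPU):

``````julia
using CUDA

CUDA.memory_status()   # reports memory actually in use vs. cached by the pool

GC.gc()                # collect unreachable CuArrays on the Julia side
CUDA.reclaim()         # return cached device memory to the driver

CUDA.memory_status()   # memory visible to the OS/Task Manager should now be lower
``````

So the Task Manager number largely reflects the pool's cache, not live allocations.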

2 Likes

Have you checked that the Matlab example uses Float64 as well? Usually, consumer GPUs are much slower for F64 than F32.
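To check the cost yourself, here is a rough sketch timing the same broadcast in both precisions (assumes a CUDA-capable GPU; note that GeForce cards have much lower double-precision ALU throughput than single, and Float64 also doubles the memory traffic of a bandwidth-bound broadcast like this one):

``````julia
using CUDA

N = 2^13
x64 = CUDA.rand(Float64, N, N); v64 = CUDA.rand(Float64, N)
x32 = CUDA.rand(Float32, N, N); v32 = CUDA.rand(Float32, N)

# Time one broadcast in each precision (run the script twice, or each line
# twice, so compilation time is excluded from the comparison).
CUDA.@time CUDA.@sync x64 .+ 0.5 .* v64';
CUDA.@time CUDA.@sync x32 .+ 0.5f0 .* v32';
``````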

Try removing the `@show`?

I doubt I can help with the performance, but I have a couple questions about the translation:

1. `idx = zeros(N, 1);` in MATLAB would be `idx = CartesianIndex(ntuple(i->0, N))` in Julia, not `idx = ()`. But should it be that in either language? What is `idx` in `[V0, idx] = max(x + a * V1', [], 2);`? The docs don't quite cover that usage, only the `"linear"` option, which gives an integer scalar.
2. `V1 = V0;` in MATLAB looks like it should be `V1=V0` in Julia, not `V1=copy(V0)`.
3. Have you benchmarked it without `disp`/`@show`? Printing in tight loops tends to obscure the number crunching in benchmarks quite a bit.
4. Skimming the docs, `ones(n,1,"gpuArray")` is suggested. Does `gpuArray` have a `ones` member method to do the same thing with `gpuArray.ones(N, 1)`? Just checking.
2 Likes

Oh and try to use a scratch array to store the intermediate result:

``````julia
function main(N)
    x = CuArray(DGP(N))
    V0 = CUDA.ones(Float64, N); idx = ()
    a = 0.5
    max_iter = 100
    iter = 0
    tmp = x .+ a * V0'
    while iter < max_iter
        V1 = V0
        tmp .= x .+ a * V1'
        V0, idx = findmax(tmp, dims=2)
        iter += 1
    end
    return V0, idx, iter
end
``````

That should get rid of most of the memory management time.

4 Likes

At first glance, `a * V1'` and `findmax` should still allocate; maybe use `V0` and some sort of index-array intermediate with `mul!` and `findmax!`? `findmax!` seems to only support linear indexing, though. Scratch that: pass in a `::Vector{CartesianIndex{2}}`; a `::Vector{Int}` doesn't even work.

Interesting, I will give it a try.

The CUDA.jl Performance tips manual also mentions that using `Int32` can be faster than using `Int64`: Performance Tips · CUDA.jl

Somewhat complicated by `double` (our `Float64`) being MATLAB’s default numeric type, even for written integers.

Yes, it's `double`. In my study, I use a `while` loop and stop based on an error criterion of 1e-10, so I need Float64 rather than Float32 for the higher precision.

I still get a similar speed.

Moreover, I find that at the 75th iteration, the program pauses to clean up memory. After that, it runs the remaining iterations.

1. Actually, in my original study, I used a while loop and stopped it with an error criterion. To simplify, I revised it into a fixed-count loop in the MWE.
In MATLAB, `idx = zeros(N, 1);` can be deleted; `idx` can still be printed after the loop finishes. In contrast, Julia needs the variable declared outside the loop; otherwise it cannot be accessed afterwards.
As for `max` in MATLAB and `findmax` in Julia, I use them to get both "what" and "where" the largest value along `dims=2` is at the same time.

2. Yes, in my original study, I needed `copy` so that V1 and V0 exist separately at the same time, in order to compute the error and check whether to stop the while loop. For the MWE, the copy is unnecessary.

3. I tried running a version without any `@show` or `println` for the iteration info. The Julia program still runs at a similar speed.

4. `ones(n,1,"gpuArray")` and `gpuArray.ones(n, 1)` are the same in MATLAB 2023a.

1 Like

You can write `local idx` outside the loop rather than initializing it to a type-unstable placeholder.
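A minimal sketch of that pattern applied to the MWE (this assumes the loop runs at least once, so `idx` is assigned before it is returned; it is not the OP's exact code):

``````julia
using CUDA

function main(N)
    x = CuArray(reshape(range(0, 1, N^2), (N, N)))
    V0 = CUDA.ones(Float64, N)
    local idx            # declared here so it is visible after the loop; no `()` placeholder
    a = 0.5
    tmp = x .+ a * V0'   # scratch array reused on every iteration
    for _ in 1:100
        tmp .= x .+ a * V0'
        V0, idx = findmax(tmp, dims=2)
    end
    return V0, idx
end
``````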

4 Likes

Thanks. I now get a speed similar to MATLAB's. In addition, I find the memory usage is far lower than with the original Julia MWE, and it is similar to the MATLAB MWE.

2 Likes

For what it’s worth, your code runs in `0.87 seconds` for me on my RTX 3070 without me making any modifications.

Here’s my version info for julia:

``````
julia> versioninfo()
Julia Version 1.10.0-rc1
Commit 5aaa948543 (2023-11-03 07:44 UTC)

Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: 12 × AMD Ryzen 5 5600X 6-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 12 on 12 virtual cores
Environment:
``````

and for CUDA:

``````
julia> CUDA.versioninfo()
CUDA runtime 12.3, artifact installation
CUDA driver 12.2
NVIDIA driver 535.129.3

CUDA libraries:
- CUBLAS: 12.3.2
- CURAND: 10.3.4
- CUFFT: 11.0.11
- CUSOLVER: 11.5.3
- CUSPARSE: 12.1.3
- CUPTI: 21.0.0
- NVML: 12.0.0+535.129.3

Julia packages:
- CUDA: 5.1.0
- CUDA_Driver_jll: 0.7.0+0
- CUDA_Runtime_jll: 0.10.0+1

Toolchain:
- Julia: 1.10.0-rc1
- LLVM: 15.0.7

1 device:
0: NVIDIA GeForce RTX 3070 (sm_86, 39.125 MiB / 8.000 GiB available)
``````

Testing @maxfreu’s version, it runs in 0.749 seconds for me on my machine. So faster, but only by 15%.

@DennisFang are you including compile time or something in your results? A common pitfall people experience in Julia is writing everything into a script and then re-running that script, i.e. `julia myscript.jl`. If you do that, you'll re-trigger compilation on every run.

On my machine, the first call to `main` takes around 8 seconds, which is almost all compile time, but all calls after that take less than a second.

For example, here’s a fresh session:

``````
julia> using CUDA

julia> function DGP(N)
           x = range(0, 1, N^2)
           return reshape(x, (N, N))
       end
DGP (generic function with 1 method)

julia> function main(N)
           x = CuArray(DGP(N))
           V0 = CUDA.ones(Float64, N); idx = ()
           a = 0.5
           max_iter = 100
           iter = 0
           tmp = x .+ a * V0'
           while iter < max_iter
               V1 = V0
               tmp .= x .+ a * V1'
               V0, idx = findmax(tmp, dims=2)
               iter += 1
           end
           return V0, idx, iter
       end
main (generic function with 1 method)
``````
``````
julia> CUDA.@time CUDA.@sync V, idx, iter = main(5)
6.399324 seconds (12.34 M CPU allocations: 858.341 MiB, 2.14% gc time) (404 GPU allocations: 23.906 KiB, 0.13% memmgmt time)
([1.8333333333333335; 1.875; … ; 1.9583333333333335; 2.0;;], CartesianIndex{2}[CartesianIndex(1, 5); CartesianIndex(2, 5); … ; CartesianIndex(4, 5); CartesianIndex(5, 5);;], 100)

julia> CUDA.@time CUDA.@sync V, idx, iter = main(5)
0.002766 seconds (15.98 k CPU allocations: 899.406 KiB) (404 GPU allocations: 23.906 KiB, 11.85% memmgmt time)
([1.8333333333333335; 1.875; … ; 1.9583333333333335; 2.0;;], CartesianIndex{2}[CartesianIndex(1, 5); CartesianIndex(2, 5); … ; CartesianIndex(4, 5); CartesianIndex(5, 5);;], 100)

julia> CUDA.@time CUDA.@sync V, idx, iter = main(2^13)
0.755437 seconds (57.94 k CPU allocations: 515.690 MiB, 1.44% gc time) (404 GPU allocations: 1.037 GiB, 0.28% memmgmt time)
([1.9998779445868424; 1.9998779594880038; … ; 1.9999999850988386; 2.0;;], CartesianIndex{2}[CartesianIndex(1, 8192); CartesianIndex(2, 8192); … ; CartesianIndex(8191, 8192); CartesianIndex(8192, 8192);;], 100)
``````

I know Julia needs to spend some time compiling on the first run, so I run the same function several times and only time the later runs.
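A minimal sketch of that warm-up pattern (using the names from the MWE above; requires a CUDA-capable GPU):

``````julia
main(2)                              # small warm-up run to trigger compilation
CUDA.@time CUDA.@sync main(2^13)     # timed run at the full problem size
``````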