This only takes 0.5s on my RTX 6000, which is faster than your RTX 3090, but not 20x faster.
You should provide additional information so that people can actually help you: e.g., the CUDA.jl version (by showing the output of CUDA.versioninfo()), and some minimal timing information by running your code under CUDA.@time and CUDA.@profile. Also try using CUDA.jl#master. Example output here:
julia> CUDA.@time main()
0.529043 seconds (20.34 k CPU allocations: 513.009 MiB, 0.79% gc time) (502 GPU allocations: 50.537 GiB, 63.05% memmgmt time)
julia> CUDA.@profile main()
Profiler ran for 536.75 ms, capturing 8100 events.
Host-side activity: calling CUDA APIs took 330.32 ms (61.54% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬─────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution │ Name │
├──────────┼────────────┼───────┼──────────────────────────────────────┼─────────────────────────┤
│ 61.54% │ 330.32 ms │ 2 │ 165.16 ms ± 233.57 ( 0.0 ‥ 330.32) │ cuStreamSynchronize │
│ 8.75% │ 46.97 ms │ 1 │ │ cuMemcpyHtoDAsync │
│ 0.30% │ 1.61 ms │ 505 │ 3.18 µs ± 19.6 ( 0.95 ‥ 395.54) │ cuMemAllocFromPoolAsync │
│ 0.30% │ 1.59 ms │ 501 │ 3.17 µs ± 1.24 ( 2.38 ‥ 20.03) │ cuLaunchKernel │
│ 0.10% │ 537.63 µs │ 460 │ 1.17 µs ± 0.47 ( 0.72 ‥ 9.06) │ cuMemFreeAsync │
│ 0.01% │ 43.39 µs │ 3 │ 14.46 µs ± 4.69 ( 9.06 ‥ 17.4) │ cuMemGetInfo │
│ 0.00% │ 13.11 µs │ 2 │ 6.56 µs ± 3.54 ( 4.05 ‥ 9.06) │ cuCtxSynchronize │
│ 0.00% │ 1.43 µs │ 6 │ 238.42 ns ± 150.79 ( 0.0 ‥ 476.84) │ cuMemPoolGetAttribute │
│ 0.00% │ 715.26 ns │ 9 │ 79.47 ns ± 119.21 ( 0.0 ‥ 238.42) │ cuDriverGetVersion │
└──────────┴────────────┴───────┴──────────────────────────────────────┴─────────────────────────┘
Device-side activity: GPU was busy for 405.64 ms (75.57% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution │ Name ⋯
├──────────┼────────────┼───────┼────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ 43.87% │ 235.48 ms │ 100 │ 2.35 ms ± 0.05 ( 2.25 ‥ 2.44) │ _Z22partial_mapreduce_grid8identity9reductionI6islessE5TupleI7Float645Int64E16CartesianIndicesILi2ES2_I5OneToIS4_ES6_IS4_EEES5_ILi2ES2_IS6_IS ⋯
│ 24.17% │ 129.71 ms │ 99 │ 1.31 ms ± 0.01 ( 1.29 ‥ 1.32) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float64Li2ELi1EE11BroadcastedI12CuArrayStyleILi2EE5TupleI5OneToI5Int64ES5_IS6_EE1_S4_I8 ⋯
│ 7.19% │ 38.6 ms │ 1 │ │ [copy pageable to device memory] ⋯
│ 0.24% │ 1.31 ms │ 1 │ │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float64Li2ELi1EE11BroadcastedI12CuArrayStyleILi2EE5TupleI5OneToI5Int64ES5_IS6_EE1_S4_I8 ⋯
│ 0.04% │ 193.36 µs │ 100 │ 1.93 µs ± 0.18 ( 1.67 ‥ 2.38) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float64Li2ELi1EE11BroadcastedI12CuArrayStyleILi2EE5TupleI5OneToI5Int64ES5_IS6_EE3_31S4_ ⋯
│ 0.04% │ 191.21 µs │ 100 │ 1.91 µs ± 0.17 ( 1.67 ‥ 2.15) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI14CartesianIndexILi2EELi2ELi1EE11BroadcastedI12CuArrayStyleILi2EE5TupleI5OneToI5Int64ES5 ⋯
│ 0.03% │ 142.34 µs │ 99 │ 1.44 µs ± 0.18 ( 1.19 ‥ 1.67) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float64Li2ELi1EE11BroadcastedI12CuArrayStyleILi2EE5TupleI5OneToI5Int64ES5_IS6_EE1_S4_IS ⋯
│ 0.00% │ 1.43 µs │ 1 │ │ _Z2_615CuKernelContext13CuDeviceArrayI7Float64Li1ELi1EES1_ ⋯
│ 0.00% │ 1.43 µs │ 1 │ │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float64Li1ELi1EE11BroadcastedI12CuArrayStyleILi1EE5TupleI5OneToI5Int64EE2_9I1_ES4_IS1_8 ⋯
└──────────┴────────────┴───────┴────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
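Roughly, the snippet I'd run (and paste) when reporting this kind of issue looks as follows; main() here just stands in for whatever code you are benchmarking:

julia> using Pkg; Pkg.add(name="CUDA", rev="master")   # switch to CUDA.jl#master

julia> using CUDA

julia> CUDA.versioninfo()        # CUDA.jl, driver and toolkit versions

julia> CUDA.@time main()         # wall time plus CPU/GPU allocation counts

julia> CUDA.@profile main()      # host- and device-side timing tables as above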
Regarding the memory usage: Julia, being a GC'd language, will always consume more memory, but the large difference is likely caused by CUDA.jl using a memory pool (which means that freed objects do not show up as freed memory; use CUDA.memory_status() if you want to differentiate between used and cached memory).
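For example (a minimal sketch; run from the REPL after the workload has executed):

julia> CUDA.memory_status()            # reports used vs. cached (pooled) device memory

julia> GC.gc(true); CUDA.reclaim()     # collect unused objects and return cached memory to the driver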