This only takes 0.5s on my RTX 6000, which is faster than your RTX 3090, but not 20x faster.
You should provide additional information so that people can actually help you: e.g., the CUDA.jl version (by showing the output of CUDA.versioninfo()), and some minimal timing information by running your code under CUDA.@time and CUDA.@profile. Also try using CUDA.jl#master. Example output here:
julia> CUDA.@time main()
0.529043 seconds (20.34 k CPU allocations: 513.009 MiB, 0.79% gc time) (502 GPU allocations: 50.537 GiB, 63.05% memmgmt time)
julia> CUDA.@profile main()
Profiler ran for 536.75 ms, capturing 8100 events.
Host-side activity: calling CUDA APIs took 330.32 ms (61.54% of the trace)
┌──────────┬────────────┬───────┬──────────────────────────────────────┬─────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution │ Name │
├──────────┼────────────┼───────┼──────────────────────────────────────┼─────────────────────────┤
│ 61.54% │ 330.32 ms │ 2 │ 165.16 ms ± 233.57 ( 0.0 ‥ 330.32) │ cuStreamSynchronize │
│ 8.75% │ 46.97 ms │ 1 │ │ cuMemcpyHtoDAsync │
│ 0.30% │ 1.61 ms │ 505 │ 3.18 µs ± 19.6 ( 0.95 ‥ 395.54) │ cuMemAllocFromPoolAsync │
│ 0.30% │ 1.59 ms │ 501 │ 3.17 µs ± 1.24 ( 2.38 ‥ 20.03) │ cuLaunchKernel │
│ 0.10% │ 537.63 µs │ 460 │ 1.17 µs ± 0.47 ( 0.72 ‥ 9.06) │ cuMemFreeAsync │
│ 0.01% │ 43.39 µs │ 3 │ 14.46 µs ± 4.69 ( 9.06 ‥ 17.4) │ cuMemGetInfo │
│ 0.00% │ 13.11 µs │ 2 │ 6.56 µs ± 3.54 ( 4.05 ‥ 9.06) │ cuCtxSynchronize │
│ 0.00% │ 1.43 µs │ 6 │ 238.42 ns ± 150.79 ( 0.0 ‥ 476.84) │ cuMemPoolGetAttribute │
│ 0.00% │ 715.26 ns │ 9 │ 79.47 ns ± 119.21 ( 0.0 ‥ 238.42) │ cuDriverGetVersion │
└──────────┴────────────┴───────┴──────────────────────────────────────┴─────────────────────────┘
Device-side activity: GPU was busy for 405.64 ms (75.57% of the trace)
┌──────────┬────────────┬───────┬────────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution │ Name ⋯
├──────────┼────────────┼───────┼────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ 43.87% │ 235.48 ms │ 100 │ 2.35 ms ± 0.05 ( 2.25 ‥ 2.44) │ _Z22partial_mapreduce_grid8identity9reductionI6islessE5TupleI7Float645Int64E16CartesianIndicesILi2ES2_I5OneToIS4_ES6_IS4_EEES5_ILi2ES2_IS6_IS ⋯
│ 24.17% │ 129.71 ms │ 99 │ 1.31 ms ± 0.01 ( 1.29 ‥ 1.32) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float64Li2ELi1EE11BroadcastedI12CuArrayStyleILi2EE5TupleI5OneToI5Int64ES5_IS6_EE1_S4_I8 ⋯
│ 7.19% │ 38.6 ms │ 1 │ │ [copy pageable to device memory] ⋯
│ 0.24% │ 1.31 ms │ 1 │ │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float64Li2ELi1EE11BroadcastedI12CuArrayStyleILi2EE5TupleI5OneToI5Int64ES5_IS6_EE1_S4_I8 ⋯
│ 0.04% │ 193.36 µs │ 100 │ 1.93 µs ± 0.18 ( 1.67 ‥ 2.38) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float64Li2ELi1EE11BroadcastedI12CuArrayStyleILi2EE5TupleI5OneToI5Int64ES5_IS6_EE3_31S4_ ⋯
│ 0.04% │ 191.21 µs │ 100 │ 1.91 µs ± 0.17 ( 1.67 ‥ 2.15) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI14CartesianIndexILi2EELi2ELi1EE11BroadcastedI12CuArrayStyleILi2EE5TupleI5OneToI5Int64ES5 ⋯
│ 0.03% │ 142.34 µs │ 99 │ 1.44 µs ± 0.18 ( 1.19 ‥ 1.67) │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float64Li2ELi1EE11BroadcastedI12CuArrayStyleILi2EE5TupleI5OneToI5Int64ES5_IS6_EE1_S4_IS ⋯
│ 0.00% │ 1.43 µs │ 1 │ │ _Z2_615CuKernelContext13CuDeviceArrayI7Float64Li1ELi1EES1_ ⋯
│ 0.00% │ 1.43 µs │ 1 │ │ _Z16broadcast_kernel15CuKernelContext13CuDeviceArrayI7Float64Li1ELi1EE11BroadcastedI12CuArrayStyleILi1EE5TupleI5OneToI5Int64EE2_9I1_ES4_IS1_8 ⋯
└──────────┴────────────┴───────┴────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
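Roughly, the snippet I'd run (and paste) when reporting this kind of issue looks as follows; main() here just stands in for whatever code you are benchmarking:

julia> using Pkg; Pkg.add(name="CUDA", rev="master")   # switch to CUDA.jl#master

julia> using CUDA

julia> CUDA.versioninfo()        # CUDA.jl, driver and toolkit versions

julia> CUDA.@time main()         # wall time plus CPU/GPU allocation counts

julia> CUDA.@profile main()      # host- and device-side timing tables as above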
Regarding the memory usage: Julia, being a GC'd language, will always consume more memory, but the large difference is likely caused by CUDA.jl using a memory pool (which means that freed objects do not show up as freed memory; use CUDA.memory_status() if you want to differentiate between used and cached memory).
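For example (a minimal sketch; run from the REPL after the workload has executed):

julia> CUDA.memory_status()            # reports used vs. cached (pooled) device memory

julia> GC.gc(true); CUDA.reclaim()     # collect unused objects and return cached memory to the driver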