Why is this CUDA interpolation kernel scaling poorly?

I'm quite new to CUDA programming with Julia. I have been trying to write a CUDA kernel that does a simple 1D linear interpolation on a uniform logarithmic grid. Here is the function:

function CUDA_interpolate!(out, ydata, E)

    # Note: Emag needs to be going in here, to ensure we are within bounds
    # (we'll be doing Emag as part of the full, fused kernel)

    # compute this thread's starting index and the grid-wide stride:
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x

    for i = index:stride:length(out)
        # do the interpolation
        @inbounds begin
            Elog = log10(E[i])
            node_a = Int(cld(Elog + 3.05, 0.0025))

            # clamp node_a so ydata[node_a] and ydata[node_a+1] stay in bounds
            # when E falls outside the grid:
            if Elog > 2.0
                node_a = length(ydata) - 1
            elseif node_a < 1
                node_a = 1
            end

            out[i] = ydata[node_a] + (Elog + 3.05 - (node_a - 1) * 0.0025) *
                     (ydata[node_a+1] - ydata[node_a]) / 0.0025

            # pin the output to the first node if Elog is below the grid:
            if Elog < -3.05
                out[i] = ydata[1]
            end
        end
    end

    return nothing
end
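
For context, this is roughly how I'm launching and benchmarking the kernel (a minimal sketch rather than my exact script; the grid size of 2021 nodes and the random test data are just stand-ins):

using CUDA, BenchmarkTools

# Stand-in data: Float32 values on a log10 grid running from -3.05 to 2.0 in
# steps of 0.0025 (2021 nodes), plus 2000 sample points inside that range.
ydata = CUDA.rand(Float32, 2021)
E     = CuArray(10 .^ (5.05f0 .* rand(Float32, 2_000) .- 3.05f0))
out   = similar(E)

# One thread per array element, 1024 threads per block.
threads = 1024
blocks  = cld(length(E), threads)
println("For this GPU kernel, we need $blocks blocks, with $threads threads each.")

# Launch, synchronizing so the benchmark times the kernel itself and not just the queueing.
@benchmark CUDA.@sync @cuda threads=$threads blocks=$blocks CUDA_interpolate!($out, $ydata, $E)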

However, it seems to scale poorly as I increase the size of the array I am interpolating over. When E has length 2000, the kernel benchmarks at around 11 μs:


Loading packages...
Initializing streamer params ...
Creating GPU arrays.
Compiling kernel.
For this GPU kernel, we need 2 blocks, with 1024 threads each.
The error 2-norm is 0.07131546503928046.
Benchmarking the CPU method:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  18.402 μs … 60.124 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     18.649 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   18.792 μs ±  1.449 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  18.4 μs         Histogram: frequency by time        21.7 μs <

 Memory estimate: 15.81 KiB, allocs estimate: 1.
Benchmarking the GPU method:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  10.659 μs … 36.889 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     10.941 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.030 μs ±  1.039 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  10.7 μs         Histogram: frequency by time        11.5 μs <

 Memory estimate: 1.16 KiB, allocs estimate: 22.
Profiler ran for 48.16 µs, capturing 46 events.

Host-side activity: calling CUDA APIs took 21.22 µs (44.06% of the trace)
┌────┬──────────┬───────────┬─────────────────────┐
│ ID │    Start │      Time │ Name                │
├────┼──────────┼───────────┼─────────────────────┤
│  2 │ 21.22 µs │  16.45 µs │ cuLaunchKernel      │
│ 44 │ 45.54 µs │ 715.26 ns │ cuStreamSynchronize │
└────┴──────────┴───────────┴─────────────────────┘

Device-side activity: GPU was busy for 8.34 µs (17.33% of the trace)
┌────┬──────────┬─────────┬─────────┬────────┬──────┬──────────────────────────────────────────────────⋯
│ ID │    Start │    Time │ Threads │ Blocks │ Regs │ Name                                             ⋯
├────┼──────────┼─────────┼─────────┼────────┼──────┼──────────────────────────────────────────────────⋯
│  2 │ 36.24 µs │ 8.34 µs │    1024 │      2 │   32 │ _Z17CUDA_interpolate_13CuDeviceArrayI7Float32Li1 ⋯
└────┴──────────┴─────────┴─────────┴────────┴──────┴──────────────────────────────────────────────────⋯

However, when I increase the size of the E input array to around 10 million, it takes much longer (2.6 ms):

Loading packages...
Initializing streamer params ...
Creating GPU arrays.
Compiling kernel.
For this GPU kernel, we need 9766 blocks, with 1024 threads each.
The error 2-norm is 4.985414333197956.
Benchmarking the CPU method:
BenchmarkTools.Trial: 52 samples with 1 evaluation.
 Range (min … max):  95.782 ms … 102.007 ms  ┊ GC (min … max): 0.00% … 5.50%
 Time  (median):     96.279 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   97.418 ms ±   1.812 ms  ┊ GC (mean ± σ):  1.33% ± 1.82%

  95.8 ms         Histogram: frequency by time          100 ms <

 Memory estimate: 76.29 MiB, allocs estimate: 2.
Benchmarking the GPU method:
BenchmarkTools.Trial: 1888 samples with 1 evaluation.
 Range (min … max):  2.610 ms … 2.682 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.643 ms             ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.644 ms ± 4.568 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  2.61 ms        Histogram: frequency by time       2.67 ms <

 Memory estimate: 1.19 KiB, allocs estimate: 24.
Profiler ran for 2.78 ms, capturing 520 events.

Host-side activity: calling CUDA APIs took 2.5 ms (89.91% of the trace)
┌─────┬───────────┬──────────┬────────┬─────────────────────┐
│  ID │     Start │     Time │ Thread │ Name                │
├─────┼───────────┼──────────┼────────┼─────────────────────┤
│   2 │   21.7 µs │ 17.64 µs │      1 │ cuLaunchKernel      │
│ 518 │ 223.16 µs │   2.5 ms │      2 │ cuStreamSynchronize │
└─────┴───────────┴──────────┴────────┴─────────────────────┘

Device-side activity: GPU was busy for 2.68 ms (96.47% of the trace)
┌────┬──────────┬─────────┬─────────┬────────┬──────┬──────────────────────────────────────────────────⋯
│ ID │    Start │    Time │ Threads │ Blocks │ Regs │ Name                                             ⋯
├────┼──────────┼─────────┼─────────┼────────┼──────┼──────────────────────────────────────────────────⋯
│  2 │ 38.86 µs │ 2.68 ms │    1024 │   9766 │   32 │ _Z17CUDA_interpolate_13CuDeviceArrayI7Float32Li1 ⋯
└────┴──────────┴─────────┴─────────┴────────┴──────┴──────────────────────────────────────────────────⋯

In this case the number of blocks has been increased so that I still (apparently) have a dedicated thread for each array entry, so I am confused about why it takes so much longer to compute. I could understand some extra overhead in coordinating 10 million GPU threads rather than 2000, but with this scaling there seems to be minimal benefit from adding all those threads. I am sure there is something I am missing: where is the time going here?

(Note that this is running on a GTX 1080 Ti and I am using Float32s.)

OK, I think I understand it now. The key piece of info I hadn't appreciated is that each CUDA core can only run a single thread at any one time. The GTX 1080 Ti has 3584 CUDA cores, so with the 2000-length array all the threads can run concurrently. For the 10-million-length array, however, each CUDA core has to work through roughly 2800 threads sequentially to process all the entries.

That roughly lines up with the benchmark times: 8 μs × 2800 is about 23 ms. The actual computation, at 2.6 ms, was almost 10x faster than that, presumably because the 8 μs for the 2000-element example was mostly kernel launch overhead, and/or because the cores hide latency by keeping many threads in flight at once (a kind of staggered pipelining).
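
Writing out that back-of-the-envelope estimate (the 3584-core figure is the 1080 Ti spec; the 8.34 µs and 2.68 ms numbers are the device times from the profiler output above):

n_elements = 10_000_000
cuda_cores = 3584                        # GTX 1080 Ti
waves      = cld(n_elements, cuda_cores) # ≈ 2790 sequential "waves" of threads per core

t_small   = 8.34e-6                      # device time for the 2000-element kernel (s)
naive_est = waves * t_small              # ≈ 23 ms if each wave cost as much as the small kernel
measured  = 2.68e-3                      # device time for the 10-million-element kernel (s)

println("naive: ", round(naive_est * 1e3, digits = 1), " ms, measured: ",
        measured * 1e3, " ms, ratio ≈ ", round(naive_est / measured, digits = 1))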
