Why is this CUDA interpolation kernel scaling poorly?

I'm quite new to CUDA programming with Julia. I have been trying to write a CUDA kernel that does a simple 1D linear interpolation on a uniform logarithmic grid. Here is the function:

function CUDA_interpolate!(out, ydata, E)

    # Note: Emag needs to be going in here, to ensure we are within bounds
    # (we'll be doing Emag as part of the full, fused kernel)

    # compute this thread's starting index and the grid-wide stride:
    index = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    stride = gridDim().x * blockDim().x

    for i = index:stride:length(out)
        # do the interpolation
        @inbounds begin
            Elog = log10(E[i])
            node_a = Int(cld(Elog + 3.05, 0.0025))

            # clamp node_a so ydata[node_a] and ydata[node_a+1] stay in bounds
            # when E falls outside the grid:
            if Elog > 2.0
                node_a = length(ydata) - 1
            elseif node_a < 1
                node_a = 1
            end

            out[i] = ydata[node_a] + (Elog + 3.05 - (node_a - 1) * 0.0025) *
                     (ydata[node_a+1] - ydata[node_a]) / 0.0025

            # pin the output to the first node if Elog is below the grid:
            if Elog < -3.05
                out[i] = ydata[1]
            end
        end
    end

    return nothing
end
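
For context, this is roughly how I'm launching and benchmarking the kernel (a minimal sketch rather than my exact script; the grid size of 2021 nodes and the random test data are just stand-ins):

using CUDA, BenchmarkTools

# Stand-in data: Float32 values on a log10 grid running from -3.05 to 2.0 in
# steps of 0.0025 (2021 nodes), plus 2000 sample points inside that range.
ydata = CUDA.rand(Float32, 2021)
E     = CuArray(10 .^ (5.05f0 .* rand(Float32, 2_000) .- 3.05f0))
out   = similar(E)

# One thread per array element, 1024 threads per block.
threads = 1024
blocks  = cld(length(E), threads)
println("For this GPU kernel, we need $blocks blocks, with $threads threads each.")

# Launch, synchronizing so the benchmark times the kernel itself and not just the queueing.
@benchmark CUDA.@sync @cuda threads=$threads blocks=$blocks CUDA_interpolate!($out, $ydata, $E)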

However, it seems to scale poorly as I increase the size of the array I am interpolating over. When E has length 2000, the kernel benchmarks at around 11 μs:


Loading packages...
Initializing streamer params ...
Creating GPU arrays.
Compiling kernel.
For this GPU kernel, we need 2 blocks, with 1024 threads each.
The error 2-norm is 0.07131546503928046.
Benchmarking the CPU method:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  18.402 μs … 60.124 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     18.649 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   18.792 μs ±  1.449 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  18.4 μs         Histogram: frequency by time        21.7 μs <

 Memory estimate: 15.81 KiB, allocs estimate: 1.
Benchmarking the GPU method:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  10.659 μs … 36.889 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     10.941 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.030 μs ±  1.039 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  10.7 μs         Histogram: frequency by time        11.5 μs <

 Memory estimate: 1.16 KiB, allocs estimate: 22.
Profiler ran for 48.16 µs, capturing 46 events.

Host-side activity: calling CUDA APIs took 21.22 µs (44.06% of the trace)
┌────┬──────────┬───────────┬─────────────────────┐
│ ID │    Start │      Time │ Name                │
├────┼──────────┼───────────┼─────────────────────┤
│  2 │ 21.22 µs │  16.45 µs │ cuLaunchKernel      │
│ 44 │ 45.54 µs │ 715.26 ns │ cuStreamSynchronize │
└────┴──────────┴───────────┴─────────────────────┘

Device-side activity: GPU was busy for 8.34 µs (17.33% of the trace)
┌────┬──────────┬─────────┬─────────┬────────┬──────┬──────────────────────────────────────────────────⋯
│ ID │    Start │    Time │ Threads │ Blocks │ Regs │ Name                                             ⋯
├────┼──────────┼─────────┼─────────┼────────┼──────┼──────────────────────────────────────────────────⋯
│  2 │ 36.24 µs │ 8.34 µs │    1024 │      2 │   32 │ _Z17CUDA_interpolate_13CuDeviceArrayI7Float32Li1 ⋯
└────┴──────────┴─────────┴─────────┴────────┴──────┴──────────────────────────────────────────────────⋯

However, when I increase the size of the E input array to around 10 million, it takes much longer (2.6 ms):

Loading packages...
Initializing streamer params ...
Creating GPU arrays.
Compiling kernel.
For this GPU kernel, we need 9766 blocks, with 1024 threads each.
The error 2-norm is 4.985414333197956.
Benchmarking the CPU method:
BenchmarkTools.Trial: 52 samples with 1 evaluation.
 Range (min … max):  95.782 ms … 102.007 ms  ┊ GC (min … max): 0.00% … 5.50%
 Time  (median):     96.279 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   97.418 ms ±   1.812 ms  ┊ GC (mean ± σ):  1.33% ± 1.82%

  95.8 ms         Histogram: frequency by time          100 ms <

 Memory estimate: 76.29 MiB, allocs estimate: 2.
Benchmarking the GPU method:
BenchmarkTools.Trial: 1888 samples with 1 evaluation.
 Range (min … max):  2.610 ms … 2.682 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.643 ms             ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.644 ms ± 4.568 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  2.61 ms        Histogram: frequency by time       2.67 ms <

 Memory estimate: 1.19 KiB, allocs estimate: 24.
Profiler ran for 2.78 ms, capturing 520 events.

Host-side activity: calling CUDA APIs took 2.5 ms (89.91% of the trace)
┌─────┬───────────┬──────────┬────────┬─────────────────────┐
│  ID │     Start │     Time │ Thread │ Name                │
├─────┼───────────┼──────────┼────────┼─────────────────────┤
│   2 │   21.7 µs │ 17.64 µs │      1 │ cuLaunchKernel      │
│ 518 │ 223.16 µs │   2.5 ms │      2 │ cuStreamSynchronize │
└─────┴───────────┴──────────┴────────┴─────────────────────┘

Device-side activity: GPU was busy for 2.68 ms (96.47% of the trace)
┌────┬──────────┬─────────┬─────────┬────────┬──────┬──────────────────────────────────────────────────⋯
│ ID │    Start │    Time │ Threads │ Blocks │ Regs │ Name                                             ⋯
├────┼──────────┼─────────┼─────────┼────────┼──────┼──────────────────────────────────────────────────⋯
│  2 │ 38.86 µs │ 2.68 ms │    1024 │   9766 │   32 │ _Z17CUDA_interpolate_13CuDeviceArrayI7Float32Li1 ⋯
└────┴──────────┴─────────┴─────────┴────────┴──────┴──────────────────────────────────────────────────⋯

In this case the number of blocks has been increased so that I still (apparently) have a dedicated thread for each array entry, so I am confused about why it takes so much longer to compute. I could understand some extra overhead in coordinating 10 million GPU threads rather than 2000, but with this scaling there seems to be minimal benefit from adding all those threads. I am sure there is something I am missing: where is the time going here?

(Note that this is running on a GTX 1080 Ti and I am using Float32s.)

OK, I think I understand it now. The key piece of info I hadn't appreciated is that each CUDA core can only run a single thread at any one time. The GTX 1080 Ti has 3584 CUDA cores, so with the 2000-length array all the threads can run concurrently. For the 10-million-length array, however, each CUDA core has to work through roughly 2800 threads sequentially to process all the entries.

That roughly lines up with the benchmark times: 8 μs × 2800 is about 23 ms. The actual computation, at 2.6 ms, was almost 10x faster than that, presumably because the 8 μs for the 2000-element example was mostly kernel launch overhead, and/or because the cores hide latency by keeping many threads in flight at once (a kind of staggered pipelining).
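
Writing out that back-of-the-envelope estimate (the 3584-core figure is the 1080 Ti spec; the 8.34 µs and 2.68 ms numbers are the device times from the profiler output above):

n_elements = 10_000_000
cuda_cores = 3584                        # GTX 1080 Ti
waves      = cld(n_elements, cuda_cores) # ≈ 2790 sequential "waves" of threads per core

t_small   = 8.34e-6                      # device time for the 2000-element kernel (s)
naive_est = waves * t_small              # ≈ 23 ms if each wave cost as much as the small kernel
measured  = 2.68e-3                      # device time for the 10-million-element kernel (s)

println("naive: ", round(naive_est * 1e3, digits = 1), " ms, measured: ",
        measured * 1e3, " ms, ratio ≈ ", round(naive_est / measured, digits = 1))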
