Why does GPU addition slow down as the arrays get larger, compared to other methods?

Consider the three functions below, which add two arrays sequentially, using threads, and using the GPU:

using BenchmarkTools
using CUDA

function sequential_add!(y, x)
    for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

function parallel_add!(y, x)
    Threads.@threads for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

function add_broadcast!(y, x)
    CUDA.@sync y .+= x
    return
end

I ran the code below for array sizes ranging from 10^8 to 5*10^8 and saved the running times:

using Statistics  # provides mean() for the BenchmarkTools trials below

const num_trials = 5
res = zeros(num_trials, 3);
for i in 1:num_trials
    println(i)
    N = i*10^8
    x_seq = fill(1.0f0, N)  # a vector filled with 1.0 (Float32)
    y_seq = fill(2.0f0, N);  # a vector filled with 2.0
    
    x_thred = fill(1.0f0, N)  # a vector filled with 1.0 (Float32)
    y_thred = fill(2.0f0, N);  # a vector filled with 2.0
    
    x_d = CUDA.fill(1.0f0, N)  # a vector stored on the GPU filled with 1.0 (Float32)
    y_d = CUDA.fill(2.0f0, N);  # a vector stored on the GPU filled with 2.0
        
    t_seq = @benchmark sequential_add!($y_seq, $x_seq)
    res[i, 1] = mean(t_seq).time
    
    t_thred = @benchmark parallel_add!($y_thred, $x_thred)
    res[i,2] = mean(t_thred).time
    
    t_cuda = @benchmark add_broadcast!($y_d, $x_d)
    res[i,3] = mean(t_cuda).time

end

Here is the result; the y-axis is time in milliseconds and the x-axis is the array size divided by 10^8:

Why do the GPU computations slow down compared to the other two beyond a certain array size? Is it because it now takes more time for the data (the arrays) to be transferred to the GPU?
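One way I could check the transfer part myself would be to benchmark only a host-to-device copy, independent of the addition. A minimal sketch (the buffer names are just illustrative):

using BenchmarkTools
using CUDA

N = 10^8
x_host = fill(1.0f0, N)          # source data on the CPU
x_dev  = CUDA.zeros(Float32, N)  # preallocated destination on the GPU

# Time only the transfer; CUDA.@sync waits until the copy has finished.
@benchmark CUDA.@sync copyto!(x_dev, x_host)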

So you’re suggesting that a broadcast operation over 5*10^8 elements takes longer on the GPU than it does on the CPU? That’s obviously not expected, and I cannot reproduce it even on the lowest-end GPU I have. Here are the timings on my workstation:

julia> N = 5*10^8;

julia> x_seq = fill(1.0f0, N);

julia> y_seq = fill(2.0f0, N);

julia> x_d = CUDA.fill(1.0f0, N);

julia> y_d = CUDA.fill(2.0f0, N);

julia> @benchmark y_seq .+= x_seq
BenchmarkTools.Trial: 29 samples with 1 evaluation.
 Range (min … max):  172.270 ms … 180.975 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     172.435 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   173.179 ms Β±   1.840 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–ˆβ–
  β–ˆβ–ˆβ–†β–β–ƒβ–ƒβ–β–β–β–ƒβ–ƒβ–β–β–β–ƒβ–β–β–β–β–β–β–β–β–β–β–β–ƒβ–β–β–β–ƒβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ƒ ▁
  172 ms           Histogram: frequency by time          181 ms <

 Memory estimate: 32 bytes, allocs estimate: 1.

julia> @benchmark CUDA.@sync y_d .+= x_d
BenchmarkTools.Trial: 649 samples with 1 evaluation.
 Range (min … max):  7.454 ms …   7.958 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     7.712 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   7.705 ms Β± 118.687 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

              β–„β–‡β–ˆ                ▁ β–‚β–‚β–ƒ
  β–β–‚β–‚β–β–β–β–β–‚β–ƒβ–ƒβ–„β–‡β–ˆβ–ˆβ–ˆβ–‡β–‡β–…β–…β–ƒβ–„β–†β–„β–„β–…β–…β–„β–…β–†β–„β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–„β–‡β–‡β–…β–†β–†β–…β–†β–‡β–„β–†β–ƒβ–ƒβ–ƒβ–‚β–ƒβ–„β–‡β–…β–‡β–ƒβ–… β–ƒ
  7.45 ms         Histogram: frequency by time        7.93 ms <

 Memory estimate: 1.42 KiB, allocs estimate: 26.

Since you’re not actually including the time to copy memory, I cannot imagine a GPU broadcast taking longer than a CPU broadcast, especially with large inputs. Did you investigate this yourself? What GPU are you using?
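If you want to dig in on your side, a quick first step (just a sketch) is CUDA.@time, which prints the elapsed time together with CPU and GPU allocation statistics for a single operation:

# Wrap one broadcast; CUDA.@sync makes sure the kernel has finished before timing stops.
CUDA.@time CUDA.@sync y_d .+= x_d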


I have an NVIDIA GeForce RTX 3050 Laptop GPU.
I realized that if I run the GPU code for N = 5*10^8 after a fresh restart of the notebook, the timing is different than when I run the same code after looping over the inputs [1,2,3,4]*10^8. After a fresh start I get around 100 ms for N = 5*10^8, while inside the loop I get around 350 ms for the same N.
Moreover, after a fresh start, running the code for N = 5*10^8 several times alternates between two timings: the first run takes about 100 ms, the second about 600 ms, the third 100 ms, and so on.

So could it be a driver issue, or something else on the GPU side?

N = 5*10^8
x_d = CUDA.fill(1.0f0, N)  # a vector stored on the GPU filled with 1.0 (Float32)
y_d = CUDA.fill(2.0f0, N);  # a vector stored on the GPU filled with 2.0
t_cuda = @benchmark add_broadcast!($y_d, $x_d)

BenchmarkTools.Trial: 50 samples with 1 evaluation.
 Range (min … max):  101.137 ms … 101.689 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     101.265 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   101.348 ms Β± 175.231 ΞΌs  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–„ β–β–ˆ ▁ ▁ ▁▄▄  β–„                 ▁    β–„      β–„                  
  β–ˆβ–†β–ˆβ–ˆβ–†β–ˆβ–β–ˆβ–β–ˆβ–ˆβ–ˆβ–β–†β–ˆβ–†β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–†β–ˆβ–†β–β–β–†β–ˆβ–†β–β–†β–†β–†β–†β–ˆβ–β–†β–β–β–β–β–β–β–β–†β–†β–β–†β–β–†β–† ▁
  101 ms           Histogram: frequency by time          102 ms <

========================
Second attempt
========================

N = 5*10^8
x_d = CUDA.fill(1.0f0, N)  # a vector stored on the GPU filled with 1.0 (Float32)
y_d = CUDA.fill(2.0f0, N);  # a vector stored on the GPU filled with 2.0
t_cuda = @benchmark add_broadcast!($y_d, $x_d)

BenchmarkTools.Trial: 9 samples with 1 evaluation.
 Range (min … max):  603.774 ms … 608.049 ms  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     603.979 ms               β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   604.436 ms Β±   1.365 ms  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

  β–β–ˆβ–ˆβ–β–   ▁                                                   ▁  
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–β–β–β–ˆβ–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–ˆ ▁
  604 ms           Histogram: frequency by time          608 ms <

That still seems way too slow. I take it your GPU is also driving your display? Maybe you have some processes running (e.g. browsers) that consume part of the GPU, either memory or compute resources?
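It could also be worth checking from within Julia how much device memory is still in use between runs, and explicitly freeing unused buffers before benchmarking again. Something along these lines (a sketch):

CUDA.memory_status()   # prints how much device memory is currently in use / held by the pool

# Release buffers that are no longer referenced before re-running the benchmark.
GC.gc()
CUDA.reclaim()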

Is it simply a RAM issue? An RTX 3050 Laptop GPU apparently has 4 GB of RAM, while the data you allocate is exactly 4 GB (two Float32 vectors of 5*10^8 elements, 2 GB each). How does, for instance, N = 2*10^8 perform in the benchmark? Looking at your graph, the issue doesn’t appear until you exceed ~75% of the RAM.
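For reference, the footprint is easy to check with a bit of arithmetic in the REPL:

julia> N = 5*10^8;

julia> bytes_per_vector = N * sizeof(Float32)   # 2*10^9 bytes = 2 GB per vector
2000000000

julia> 2 * bytes_per_vector / 2^30              # two such vectors, in GiB
3.725290298461914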

I use Windows 11. Using the Task Manager, I do not see any GPU utilization or memory consumption before running the code.

For N = 2*10^8 it takes around 15 ms, and this time the timings are consistent no matter how many times I run the code. So, as you suggested, it might have something to do with the memory capacity (which in my case is 4 GB).

I also have AMD Radeon Graphics on the laptop. The shared memory is 14 GB, which, together with the 4 GB of dedicated memory, makes 18 GB. Could the shared memory have some effect? I do not think the code uses the shared memory, since if I increase the array size further I get an error saying the GPU has 4 GB of RAM and the array is too large.
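To see what CUDA.jl itself reports for the card, I can query the device directly (a small check; as far as I understand, the Windows “shared GPU memory” would not be included in these numbers):

CUDA.totalmem(CUDA.device())   # total dedicated memory of the active GPU, in bytes
CUDA.memory_status()           # current usage as seen by CUDA.jl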