Julia version / CUDA compatibility with Quadro K4100M (compute capability 3.0)

Hi,
I’m really struggling to find a compatible combination of versions (NVIDIA CUDA toolkit, Julia, CUDA.jl) to get a simple linear shift-and-add (see function below) to run efficiently on my old Quadro K4100M GPU with compute capability 3.0.

julia> versioninfo()
Julia Version 1.5.4
Commit 69fcb5745b (2021-03-11 19:13 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4940MX CPU @ 3.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, haswell)
Environment:
  JULIA_NUM_THREADS = 4

julia> CUDA.device()
CuDevice(0): Quadro K4100M

function saad(Sd::CuArray{T}, Id::CuArray{T}, d::Int) where {T<:UInt16}
    Sd[1+d:end] .+= Id[1:end-d]
end

I found that CUDA toolkit 10.1 with driver 418 is the latest combination for compute capability 3.0. I’m currently running Julia 1.5.4 with CUDA.jl v1.3.3. The GPU runs at 100% utilization, but performance is worse than similar code using a single-threaded loop on the CPU:

function saa(S::Array{T}, I::Array{T}, d::Int) where {T<:UInt16}
    n = length(S)
    for i=1+d:n
        @inbounds S[i] += I[i-d]
    end
end
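
As an aside (my own sketch, not part of the original question): the loop above is parallel across `i`, since each iteration writes a distinct `S[i]` and only reads from `I`, so with `JULIA_NUM_THREADS = 4` a multithreaded variant makes a fairer CPU baseline. A minimal sketch:

```julia
# Multithreaded variant of `saa`; safe to parallelize because every
# iteration writes a distinct S[i] and only reads from I.
function saa_mt(S::Array{T}, I::Array{T}, d::Int) where {T<:UInt16}
    n = length(S)
    Threads.@threads for i in 1+d:n
        @inbounds S[i] += I[i-d]
    end
end
```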

I suspect that broadcast and thread usage are not optimal on the GPU. So far I cannot find any compatibility documentation on combining Julia and CUDA.jl versions to make optimal use of this old GPU.

Thanks for your guidance
The following is my test data:

using BenchmarkTools
n = 60 * 10^6
S = zeros(UInt16, n)
Sd = CuArray(S)
I = rand(UInt16.(1:9), n)
Id = CuArray(I)
d = 1
iter = 720
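
One note on timing (my addition, not from the original post): GPU operations in CUDA.jl launch asynchronously, so a plain `@btime` of `saad` largely measures launch overhead rather than the kernel itself. Wrapping the call in `CUDA.@sync` waits for completion. A sketch of how the two versions could be compared, assuming the variables above and a working GPU:

```julia
using BenchmarkTools, CUDA

# CPU baseline: single-threaded loop.
@btime saa($S, $I, $d)

# GPU version: CUDA.@sync blocks until the kernel completes;
# without it, @btime only times the asynchronous launch.
@btime CUDA.@sync saad($Sd, $Id, $d)
```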

You’re broadcasting an operation that is much too simple, and the GPU needs some arithmetic complexity to hide the latency of its memory operations.
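
One hedged illustration of that point (my sketch, not from the reply above): fusing several shift distances into a single custom kernel raises the arithmetic work done per element loaded from memory. Since `saad` only reads `I`, applying it once for each `d` in a tuple `ds` is equivalent to the fused loop below. The kernel uses CUDA.jl’s `@cuda` launch macro; the function names and the `ds` tuple are hypothetical:

```julia
using CUDA

# Kernel: each thread handles one element of S and accumulates all
# shifted contributions from I in a register before a single write-back.
function saad_fused_kernel!(S, I, ds)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(S)
        @inbounds acc = S[i]
        for d in ds
            if i > d
                @inbounds acc += I[i-d]
            end
        end
        @inbounds S[i] = acc
    end
    return nothing
end

# Launch wrapper: `ds` is a tuple of shift distances, e.g. (1, 2, 3, 4).
function saad_fused!(S::CuArray, I::CuArray, ds::NTuple)
    threads = 256
    blocks = cld(length(S), threads)
    @cuda threads=threads blocks=blocks saad_fused_kernel!(S, I, ds)
end
```

This way each element of `I` and `S` is touched once per pass instead of once per shift, which is the kind of added arithmetic intensity the reply is pointing at.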