Hi,
I’m really struggling to find a compatible combination of versions (NVIDIA CUDA toolkit, Julia, CUDA.jl) that runs a simple linear shift-and-add (see the functions below) efficiently on my old Quadro K4100M GPU, which has compute capability 3.0.
julia> versioninfo()
Julia Version 1.5.4
Commit 69fcb5745b (2021-03-11 19:13 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4940MX CPU @ 3.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, haswell)
Environment:
  JULIA_NUM_THREADS = 4
julia> CUDA.device()
CuDevice(0): Quadro K4100M
function saad(Sd::CuArray{T}, Id::CuArray{T}, d::Int) where {T<:UInt16}
    Sd[1+d:end] .+= Id[1:end-d]
end
I found that toolkit 10.1 with driver 418 is the latest that supports compute capability 3.0.
I am currently running Julia 1.5.4 with CUDA.jl v1.3.3.
The GPU runs at 100% utilization, but performance is worse than similar code using a single-threaded loop on the CPU:
function saa(S::Array{T}, I::Array{T}, d::Int) where {T<:UInt16}
    n = length(S)
    for i = 1+d:n
        @inbounds S[i] += I[i-d]
    end
end
I suspect that the broadcast and thread usage are not optimal on the GPU?
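One possible contributor, sketched here under assumptions: in Julia, range indexing such as `Id[1:end-d]` on the right-hand side of a broadcast makes a copy, so each call to `saad` probably allocates temporary arrays on the GPU before the actual addition runs. The variant below uses `@views` to avoid those copies. The name `saad_views!` is made up for this sketch, the signature is loosened to `AbstractVector` so it can be checked on plain CPU arrays, and whether it actually helps on CUDA.jl v1.3.3 with this GPU is untested.

```julia
# Hypothetical variant of saad (name invented for illustration).
# @views turns the range indexing into views, so the slices are not
# copied into temporaries before the broadcast.
function saad_views!(Sd::AbstractVector{T}, Id::AbstractVector{T}, d::Int) where {T<:UInt16}
    @views Sd[1+d:end] .+= Id[1:end-d]
    return Sd
end

# Small CPU check of the shift-and-add logic
a = zeros(UInt16, 5)
b = UInt16[1, 2, 3, 4, 5]
saad_views!(a, b, 1)   # a now holds 0, 1, 2, 3, 4
```

If this works on the GPU, it should be callable as `saad_views!(Sd, Id, d)` with the same `CuArray` arguments as `saad`.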
So far I cannot find any compatibility documentation on which Julia and CUDA.jl versions to combine to make the best use of this old GPU.
Thanks for your guidance.
Here is my test data:
using BenchmarkTools
n = 60 * 10^6
S = zeros(UInt16, n)
Sd = CuArray(S)
I = rand(UInt16.(1:9), n)
Id = CuArray(I)
d = 1
iter = 720