I marked all the scalar operations as you suggested, but the function is still very slow and allocates a huge amount of memory:
julia> CUDA.allowscalar(false)
julia> @btime subsetSumCuArrays($s, $S)
2.762 s (898 allocations: 7.31 GiB)
false
Here’s how I modified the function:
function subsetSumCuArrays(s, S)
    n = length(s)
    F_d = CUDA.zeros(Int8, S+1, n)
    s_d = CuArray{Int64,1}(s)
    F_d[1,:] .= 1
    CUDA.@allowscalar(s_d[1] ≤ S && (F_d[s_d[1]+1,1] = 1))
    @views for j in 2:n
        F_d[2:S+1,j] .= F_d[2:S+1,j-1]
        if CUDA.@allowscalar(s_d[j] <= S)
            F_d[CUDA.@allowscalar(s_d[j]+1):S+1,j] .= F_d[CUDA.@allowscalar(s_d[j]+1):S+1,j] .| F_d[1:CUDA.@allowscalar(S+1-s_d[j]),j-1]
        end
    end
    synchronize()
    return Bool(CUDA.@allowscalar(F_d[S+1,n]))
end
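
For cross-checking the result on small inputs, here is a minimal CPU-only sketch of the same recurrence, in which column j marks the sums reachable from the first j items. The name `subsetSumCPU` is hypothetical and not part of the code above, and the sketch assumes `s` is a non-empty vector of non-negative integers. It says nothing about GPU performance; it only gives a reference answer to compare against.

```julia
# CPU-only reference for the same dynamic program (hypothetical helper).
# F[t+1, j] is true when some subset of s[1:j] sums to t.
# Assumes s is non-empty and holds non-negative integers.
function subsetSumCPU(s::AbstractVector{<:Integer}, S::Integer)
    n = length(s)
    F = falses(S + 1, n)
    F[1, :] .= true                        # the empty subset always sums to 0
    s[1] <= S && (F[s[1] + 1, 1] = true)   # the first item alone reaches s[1]
    for j in 2:n
        F[:, j] .= F[:, j - 1]             # option 1: skip item j
        if s[j] <= S
            # option 2: take item j, shifting reachable sums up by s[j]
            F[s[j]+1:S+1, j] .|= F[1:S+1-s[j], j - 1]
        end
    end
    return F[S + 1, n]
end
```

Calling `subsetSumCPU([3, 34, 4, 12, 5, 2], 9)` should agree with `subsetSumCuArrays` on the same arguments; the matrix has `(S+1) * n` entries, so this is only practical for small `S` and `n`.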