Performance regression with GPUArrays subset sum

I wrapped all the scalar accesses as you suggested, but I still see very slow timing and huge memory allocations:

julia> CUDA.allowscalar(false)
julia> @btime subsetSumCuArrays($s, $S)
  2.762 s (898 allocations: 7.31 GiB)
false

Here’s how I modified the function:

function subsetSumCuArrays(s, S)
    n = length(s)
    F_d = CUDA.zeros(Int8, S+1, n)
    s_d = CuArray{Int64,1}(s)
    F_d[1, :] .= 1
    CUDA.@allowscalar(s_d[1] ≤ S && (F_d[s_d[1]+1, 1] = 1))
    @views for j in 2:n
        F_d[2:S+1, j] .= F_d[2:S+1, j-1]
        if CUDA.@allowscalar(s_d[j] ≤ S)
            F_d[CUDA.@allowscalar(s_d[j]+1):S+1, j] .= F_d[CUDA.@allowscalar(s_d[j]+1):S+1, j] .| F_d[1:CUDA.@allowscalar(S+1-s_d[j]), j-1]
        end
    end
    synchronize()
    return Bool(CUDA.@allowscalar F_d[S+1, n])
end
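
One thing I noticed while writing this up: `s` is already a host `Vector`, so every `CUDA.@allowscalar s_d[j]` forces a device-to-host read just to recover a value the CPU already has, and line 21 does three of those per iteration. Here's a sketch of the variant I'm considering (untested on my side, the function name is mine): index the host vector directly, hoist the bound into a local, and use `.|=` for the in-place OR.

function subsetSumCuArraysHost(s, S)
    n = length(s)
    F_d = CUDA.zeros(Int8, S+1, n)
    F_d[1, :] .= 1
    s[1] ≤ S && CUDA.@allowscalar(F_d[s[1]+1, 1] = 1)
    @views for j in 2:n
        F_d[2:S+1, j] .= F_d[2:S+1, j-1]
        sj = s[j]   # plain CPU read, no GPU round-trip
        if sj ≤ S
            F_d[sj+1:S+1, j] .|= F_d[1:S+1-sj, j-1]
        end
    end
    return Bool(CUDA.@allowscalar F_d[S+1, n])
end

Would this be expected to remove the per-iteration synchronization cost, or is the bulk of the 7.31 GiB coming from somewhere else?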