Performance regression with GPUArrays subset sum

I’ve been away from Julia for several months and have been rerunning GPU code from this past summer on a new installation of Julia 1.5.3 and CUDA.jl (previously I used the separate CUDA packages). My subset sum program now runs much slower (about 3.4 s compared to 159 ms):

julia> s = rand(1:10000000,10);
julia> S = (10000000*10)÷4;
julia> @btime subsetSumCuArrays($s, $S)
  3.384 s (901 allocations: 7.47 GiB)
false

The cause is clearly the huge amount of memory being allocated (7.47 GiB, compared to 126.64 KiB previously), but I can’t figure out why so much is allocated. Has something changed, perhaps with the @views macro? Can anyone help?

Here’s the code:

function subsetSumCuArrays(s, S)
    n = length(s)
    F_d = CUDA.zeros(Int8, S+1, n)
    s_d = CuArray{Int64,1}(s)
    F_d[1,:] .= 1
    s_d[1]≤ S && (F_d[s_d[1]+1,1] = 1)
    @views for j in 2:n
      F_d[2:S+1,j] .=  F_d[2:S+1,j-1]
      if(s_d[j] <= S)
        F_d[s_d[j]+1:S+1,j] .= F_d[s_d[j]+1:S+1,j] .| F_d[1:S+1-s_d[j],j-1]
      end
    end
    synchronize()
    return Bool(F_d[S+1,n])
end

Here's my version info:

julia> CUDA.versioninfo()
CUDA toolkit 10.2.89, local installation
CUDA driver 10.2.0
NVIDIA driver 440.33.1

Libraries: 
- CUBLAS: 10.2.2
- CURAND: 10.1.2
- CUFFT: 10.1.2
- CUSOLVER: 10.3.0
- CUSPARSE: 10.3.1
- CUPTI: 12.0.0
- NVML: 10.0.0+440.33.1
- CUDNN: 8.0.5 (for CUDA 10.2.0)
- CUTENSOR: 1.2.1 (for CUDA 10.2.0)

Toolchain:
- Julia: 1.5.3
- LLVM: 9.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4
- Device support: sm_30, sm_32, sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75

Environment:
- JULIA_CUDA_USE_BINARYBUILDER: false

16 devices:
  0: Tesla V100-SXM3-32GB (sm_70, 31.365 GiB / 31.749 GiB available)
...

Did you disable scalar iteration? https://juliagpu.github.io/CUDA.jl/stable/usage/workflow/#UsageWorkflowScalar

Thanks for the suggestion. I know I’ve got scalar indexing on line 6 of the function:

s_d[1]≤ S && (F_d[s_d[1]+1,1] = 1)

and in the if statement in the loop, where I access s_d[j], but I don’t see how it would affect performance so much, or why I got good performance previously.

You’d better mark those explicitly with @allowscalar(...) then, so that you can still disable scalar indexing globally and make sure no other code paths hit it accidentally.
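
For reference, a minimal sketch of that workflow, assuming a working CUDA.jl installation and a usable GPU. The array name x_d is made up for illustration and is not part of the code in this thread:

```julia
using CUDA

# Disable scalar indexing globally: any unmarked element-wise access
# to a CuArray from the host now throws instead of running slowly.
CUDA.allowscalar(false)

x_d = CuArray([1, 2, 3])   # hypothetical example data

# A deliberate scalar read, explicitly marked as allowed.
first_val = CUDA.@allowscalar x_d[1]

# An unmarked scalar read is rejected while scalar indexing is disabled.
threw = try
    x_d[2]
    false
catch
    true
end
```

With this setup, any scalar access that was not marked on purpose shows up as an error rather than as a silent slowdown.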

I marked all the scalar bits as you suggested, and still get very slow timing and huge memory allocation:

julia> CUDA.allowscalar(false)
julia> @btime subsetSumCuArrays($s, $S)
  2.762 s (898 allocations: 7.31 GiB)
false

Here’s how I modified the function:

function subsetSumCuArrays(s, S)
    n = length(s)
    F_d = CUDA.zeros(Int8, S+1, n)
    s_d = CuArray{Int64,1}(s)
    F_d[1,:] .= 1
    CUDA.@allowscalar(s_d[1]≤ S && (F_d[s_d[1]+1,1] = 1))
    @views for j in 2:n
      F_d[2:S+1,j] .=  F_d[2:S+1,j-1]
      if(CUDA.@allowscalar(s_d[j] <= S))
        F_d[CUDA.@allowscalar(s_d[j]+1):S+1,j] .= F_d[CUDA.@allowscalar(s_d[j]+1):S+1,j] .| F_d[1:CUDA.@allowscalar(S+1-s_d[j]),j-1]
      end
    end
    synchronize()
    return Bool(CUDA.@allowscalar(F_d[S+1,n]))
end

OK, that looks like a bug then, please file an issue (with a fully reproducible example, etc). You could always try to reduce some more or profile individual operations to see where the issue is.
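
One way to profile an individual operation might look like the sketch below. It isolates the column copy from the loop and times it with CUDA.@time, which reports both CPU and GPU allocations. The helper names copy_col! and copy_col_inbounds! and the array size are assumptions chosen for illustration only:

```julia
using CUDA

# Hypothetical helper: the column copy from the loop body, with bounds checks.
function copy_col!(F, j)
    @views F[2:end, j] .= F[2:end, j-1]
    return F
end

# Same operation with bounds checking disabled.
function copy_col_inbounds!(F, j)
    @inbounds @views F[2:end, j] .= F[2:end, j-1]
    return F
end

F1 = CUDA.ones(Int8, 1_000_001, 2)
F2 = CUDA.ones(Int8, 1_000_001, 2)

# Warm up first so that compilation is not included in the timings.
copy_col!(F1, 2)
copy_col_inbounds!(F2, 2)

CUDA.@time copy_col!(F1, 2)
CUDA.@time copy_col_inbounds!(F2, 2)
```

Comparing the two reported allocation figures should show whether a single broadcast over views is enough to reproduce the problem.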

Thanks, I’ll do that, and come up with a simpler example.

After some testing with simpler examples I finally hit on the problem, which seems to have something to do with bounds checking. When I add @inbounds after @views at the beginning of the for loop, the time and memory usage drop dramatically:

4.366 ms (743 allocations: 21.34 KiB)

So it doesn’t appear to be a bug after all. Curious that this problem didn’t surface with the older version of CuArrays.
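
A sketch of what the change described above could look like, assuming a working CUDA.jl installation and positive integers in s. The only functional change from the version posted earlier is the added @inbounds. Reading s_d[j] once into a local variable sj is a liberty taken here for readability:

```julia
using CUDA

function subsetSumCuArrays(s, S)
    n = length(s)
    F_d = CUDA.zeros(Int8, S+1, n)    # F_d[t+1, j] == 1 if sum t is reachable with s[1:j]
    s_d = CuArray{Int64,1}(s)
    F_d[1, :] .= 1                    # the empty subset always reaches sum 0
    CUDA.@allowscalar(s_d[1] ≤ S && (F_d[s_d[1]+1, 1] = 1))
    # @inbounds skips bounds checks, so the ranges below must be valid;
    # they are as long as every s[j] is positive (assumed here).
    @views @inbounds for j in 2:n
        sj = CUDA.@allowscalar s_d[j] # single deliberate scalar read per iteration
        F_d[2:S+1, j] .= F_d[2:S+1, j-1]
        if sj <= S
            F_d[sj+1:S+1, j] .= F_d[sj+1:S+1, j] .| F_d[1:S+1-sj, j-1]
        end
    end
    synchronize()
    return Bool(CUDA.@allowscalar(F_d[S+1, n]))
end
```

Because @inbounds removes the safety net, it is worth checking the function on a few small inputs whose answers are known before trusting it on large ones.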

That’s probably https://github.com/JuliaGPU/CUDA.jl/pull/404 – if you think the bounds checking regressed, please open an issue.

Thanks for prompting me; I’ve just submitted an issue.