Performance regression with GPUArrays subset sum

eaubanel · November 25, 2020, 7:27pm

I’ve been away from Julia for several months, and have been rerunning GPU code from this past summer on a new installation of Julia 1.5.3 and CUDA.jl (previously I used the separate CUDA packages). My subset sum program now runs much slower (about 3.4 s compared to 159 ms):

julia> s = rand(1:10000000,10);
julia> S = (10000000*10)÷4;
julia> @btime subsetSumCuArrays($s, $S)
  3.384 s (901 allocations: 7.47 GiB)
false

The slowdown is clearly due to the huge amount of memory being allocated (7.47 GiB now, compared to 126.64 KiB previously). I can’t figure out why so much memory is being allocated. Has something changed, perhaps with the @views macro? Can anyone help?

Here’s the code:

function subsetSumCuArrays(s, S)
    n = length(s)
    F_d = CUDA.zeros(Int8, S+1, n)
    s_d = CuArray{Int64,1}(s)
    F_d[1,:] .= 1
    s_d[1]≤ S && (F_d[s_d[1]+1,1] = 1)
    @views for j in 2:n
      F_d[2:S+1,j] .=  F_d[2:S+1,j-1]
      if(s_d[j] <= S)
        F_d[s_d[j]+1:S+1,j] .= F_d[s_d[j]+1:S+1,j] .| F_d[1:S+1-s_d[j],j-1]
      end
    end
    synchronize()
    return Bool(F_d[S+1,n])
end

Here's my version info:
julia> CUDA.versioninfo()
CUDA toolkit 10.2.89, local installation
CUDA driver 10.2.0
NVIDIA driver 440.33.1

Libraries: 
- CUBLAS: 10.2.2
- CURAND: 10.1.2
- CUFFT: 10.1.2
- CUSOLVER: 10.3.0
- CUSPARSE: 10.3.1
- CUPTI: 12.0.0
- NVML: 10.0.0+440.33.1
- CUDNN: 8.0.5 (for CUDA 10.2.0)
- CUTENSOR: 1.2.1 (for CUDA 10.2.0)

Toolchain:
- Julia: 1.5.3
- LLVM: 9.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4
- Device support: sm_30, sm_32, sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75

Environment:
- JULIA_CUDA_USE_BINARYBUILDER: false

16 devices:
  0: Tesla V100-SXM3-32GB (sm_70, 31.365 GiB / 31.749 GiB available)
...

maleadt · November 25, 2020, 9:07pm

Did you disable scalar iteration? See the Workflow page of the CUDA.jl documentation.

eaubanel · November 26, 2020, 4:17pm

Thanks for the suggestion. I know I’ve got scalar iteration on line 6 of the function:

s_d[1]≤ S && (F_d[s_d[1]+1,1] = 1)

and in the if statement in the loop, where I access s_d[j], but I don’t see how it would affect performance so much, or why it gave me good performance previously.

maleadt · November 26, 2020, 4:43pm

You’d better mark those explicitly using @allowscalar(...) then, so that you can still disable it globally and make sure there are no other code paths that accidentally hit it.

eaubanel · November 26, 2020, 5:48pm

I marked all the scalar bits as you suggested, and still get very slow timing and huge memory allocation:

julia> CUDA.allowscalar(false)
julia> @btime subsetSumCuArrays($s, $S)
  2.762 s (898 allocations: 7.31 GiB)
false

Here’s how I modified the function:

function subsetSumCuArrays(s, S)
    n = length(s)
    F_d = CUDA.zeros(Int8, S+1, n)
    s_d = CuArray{Int64,1}(s)
    F_d[1,:] .= 1
    CUDA.@allowscalar(s_d[1]≤ S && (F_d[s_d[1]+1,1] = 1))
    @views for j in 2:n
      F_d[2:S+1,j] .=  F_d[2:S+1,j-1]
      if(CUDA.@allowscalar(s_d[j] <= S))
        F_d[CUDA.@allowscalar(s_d[j]+1):S+1,j] .= F_d[CUDA.@allowscalar(s_d[j]+1):S+1,j] .| F_d[1:CUDA.@allowscalar(S+1-s_d[j]),j-1]
      end
    end
    synchronize()
    return Bool(CUDA.@allowscalar(F_d[S+1,n]))
end
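
A possible alternative, sketched here rather than taken from the thread: since s is only read one element at a time, it can stay on the host, which removes the need for @allowscalar inside the loop. The names subset_sum! and reachable are hypothetical, and F is assumed to be preallocated by the caller (a plain Array for a CPU check, or something like CUDA.zeros(Int8, S+1, length(s)) on the GPU).

```julia
# Hypothetical variant: `s` is a host Vector, so reading s[j] never touches the GPU.
# `F` must have size (S + 1, length(s)); it is overwritten in place.
function subset_sum!(F::AbstractMatrix{Int8}, s::AbstractVector{<:Integer}, S::Integer)
    n = length(s)
    @assert size(F) == (S + 1, n)
    fill!(F, 0)
    F[1, :] .= 1                       # the empty subset always reaches sum 0
    if s[1] <= S
        F[s[1]+1:s[1]+1, 1] .= 1       # one-element range + broadcast, not a scalar write
    end
    @views for j in 2:n
        F[2:S+1, j] .= F[2:S+1, j-1]   # sums reachable without s[j]
        sj = s[j]                      # plain host read
        if sj <= S
            F[sj+1:S+1, j] .= F[sj+1:S+1, j] .| F[1:S+1-sj, j-1]
        end
    end
    return F
end

# Copy the single answer cell back to the host as a 1x1 array and test it.
reachable(F) = Array(F[end:end, end:end])[1] == 1
```

On the CPU, reachable(subset_sum!(zeros(Int8, 9, 3), [3, 5, 7], 8)) checks whether some subset of [3, 5, 7] sums to 8. This sketch does not address the allocation problem itself; it only removes the scalar GPU reads.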

maleadt · November 27, 2020, 7:44am

OK, that looks like a bug then; please file an issue (with a fully reproducible example, etc.). You could also try to reduce the code some more, or profile individual operations, to see where the issue is.
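
One way to profile individual operations, as a rough sketch: measure the host-side allocations of each broadcast in a single column update separately, using Base's @allocated. The helper name profile_column! is hypothetical, and the assumption is that the allocations reported by @btime are host allocations, which is what @allocated counts as well.

```julia
# Hypothetical helper: perform one column update of the DP table and report
# the bytes allocated on the host by each of the two broadcasts.
function profile_column!(F::AbstractMatrix{Int8}, sj::Integer, j::Integer)
    S = size(F, 1) - 1
    copy_bytes = @allocated @views F[2:S+1, j] .= F[2:S+1, j-1]
    or_bytes = @allocated @views F[sj+1:S+1, j] .= F[sj+1:S+1, j] .| F[1:S+1-sj, j-1]
    return (copy_bytes = copy_bytes, or_bytes = or_bytes)
end
```

Calling it twice and looking at the second result avoids counting compilation. The same function accepts a CuArray for F, so the numbers for Array and CuArray can be compared directly.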

eaubanel · November 27, 2020, 12:09pm

Thanks, I’ll do that, and come up with a simpler example.

eaubanel · December 4, 2020, 3:18pm

After some testing with simpler examples I finally hit on the problem, which seems to have something to do with bounds checking. When I add @inbounds after @views at the beginning of the for loop, the time and memory usage drop dramatically:

4.366 ms (743 allocations: 21.34 KiB)

So it doesn’t appear to be a bug after all. Curious that this problem didn’t surface with the older version of CuArrays.
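
For reference, the change described here amounts to one added macro on the loop. The following is a reconstruction from that description, applied to the function posted earlier in the thread; the name subsetSumInbounds is hypothetical, and s_d[j] is read once per iteration into sj instead of being repeated inside each range.

```julia
using CUDA

function subsetSumInbounds(s, S)
    n = length(s)
    F_d = CUDA.zeros(Int8, S+1, n)
    s_d = CuArray{Int64,1}(s)
    F_d[1,:] .= 1
    CUDA.@allowscalar(s_d[1] ≤ S && (F_d[s_d[1]+1,1] = 1))
    # @inbounds added after @views: bounds checks are skipped inside the loop
    @views @inbounds for j in 2:n
        F_d[2:S+1,j] .= F_d[2:S+1,j-1]
        sj = CUDA.@allowscalar s_d[j]   # single explicit scalar read
        if sj <= S
            F_d[sj+1:S+1,j] .= F_d[sj+1:S+1,j] .| F_d[1:S+1-sj,j-1]
        end
    end
    synchronize()
    return Bool(CUDA.@allowscalar(F_d[S+1,n]))
end
```

Since @inbounds turns off the checks, a wrong range would no longer raise an error; the sj <= S guard is what keeps the ranges valid here.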

maleadt · December 5, 2020, 7:55pm

That’s probably https://github.com/JuliaGPU/CUDA.jl/pull/404 – if you think the bounds checking regressed, please open an issue.

eaubanel · December 9, 2020, 7:20pm

Thanks for prompting me; I just submitted an issue.
