How to accelerate GPU operations?

Hello, everyone in the Julia community. I’m currently working on a GPU project in which I need to perform certain operations on parts of a vector. These operations can be parallelized, and on the CPU they run very fast; however, when I moved them to the GPU, there was a significant slowdown. This might be due to too many kernel launches. Does anyone have any optimization ideas?

using Pkg
Pkg.activate("test_cpu_env")
using CUDA
using LinearAlgebra
using Base.Threads
using Logging
Logging.with_logger(Logging.NullLogger()) do
    CUDA.allowscalar(true)
end

device = CUDA.device()
println("GPU Device Name: ", CUDA.name(device))
println("GPU Compute Capability: ", CUDA.capability(device))



N = 1000000 
cpu_array = rand(N)
gpu_array = CuArray(cpu_array)


function soc_proj_const_scale!(sol::CuArray)
    x2end = sol[2:end]
    t = sol[1]
    nrm_x = CUDA.norm(x2end)
    if nrm_x <= -t
        CUDA.fill!(sol, 0.0)
        return
    elseif nrm_x <= t
        return
    else
        c = (1.0 + t/nrm_x)/2.0
        sol[1] = nrm_x * c
        x2end .*= c
        return
    end
end

function soc_proj_const_scale!(sol::Vector{Float64})
    x2end = sol[2:end]
    t = sol[1]
    nrm_x = norm(x2end)
    if nrm_x <= -t
        fill!(sol, 0.0)
        return
    elseif nrm_x <= t
        return
    else
        c = (1.0 + t/nrm_x)/2.0
        sol[1] = nrm_x * c
        x2end .*= c
        return
    end
end

blkLen = Int(N / 1000)
n = Int(N / blkLen)
println("n: ", n)
slice = Vector{CuArray}(undef, blkLen)
slice_cpu = Vector{Vector{Float64}}(undef, blkLen)
for i in 1:blkLen
    slice[i] = @view gpu_array[(i - 1)*n + 1:i*n]
    slice_cpu[i] = @view cpu_array[(i - 1)*n + 1:i*n]
end

time_start = time()
@threads for i in 1:blkLen
    soc_proj_const_scale!(slice[i])
end
time_end = time()
println("Time: ", time_end - time_start)

time_start = time()
@threads for i in 1:blkLen
    soc_proj_const_scale!(slice_cpu[i])
end
time_end = time()
println("Time: ", time_end - time_start)

# return 
#  Activating project at `~/fork/pdhg_clp/test_cpu_env`
# GPU Device Name: NVIDIA A800 80GB PCIe
# GPU Compute Capability: 8.0.0
# n: 1000
# Time: 10.857582092285156
# Time: 0.12517809867858887

Hi,

Not directly related to optimisation, but worth mentioning is that in

x2end = sol[2:end]
...
x2end .*= c
return

you are making a copy (both in the CPU and the GPU version), which you then modify in place. These changes will not be reflected in the original sol array. You probably want to use a view, which would presumably also be a bit faster.

Also, it’s (in general) a bad idea to have a blanket CUDA.allowscalar(true) statement. It would be better to only locally use a CUDA.@allowscalar, i.e. CUDA.@allowscalar t = sol[1] and CUDA.@allowscalar sol[1] = nrm_x * c. In this example it does not matter, but a global CUDA.allowscalar(true) could obscure places where you accidentally execute code in a serial (i.e. slow) fashion.
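
For concreteness, here is a minimal sketch of the GPU method with both fixes applied (essentially the version that appears further down this thread):

function soc_proj_const_scale!(sol::CuVector)
    x2end = @view sol[2:end]      # a view, so the in-place scaling writes through to sol
    CUDA.@allowscalar t = sol[1]  # scalar indexing, allowed only locally
    nrm_x = CUDA.norm(x2end)
    if nrm_x <= -t
        CUDA.fill!(sol, 0.0)
    elseif nrm_x > t              # the nrm_x <= t case is a no-op
        c = (1.0 + t/nrm_x)/2.0
        CUDA.@allowscalar sol[1] = nrm_x * c
        x2end .*= c
    end
    return
end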

As for some actual optimisation, in my intuition it would be easiest to work on a matrix instead of on slices, and to write vectorised code.

function soc_proj_const_scale!(solm::CuMatrix)
    x2ends = @view solm[2:end, :]
    ts = @view solm[1, :]
    nrm_xs = sqrt.(mapreduce(t -> t^2, +, x2ends, dims=1)) |> vec
    solm[:, nrm_xs .<= -ts] .= 0
    b = nrm_xs .> ts
    ts_b = ts[b]  # No view as cs_b line below will error
    nrm_xs_b = @view nrm_xs[b]
    cs_b = (1.f0 .+ ts_b ./ nrm_xs_b) ./ 2
    # Note that on the GPU you want to use 32-bit floats. 
    # (If solm uses Float64, 1.f0 will be promoted to Float64 anyway.)
    solm[1, b] .= nrm_xs_b .* cs_b
    solm[2:end, b] .*= cs_b'  # e.g. 999 x 123 .* 1 x 123
end

@btime (CUDA.@sync soc_proj_const_scale!(solm)) setup=(solm=stack(slice));
#   8.998 ms (3675 allocations: 110.53 KiB)
# Compared to something like 274.560 ms (116221 allocations: 2.95 MiB) for the old GPU version, with view

For the CPU version I get

julia> @btime (begin
            @threads for i in 1:blkLen
                soc_proj_const_scale!(s[i])  # (with view)
            end
       end) setup=(s=copy(slice_cpu))
  81.100 μs (44 allocations: 5.08 KiB)

which is still a lot faster (and much lower than what a simple @time, or your measurement code, gave me, so perhaps I’m making some benchmarking error here). But N = 10^6 is not particularly large, of course.

In any case, the GPU version can be optimised further: e.g. we don’t need the temporary array from the reduce, before the sqrt., we could reuse b also for nrm_xs .<= ts, we could probably improve the memory access order etc. For optimal performance you would probably need to write a custom kernel, but I’ll let you decide if that’s worth the effort.
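
For instance, the first of these: since vec of a CuMatrix is just a reshape (no copy), the square roots can be taken in place, reusing the reduce output.

nrm_xs = vec(mapreduce(t -> t^2, +, x2ends, dims=1))
nrm_xs .= sqrt.(nrm_xs)  # in place, avoiding the temporary from sqrt.(...)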

Thank you for your response. I do not use matrices in this project because, in practical applications, the vectors have varying lengths. I apologize for any confusion this may have caused.

Agree! I have modified the useless copy by using a view, but the result is still

GPU Device Name: NVIDIA A800 80GB PCIe
GPU Compute Capability: 8.0.0
n: 1000
Time: 9.503863096237183
Time: 0.08735108375549316

Thank you for your answer on using CUDA.@allowscalar locally. I will test the effect on my real project. I didn’t know the difference between using it locally and globally. Thank you very much.

I’m pretty sure this includes compilation, which can probably be ignored for the real project (assuming you run the function sufficiently many times). Compare these 9.5 s to the 275 ms I found using the same approach, on a weaker GPU (RTX 3070).

This is part of the problem, but not the dominating factor. For example, consider these two methods to compute the norm of each CuArray in slice.

julia> @btime CUDA.@sync norm.($slice);
  119.112 ms (10008 allocations: 164.45 KiB)

julia> @btime CUDA.@sync sqrt.(mapreduce(t -> t^2, +, $(stack(slice)), dims=1));
  137.100 μs (124 allocations: 3.45 KiB)

Of this first time, only some 5 ms is due to launching kernels (found using CUDA.@profile). Compared to the second method, this is clearly significant. But there’s much more going on to slow down the first method.
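
For reference, that per-kernel breakdown can be obtained with something like

CUDA.@profile CUDA.@sync norm.(slice)  # prints host/device activity, incl. kernel launch counts and times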

(For reference, on the CPU I get

julia> @btime sqrt.(mapreduce(t -> t^2, +, $(stack(slice_cpu)), dims=1));
  172.400 μs (2 allocations: 15.88 KiB)

julia> @btime norm.($slice_cpu);
  348.400 μs (1 allocation: 7.94 KiB)

.)

Dealing with data of varying lengths is not ideal for GPUs. Is zero-padding the data to equal length an option?
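
A minimal padding sketch, assuming Float32 data and a vector slices_cpu of subvectors of varying lengths (zero entries do not change the norms, and the scaling keeps them zero):

max_len = maximum(length, slices_cpu)
padded = zeros(Float32, max_len, length(slices_cpu))
for (j, s) in enumerate(slices_cpu)
    padded[1:length(s), j] .= s       # remaining rows stay zero
end
solm = CuArray(padded)                # now the vectorised CuMatrix method above applies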

Additionally, when using a Vector{CuVector} the data is not stored contiguously, which is also not ideal. A possible approach might be to keep the flattened data (as in gpu_array) and store the subarray boundaries (corresponding to the lengths, via a cumulative sum) as a CuVector{Int}. You could then write a first kernel to compute the norm of each subarray. Each thread, (conceptually) corresponding to an element in data, could add its squared value to the appropriate slot in an accumulator array, where this slot is determined by boundaries. But this is quite tricky for reasons of synchronisation. Working atomically would be easiest, but will not be the most performant; see the sketch below.
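
A minimal sketch of the atomic variant, assuming a precomputed slice_ids::CuVector{Int32} mapping each data index to its subarray, and an accumulator acc = CUDA.zeros(Float32, nb_slices):

using CUDA
using CUDA: i32

function sq_norms_atomic_kernel!(acc, data, slice_ids)
    i = threadIdx().x + (blockIdx().x - 1i32) * blockDim().x
    if i <= length(data)
        # Each thread atomically adds its squared element to its subarray's slot;
        # afterwards acc[j] holds the squared norm of subarray j.
        @inbounds CUDA.@atomic acc[slice_ids[i]] += data[i]^2
    end
    return
end
# launch e.g. with: @cuda threads=256 blocks=cld(length(data), 256) sq_norms_atomic_kernel!(acc, data, slice_ids)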

These kinds of reduction operations normally first write to shared memory, but that also does not seem easy when the subarrays have different lengths.

Alternatively, you might perform a complete cumulative sum on the squared entries in data and then take appropriate differences to obtain the squared norms of the subarrays. That does not seem too difficult, but I’m a bit concerned about numerical errors.
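
The differencing step itself is simple; a CPU illustration (sq_norm is a hypothetical helper, with a and b the 1-based boundaries of a subarray):

cs = cumsum(data_cpu .^ 2)
sq_norm(a, b) = cs[b] - (a == 1 ? zero(eltype(cs)) : cs[a - 1])  # == sum(abs2, data_cpu[a:b])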

Interesting observation! So maybe the Vector{CuArray} data structure is not efficient, and I should consider another method to avoid it, like an index array. Actually, the reason for using this structure is that I need to take views many times in the CPU version of my project, so I store the view results in one Vector{CuArray}.

Thank you for your suggestion! My final goal is to do a multi-block cone projection. For every small vector (of varying lengths), we need to do operations like norms, elementwise vector-vector products, scaling, and comparisons with some scalar (for a binary search on one value); the operation logic differs between subvectors.

My bad, I forgot that slice is constructed using views of contiguous data.


Another relevant point here is that in the second case the result is on the device (GPU), whereas in the first case it is on the host (CPU). So we need to copy a scalar from device to host 1000 times, which incurs significant overhead. The same is true in CUDA.@allowscalar t = sol[1] and CUDA.@allowscalar sol[1] = nrm_x * c (another good reason to avoid a global CUDA.allowscalar(true) :slight_smile: ).

julia> using BenchmarkTools, CUDA

julia> x = [1.f0];

julia> @btime CUDA.@sync Array(CuArray($x));  # (The sync is not needed)
  49.600 μs (16 allocations: 432 bytes)

So if you do this 1000 times like is the case in the soc_proj_const_scale! loop, you would spend 50 ms just copying over data. This is already over 5x slower than the entire stack CuMatrix approach!

Here’s that idea worked out.

using CUDA
using CUDA: i32
using LinearAlgebra
using Base.Threads
using BenchmarkTools

function generate_data(nb_slices=1000, min_slice_length=500, max_slice_length=1500)
	# Generate slices of contiguous data (views), of heterogeneous length.
	# Also output the underlying 1D data and slice lengths, both on GPU and CPU.
	slice_lengths_cpu = [rand(min_slice_length:max_slice_length) for _ in 1:nb_slices]
	slice_lengths = CuArray(slice_lengths_cpu)
	slice_lengths_cs_cpu = cumsum(slice_lengths_cpu)

	data = CUDA.rand(slice_lengths_cs_cpu[end])  # (or sum(slice_lengths_cpu) )
	data_cpu = Array(data)
	slices = [i == 1 ? @view(data[1:slice_lengths_cs_cpu[1]]) : @view(data[slice_lengths_cs_cpu[i - 1] + 1 : slice_lengths_cs_cpu[i]]) for i in eachindex(slice_lengths_cpu)]
	slices_cpu = [i == 1 ? @view(data_cpu[1:slice_lengths_cs_cpu[1]]) : @view(data_cpu[slice_lengths_cs_cpu[i - 1] + 1 : slice_lengths_cs_cpu[i]]) for i in eachindex(slice_lengths_cpu)]
	
	return data, slices, slice_lengths, data_cpu, slices_cpu, slice_lengths_cpu
end

# Original code
function soc_proj_const_scale!(sol::AbstractVector)
    x2end = @view sol[2:end]
    t = sol[1]
    nrm_x = norm(x2end)
    if nrm_x <= -t
        fill!(sol, 0.0)
        return
    elseif nrm_x <= t
        return
    else
        c = (1.0 + t/nrm_x)/2.0
        sol[1] = nrm_x * c
        x2end .*= c
        return
    end
end

function soc_proj_const_scale!(sol::CuVector)
    x2end = @view sol[2:end]
    CUDA.@allowscalar t = sol[1]
    nrm_x = CUDA.norm(x2end)
    if nrm_x <= -t
        CUDA.fill!(sol, 0.0)
        return
    elseif nrm_x <= t
        return
    else
        c = (1.0 + t/nrm_x)/2.0
        CUDA.@allowscalar sol[1] = nrm_x * c
        x2end .*= c
        return
    end
end

# Both on CPU and GPU
function soc_proj_const_scale_all!(slices)
	@threads for s in slices
		soc_proj_const_scale!(s)
    end
end

# Alternative GPU implementation
function soc_proj_const_scale_all!(data, slice_lengths)
	# data::CuVector{Float32}, slice_lengths::CuVector{Int32}
	data_sq = data .^ 2
	cs_data_sq = accumulate(+, data_sq)
	slice_ends = accumulate(+, slice_lengths)  # slices[i][end] == data[slice_ends[i]]
	slice_starts = CuVector{Int32}(undef, length(slice_lengths))  # slices[i][begin] == data[slice_starts[i]]
	CUDA.@allowscalar slice_starts[1] = 1i32
	slice_starts[2:end] .= @view(slice_ends[1:end-1]) .+ 1i32
	
	norms = CuVector{Float32}(undef, length(slice_lengths))  # When we're done: norms[i] == norm(slices[i][2:end]) (possibly some degree of numerical errors though)
	CUDA.@allowscalar norms[1] = sqrt.(cs_data_sq[slice_ends[1]] - data_sq[1])
	@views norms[2:end] .= sqrt.(cs_data_sq[slice_ends[2:end]] .- cs_data_sq[slice_ends[1:end-1]] .- data_sq[slice_starts[2:end]])
	
	data_idx_is_slice_start = CUDA.zeros(Int32, length(data))  
	data_idx_is_slice_start[slice_starts] .= 1i32
	data_slice_indices = accumulate(+, data_idx_is_slice_start)  # If data index i belongs to slice j, we have data_slice_indices[i] == j
	
	ts = @view data[slice_starts]
	
	# Find optimal number of threads per block
	kernel = @cuda launch=false soc_proj_const_scale_kernel!(data, ts, norms, data_idx_is_slice_start, data_slice_indices)
    config = launch_configuration(kernel.fun)
    @cuda blocks=max(cld(length(data), config.threads), config.blocks) threads=config.threads soc_proj_const_scale_kernel!(data, ts, norms, data_idx_is_slice_start, data_slice_indices)
end

function soc_proj_const_scale_kernel!(data, ts, norms, data_idx_is_slice_start, data_slice_indices) 
	data_idx = threadIdx().x + (blockIdx().x - 1i32) * blockDim().x
	while data_idx <= length(data)
		slice_idx = data_slice_indices[data_idx]
		nrm = norms[slice_idx]
		inv_nrm = 1.f0 / nrm
		t = ts[slice_idx]
		
		if nrm <= -t
			data[data_idx] = 0
		elseif nrm > t
			c = (1.f0 + t * inv_nrm) * 0.5f0
			if data_idx_is_slice_start[data_idx] == 1i32
				data[data_idx] = nrm * c
			else
				data[data_idx] *= c
			end
		end
		
		data_idx += blockDim().x * gridDim().x
	end
	return
end

device = CUDA.device()
println("GPU Device Name: ", CUDA.name(device))
println("GPU Compute Capability: ", CUDA.capability(device))

@benchmark (CUDA.@sync soc_proj_const_scale_all!(data, slice_lengths)) setup=(gen_data_tuple=generate_data(); data=gen_data_tuple[1]; slice_lengths=gen_data_tuple[3]);  # New GPU version
@benchmark (CUDA.@sync soc_proj_const_scale_all!(slices)) setup=(slices=generate_data()[2]);  # Original GPU version
@benchmark (soc_proj_const_scale_all!(slices)) setup=(slices=generate_data()[5])  # (Original) CPU version

For 1000 slices of average length 1000 I get the following output

julia> println("GPU Device Name: ", CUDA.name(device))
GPU Device Name: NVIDIA GeForce RTX 3070

julia> println("GPU Compute Capability: ", CUDA.capability(device))
GPU Compute Capability: 8.6.0

julia> @benchmark (CUDA.@sync soc_proj_const_scale_all!(data, slice_lengths)) setup=(gen_data_tuple=generate_data(); data=gen_data_tuple[1]; slice_lengths=gen_data_tuple[3])
BenchmarkTools.Trial: 1458 samples with 1 evaluation.
 Range (min … max):  1.479 ms … 121.805 ms  ┊ GC (min … max):  0.00% … 84.88%
 Time  (median):     1.881 ms               ┊ GC (median):     0.00%
 Time  (mean ± σ):   2.367 ms ±   6.185 ms  ┊ GC (mean ± σ):  14.49% ±  5.44%

           ▅▆▆█▆▆▆▅▃▁▂▁▁▃
  ▂▂▁▁▂▄▅▅████████████████▇▆▇▆▅▆▄▅▄▄▄▄▄▃▄▃▃▃▃▃▂▃▂▃▃▂▂▂▂▂▂▂▁▂▂ ▄
  1.48 ms         Histogram: frequency by time        2.87 ms <

 Memory estimate: 74.22 KiB, allocs estimate: 2697.

julia> @benchmark (CUDA.@sync soc_proj_const_scale_all!(slices)) setup=(slices=generate_data()[2])
BenchmarkTools.Trial: 19 samples with 1 evaluation.
 Range (min … max):  245.893 ms … 277.684 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     264.101 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   263.120 ms ±  10.266 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                                  █
  ▇▁▁▁▁▁▁▇▇▁▁▇▁▇▇▁▁▁▁▁▁▁▇▁▁▁▁▁▁▇▇▁▁▁▇▁▁▇▁▁▇▁▁▁▁▁▁▁█▁▁▁▁▁▁▇▁▇▇▁▇ ▁
  246 ms           Histogram: frequency by time          278 ms <

 Memory estimate: 1.69 MiB, allocs estimate: 68297.

julia> @benchmark (soc_proj_const_scale_all!(slices)) setup=(slices=generate_data()[5])
BenchmarkTools.Trial: 1693 samples with 1 evaluation.
 Range (min … max):  215.100 μs …   5.971 ms  ┊ GC (min … max): 0.00% … 74.24%
 Time  (median):     675.000 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   719.413 μs ± 481.005 μs  ┊ GC (mean ± σ):  1.40% ±  4.23%

  █  ▁
  ██▆█▃▂▂▂▃▂▂▂▃▂▂▃▃▂▂▃▃▂▃▄▃▄▄▄▄▃▃▃▃▂▂▂▂▂▂▃▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  215 μs           Histogram: frequency by time         1.96 ms <

 Memory estimate: 4.86 KiB, allocs estimate: 41.

I’m not sure why the CPU version is still faster than the GPU version, though I’m sure the latter can still be improved (use @profile to find bottlenecks).

The numerical errors seem to be relatively minor, though this will depend on use-case. For example, for some random data the first and last entries of the last slice (where the numerical errors should be largest) are

 7.365164
 0.43124264

on the CPU, and

 7.3655076
 0.43124253

on the GPU.

If we increase the number of slices from 1000 to 50_000

julia> @benchmark (CUDA.@sync soc_proj_const_scale_all!(data, slice_lengths)) setup=(gen_data_tuple=generate_data(50_000); data=gen_data_tuple[1]; slice_lengths=gen_data_tuple[3])
BenchmarkTools.Trial: 47 samples with 1 evaluation.
 Range (min … max):  17.988 ms … 33.174 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     24.123 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   24.654 ms ±  3.490 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

          ▃▃▃   ▃  ▃  █▃ ▃█     ▃    ▃  ▃      ▃▃
  ▇▁▁▁▁▁▁▇███▇▁▁█▇▇█▇▇██▁██▁▁▇▁▁█▇▇▁▇█▇▇█▇▁▁▇▁▁██▁▁▁▇▁▁▁▁▁▁▁▇ ▁
  18 ms           Histogram: frequency by time        33.2 ms <

 Memory estimate: 74.38 KiB, allocs estimate: 2707.

julia> @benchmark (CUDA.@sync soc_proj_const_scale_all!(slices)) setup=(slices=generate_data(50_000)[2])
BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 12.791 s (0.00% GC) to evaluate,
 with a memory estimate of 83.67 MiB, over 3407626 allocations.

julia> @benchmark (soc_proj_const_scale_all!(slices)) setup=(slices=generate_data(50_000)[5])
BenchmarkTools.Trial: 51 samples with 1 evaluation.
 Range (min … max):  14.192 ms … 18.877 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     16.128 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   16.337 ms ±  1.370 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃     █ ▃       █ ▃    ▃       ▃  ▃   █           ▃     ▃
  █▁▇▁▇▇█▁█▇▇▇▁▁▇▇█▇█▁▇▁▇█▇▇▁▇▁▁▇█▁▇█▁▇▁█▇▁▇▇▁▁▁▁▇▇▇█▇▁▁▁▁█▁▇ ▁
  14.2 ms         Histogram: frequency by time        18.9 ms <

 Memory estimate: 4.86 KiB, allocs estimate: 41.

the gap between GPU and CPU performance closes, though the CPU version is still fastest.
(Edit: using @inbounds drops the median time for the new GPU version to 16.746 ms.)

Thank you very much for your explanations and the related code examples. I am a beginner in Julia and may not be familiar with many of its acceleration techniques. In the end, I chose to write a kernel in CUDA C, rewriting the relevant norm calculation and the simple vector BLAS operations myself; these vector BLAS operations are not difficult. I then compiled them to PTX and called them from Julia. Finally, my test results were almost identical to those on the CPU, and the GPU was even faster when the problem dimensions and the number of blocks were larger.

#include <thrust/device_vector.h>
#include <thrust/fill.h>

#define positive_zero 1e-20
#define negative_zero -1e-20
#define proj_rel_tol 1e-14
#define proj_abs_tol 1e-16

// n is the length of the vector, including the first element
// len is the length of the vector, not including the first element or the top two elements

// BLAS functions
__device__ double nrm2(long *n, double *x)
{
  // calculate the norm of the vector x
  double norm = 0.0;
  for (long j = 0; j < *n; ++j)
  {
    norm += x[j] * x[j];
  }
  return sqrt(norm);
}


// END OF BLAS FUNCTIONS

__device__ void soc_proj(double *sol, long *n)
{
  long len = *n - 1;
  double norm = nrm2(&len, &sol[1]);

  double t = sol[0];

  if (norm <= -t)
  {
    for (long j = 0; j < *n; ++j)
    {
      sol[j] = 0.0;
    }
  }
  else if (norm <= t)
  {
    // Do nothing, continue with the next iteration
  }
  else
  {
    double c = (1.0 + t / norm) / 2.0;
    sol[0] = norm * c;
    for (long j = 1; j < *n; ++j)
    {
      sol[j] = sol[j] * c;
    }
  }
}


extern "C" __global__ void
soc_proj_kernel(double *arr, const long *gpu_head_start, const long *ns, int blkNum)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  // one thread per cone projection
  if (i < blkNum)
  // if (i == 0)
  {
    long n = ns[i];
    double *sol = arr + gpu_head_start[i];
    soc_proj(sol, &n);
  }
}
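
For reference, a .cu file like the one above can be compiled to PTX along these lines; the exact invocation is an assumption and depends on your CUDA toolkit and target architecture.

# Illustrative only: invoke nvcc from Julia (or run the same command in a shell)
run(`nvcc -ptx soc_proj.cu -o soc_proj.ptx`)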

using Pkg
Pkg.activate("test_cpu_env")
using Random
using CUDA
using LinearAlgebra
const rpdhg_float = Float64



function soc_proj!(sol::AbstractVector{rpdhg_float})
    t = sol[1]
    x = @view sol[2:end]
    nrm_x = LinearAlgebra.norm(x)
    if nrm_x <= -t
        sol .= 0.0
    elseif nrm_x <= t
        return
    else
        c = (1.0 + t/nrm_x)/2.0
        sol[1] = nrm_x * c
        x .= x * c
    end
end


# random seed
for seed in 1:2
  Random.seed!(seed)

## cpu version
blkNum = 10000
nEveryBlk = 10000
n = nEveryBlk * blkNum
vec_cpu_before_proj = rand(n)
vec_cpu_before_proj_copy = deepcopy(vec_cpu_before_proj)
len_blk = Vector{Int64}([nEveryBlk for _ in 1:blkNum])
cum_sum_len_blk = cumsum(len_blk) 
cum_sum_len_blk .-= len_blk[1]
@assert sum(len_blk) == n
sub_vec_list = []
for i in 1:length(len_blk)
    sub_vec = @view vec_cpu_before_proj[sum(len_blk[1:i-1])+1:sum(len_blk[1:i])]
    push!(sub_vec_list, sub_vec)
end
time_start = time()
for i in 1:length(len_blk)
    soc_proj!(sub_vec_list[i])
end
time_end = time()
println("cpu time: ", time_end - time_start)

ptx_path = "./code/nvcc_test/answer/soc_proj.ptx"
func_name = "soc_proj_kernel"

mod = CuModule(read(ptx_path))
kernel = CuFunction(mod, func_name)

N = n
nThread = 256
nBlock = Int64(ceil((N + nThread - 1) / nThread))

adevice = CuArray(vec_cpu_before_proj_copy)
gpu_head_start = CuArray(Vector{Int64}([cum_sum_len_blk...]))
gpu_ns = CuArray(Vector{Int64}(len_blk))
blkNum = length(len_blk)
time_start = time()
CUDA.cudacall(
  kernel,
  ( CuPtr{Float64}, CuPtr{Int64}, CuPtr{Int64}, Int64),
  adevice, gpu_head_start, gpu_ns, blkNum;
  blocks = nBlock, threads = nThread
)
# sync
CUDA.synchronize()
time_end = time()
println("gpu time: ", time_end - time_start)
# println("adevice: ", adevice)
result = Array(adevice)
println("result ≈ vec_cpu_before_proj in norm: ", norm(result - vec_cpu_before_proj, Inf))
@assert norm(result - vec_cpu_before_proj, Inf) < 1e-9
end
# one case result
cpu time: 0.19637703895568848
gpu time: 0.01563096046447754
result ≈ vec_cpu_before_proj in norm: 1.4921397450962104e-13

The equivalent code in Julia would be very similar:

function nrm2(x)
  # calculate the norm of the vector x
  sq_norm = 0.0f0
  for j in eachindex(x)  # or j = 1:length(x)
    @inbounds sq_norm += x[j] * x[j]
  end
  return sqrt(sq_norm)
end

function soc_proj(sol)
  @inbounds norm = nrm2(@view(sol[2:end]))
  @inbounds t = sol[1]

  if norm <= -t
    for j in eachindex(sol)
      @inbounds sol[j] = 0.0f0;
    end
  elseif norm <= t
    # Do nothing, continue with the next iteration
  else
    c = (1.0f0 + t / norm) / 2.0f0;
    @inbounds sol[1] = norm * c;
    for j = 2:length(sol)
      @inbounds sol[j] = sol[j] * c;
    end
  end
end

function soc_proj_kernel(arr, gpu_head_start, ns, blkNum)
  i = (blockIdx().x - 1i32) * blockDim().x + threadIdx().x;
  # one thread per cone projection
  if (i <= blkNum)
    @inbounds n = ns[i]
    @inbounds @views sol = (arr[gpu_head_start[i] : gpu_head_start[i] + n - 1i32])
    soc_proj(sol)
  end
  return 
end

function soc_proj_const_scale_all!(data, slice_lengths, slice_starts)
	# data::CuVector{Float32}, slice_lengths::CuVector{Int32}, slice_starts::CuVector{Int32}
	nb_threads = 256
    @cuda blocks=cld(length(data), nb_threads) threads=nb_threads soc_proj_kernel(data, slice_starts, slice_lengths, length(slice_starts))
end

Basically, use 1-based indexing and replace the pointer arithmetic with @views. (I’ve also switched to Float32, but maybe your beefy GPU handles double precision well.) The @inbounds annotations for removing bounds checks are not necessary, but they can increase performance. (Alternatively, run Julia with --check-bounds=no.)


This is not the best way to get an accurate execution time measurement, as you only take a single sample (which might be affected by random ‘noise’, and also by compilation time). Instead, prefer to use BenchmarkTools.jl, Chairmarks.jl, or CUDA.jl specific methods such as CUDA.@bprofile.
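
For example, mirroring the setup-based benchmarks above (the setup regenerates the data for each sample, since the function mutates it):

using BenchmarkTools, CUDA

@btime (CUDA.@sync soc_proj_const_scale_all!(data, slice_lengths)) setup=(gen_data_tuple=generate_data(); data=gen_data_tuple[1]; slice_lengths=gen_data_tuple[3])
CUDA.@bprofile soc_proj_const_scale_all!(data, slice_lengths)  # CUDA.jl's built-in benchmarking profiler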


If I compare all methods we discussed, your last approach turns out to be fastest. (I’m guessing my previous suggestion allocates too much.) Here are my benchmark results:

GPU Device Name: NVIDIA GeForce RTX 3070
GPU Compute Capability: 8.6.0


UNIFORM DATA: 10_000 slices of length 10_000
--------------------------------------------
GPU v1 (@threads, kernel launch per slice)
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  2.468 s …  2.477 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.473 s             ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.473 s ± 6.297 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                      █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  2.47 s        Histogram: frequency by time        2.48 s <

 Memory estimate: 16.74 MiB, allocs estimate: 681105.


GPU v2 (stack, vectorised code)
BenchmarkTools.Trial: 14 samples with 1 evaluation.
 Range (min … max):  14.537 ms … 83.696 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     29.550 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   36.323 ms ± 19.195 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ▄█    ▁
  ▆▁▁▁▁▁▁▁▁▁▁██▆▆▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▁▁▁▁▁▆ ▁
  14.5 ms         Histogram: frequency by time        83.7 ms <

 Memory estimate: 189.73 KiB, allocs estimate: 3685.


GPU v3 (GPU threads corresponding to data elements)
BenchmarkTools.Trial: 35 samples with 1 evaluation.
 Range (min … max):  25.561 ms …  29.311 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     26.590 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.780 ms ± 688.798 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ▁█ ▄▁█▁▄▁                   ▁
  ▆▁▁▁▁▆▁▁▁▁▁██▆██████▆▆▆▆▁▁▆▁▁▁▁▁▁▆▁▆▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆ ▁
  25.6 ms         Histogram: frequency by time         29.3 ms <

 Memory estimate: 46.73 KiB, allocs estimate: 1733.


GPU v4 (GPU threads corresponding to slices)
BenchmarkTools.Trial: 39 samples with 1 evaluation.
 Range (min … max):  10.042 ms …  11.880 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     10.833 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.842 ms ± 616.806 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ██                  ▁              ▁    ▁       ▁  █
  ██▆▆▆▁▁▁▁▁▁▆▆▁▁▆▆▁▁▆█▆▁▁▁▆▆▁▁▆▆▁▁▆▁█▁▁▁▁█▁▁▆▁▁▆▁█▁▁█▁▁▁▆▁▆▁▆ ▁
  10 ms           Histogram: frequency by time         11.9 ms <

 Memory estimate: 912 bytes, allocs estimate: 39.


CPU v1 (@threads, CPU threads corresponding to slices)
BenchmarkTools.Trial: 31 samples with 1 evaluation.
 Range (min … max):  25.889 ms …  29.299 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     26.575 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.801 ms ± 897.141 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃  ▃██ ▃    ▃▃▃
  █▁▇███▁█▇▁▁▇███▁▇▇▁▁▁▇▁▁▁▁▇▁▁▇▁▁▁▁▁▇▁▇▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▇▁▁▁▇ ▁
  25.9 ms         Histogram: frequency by time         29.3 ms <

 Memory estimate: 4.86 KiB, allocs estimate: 41.



NON-UNIFORM DATA: 10_000 slices of length between 1 and 20_001
--------------------------------------------------------------
GPU v1 (@threads, kernel launch per slice)
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  2.412 s …   2.500 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.456 s              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.456 s ± 62.486 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                       █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  2.41 s         Histogram: frequency by time         2.5 s <

 Memory estimate: 16.74 MiB, allocs estimate: 681059.


GPU v3 (GPU threads corresponding to data elements)
BenchmarkTools.Trial: 35 samples with 1 evaluation.
 Range (min … max):  25.111 ms …  27.938 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     25.557 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   25.969 ms ± 791.694 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁ █  █ ▄                      ▁  ▁ ▁
  █▆█▁▆█▆█▆▆▆▆▁▁▆▁▁▁▁▁▁▁▁▁▁▁▁▆▆▁█▁▆█▁█▆▁▆▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▁▁▁▆ ▁
  25.1 ms         Histogram: frequency by time         27.9 ms <

 Memory estimate: 46.73 KiB, allocs estimate: 1733.


GPU v4 (GPU threads corresponding to slices)
BenchmarkTools.Trial: 39 samples with 1 evaluation.
 Range (min … max):   9.728 ms …  12.006 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     10.555 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.434 ms ± 448.279 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂                    ▂▂█ ▅
  █▅▅▅▅▅▅▅▁▁▁█▁▁▅▅▁▅▁▁████▅█▅▁▅▅▅▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
  9.73 ms         Histogram: frequency by time           12 ms <

 Memory estimate: 912 bytes, allocs estimate: 39.


CPU v1 (@threads, CPU threads corresponding to slices)
BenchmarkTools.Trial: 31 samples with 1 evaluation.
 Range (min … max):  25.818 ms …  29.608 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     26.733 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   27.064 ms ± 979.965 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▁ ▁   █ ▁    ▁▁       ▁
  ▆▁▆▁▆▁█▆█▁▁▆█▆█▆▆▁▁██▁▆▁▁▁▁▆█▆▁▁▁▁▁▁▁▁▁▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▁▁▆▆ ▁
  25.8 ms         Histogram: frequency by time         29.6 ms <

 Memory estimate: 4.86 KiB, allocs estimate: 41.

The CPU is an i7-7700K.

Full code
using CUDA
using CUDA: i32
using LinearAlgebra
using Base.Threads
using BenchmarkTools

function generate_data(nb_slices=1000, min_slice_length=500, max_slice_length=1500)
	slice_lengths_cpu = [rand(min_slice_length:max_slice_length) for _ in 1:nb_slices]
	slice_lengths = CuArray(slice_lengths_cpu)
	slice_lengths_cs_cpu = cumsum(slice_lengths_cpu)

	data = CUDA.rand(slice_lengths_cs_cpu[end])  # (or sum(slice_lengths_cpu) )
	data_cpu = Array(data)
	slices = [i == 1 ? @view(data[1:slice_lengths_cs_cpu[1]]) : @view(data[slice_lengths_cs_cpu[i - 1] + 1 : slice_lengths_cs_cpu[i]]) for i in eachindex(slice_lengths_cpu)]
	slices_cpu = [i == 1 ? @view(data_cpu[1:slice_lengths_cs_cpu[1]]) : @view(data_cpu[slice_lengths_cs_cpu[i - 1] + 1 : slice_lengths_cs_cpu[i]]) for i in eachindex(slice_lengths_cpu)]
	
	slice_starts_cpu = Array{Int32}(undef, length(slices_cpu))
	slice_starts_cpu[1] = 1i32
	slice_starts_cpu[2:end] .= @view(slice_lengths_cs_cpu[1:end-1]) .+ 1i32
	slice_starts = CuArray(slice_starts_cpu)
	
	return data, slices, slice_lengths, slice_starts, data_cpu, slices_cpu, slice_lengths_cpu, slice_starts_cpu
end


function soc_proj_const_scale!(sol::AbstractVector)
    x2end = @view sol[2:end]
    t = sol[1]
    nrm_x = norm(x2end)
    if nrm_x <= -t
        fill!(sol, 0.0)
        return
    elseif nrm_x <= t
        return
    else
        c = (1.0 + t/nrm_x)/2.0
        sol[1] = nrm_x * c
        x2end .*= c
        return
    end
end

function soc_proj_const_scale!(sol::CuVector)
    x2end = @view sol[2:end]
    CUDA.@allowscalar t = sol[1]
    nrm_x = CUDA.norm(x2end)
    if nrm_x <= -t
        CUDA.fill!(sol, 0.0)
        return
    elseif nrm_x <= t
        return
    else
        c = (1.0 + t/nrm_x)/2.0
        CUDA.@allowscalar sol[1] = nrm_x * c
        x2end .*= c
        return
    end
end


function soc_proj_const_scale_all!(slices)
	@threads for s in slices
		soc_proj_const_scale!(s)
    end
end


function soc_proj_const_scale_all!(solm::CuMatrix)
    x2ends = @view solm[2:end, :]
    ts = @view solm[1, :]
    nrm_xs = sqrt.(mapreduce(t -> t^2, +, x2ends, dims=1)) |> vec
    solm[:, nrm_xs .<= -ts] .= 0
    b = nrm_xs .> ts
    ts_b = ts[b]  # No view as cs_b line below will error
    nrm_xs_b = @view nrm_xs[b]
    cs_b = (1.f0 .+ ts_b ./ nrm_xs_b) ./ 2
    # Note that on the GPU you want to use 32-bit floats. 
    # (If solm uses Float64, 1.f0 will be promoted to Float64 anyway.)
    solm[1, b] .= nrm_xs_b .* cs_b
    solm[2:end, b] .*= cs_b'  # e.g. 999 x 123 .* 1 x 123
end

function soc_proj_const_scale_all!(data, slice_lengths)
	# data::CuVector{Float32}, slice_lengths::CuVector{Int32}
	data_sq = data .^ 2
	cs_data_sq = accumulate(+, data_sq)
	slice_ends = accumulate(+, slice_lengths)  # slices[i][end] == data[slice_ends[i]]
	slice_starts = CuVector{Int32}(undef, length(slice_lengths))  # slices[i][begin] == data[slice_starts[i]]
	CUDA.@allowscalar slice_starts[1] = 1i32
	@inbounds slice_starts[2:end] .= @view(slice_ends[1:end-1]) .+ 1i32
	
	norms = CuVector{Float32}(undef, length(slice_lengths))  # When we're done: norms[i] == norm(slices[i][2:end]) (possibly some degree of numerical errors though)
	@inbounds CUDA.@allowscalar norms[1] = sqrt.(cs_data_sq[slice_ends[1]] - data_sq[1])
	@inbounds @views norms[2:end] .= sqrt.(cs_data_sq[slice_ends[2:end]] .- cs_data_sq[slice_ends[1:end-1]] .- data_sq[slice_starts[2:end]])
	
	
	data_idx_is_slice_start = CUDA.zeros(Int32, length(data))  
	@inbounds data_idx_is_slice_start[slice_starts] .= 1i32
	data_slice_indices = accumulate(+, data_idx_is_slice_start)  # If data index i belongs to slice j, we have data_slice_indices[i] == j
	
	@inbounds ts = @view data[slice_starts]
	
	# Find optimal number of threads per block
	kernel = @cuda launch=false soc_proj_const_scale_kernel!(data, ts, norms, data_idx_is_slice_start, data_slice_indices)
    config = launch_configuration(kernel.fun)
    @cuda blocks=max(cld(length(data), config.threads), config.blocks) threads=config.threads soc_proj_const_scale_kernel!(data, ts, norms, data_idx_is_slice_start, data_slice_indices)
	
	# The garbage collector is a bit too lazy in our tight benchmark loop, so clean up ourselves.
	CUDA.unsafe_free!(data_sq)
	CUDA.unsafe_free!(cs_data_sq)
	CUDA.unsafe_free!(slice_ends)
	CUDA.unsafe_free!(slice_starts)
	CUDA.unsafe_free!(norms)
	CUDA.unsafe_free!(data_idx_is_slice_start)
	CUDA.unsafe_free!(data_slice_indices)
end

function soc_proj_const_scale_kernel!(data, ts, norms, data_idx_is_slice_start, data_slice_indices) 
	data_idx = threadIdx().x + (blockIdx().x - 1i32) * blockDim().x
	while data_idx <= length(data)
		@inbounds slice_idx = data_slice_indices[data_idx]
		nrm = norms[slice_idx]
		inv_nrm = 1.f0 / nrm
		@inbounds t = ts[slice_idx]
		
		if nrm <= -t
			@inbounds data[data_idx] = 0
		elseif nrm > t
			c = (1.f0 + t * inv_nrm) * 0.5f0
			if data_idx_is_slice_start[data_idx] == 1i32
				@inbounds data[data_idx] = nrm * c
			else
				@inbounds data[data_idx] *= c
			end
		end
		
		data_idx += blockDim().x * gridDim().x
	end
	return
end


function soc_proj_const_scale_all!(data, slice_lengths, slice_starts)
	# data::CuVector{Float32}, slice_lengths::CuVector{Int32}, slice_starts::CuVector{Int32}
	nb_threads = 256
    @cuda blocks=cld(length(data), nb_threads) threads=nb_threads soc_proj_kernel(data, slice_starts, slice_lengths, length(slice_starts))
end


function nrm2(x)
  # calculate the norm of the vector x
  sq_norm = 0.0f0
  for j in eachindex(x)  # or j = 1:length(x)
    @inbounds sq_norm += x[j] * x[j]
  end
  return sqrt(sq_norm)
end


function soc_proj(sol)
  @inbounds norm = nrm2(@view(sol[2:end]))
  @inbounds t = sol[1]

  if norm <= -t
    for j in eachindex(sol)
      @inbounds sol[j] = 0.0f0;
    end
  elseif norm <= t
    # Do nothing, continue with the next iteration
  else
    c = (1.0f0 + t / norm) / 2.0f0;
    @inbounds sol[1] = norm * c;
    for j = 2:length(sol)
      @inbounds sol[j] = sol[j] * c;
    end
  end
end


function soc_proj_kernel(arr, gpu_head_start, ns, blkNum)
  i = (blockIdx().x - 1i32) * blockDim().x + threadIdx().x;
  # one thread per cone projection
  if (i <= blkNum)
    @inbounds n = ns[i]
    @inbounds @views sol = (arr[gpu_head_start[i] : gpu_head_start[i] + n - 1i32])
    soc_proj(sol)
  end
  return 
end


device = CUDA.device()
println("GPU Device Name: ", CUDA.name(device))
println("GPU Compute Capability: ", CUDA.capability(device))
println("\n")

println("UNIFORM DATA: 10_000 slices of length 10_000")
println("-" ^ 44)
println("GPU v1 (@threads, kernel launch per slice)")
display(@benchmark (CUDA.@sync soc_proj_const_scale_all!(slices)) setup=(gen_data_tuple=generate_data(10_000, 10_000, 10_000); slices=gen_data_tuple[2]) teardown=(CUDA.unsafe_free!.(gen_data_tuple[i] for i in (1, 3, 4))))
println("\n\nGPU v2 (stack, vectorised code)")
display(@benchmark (CUDA.@sync soc_proj_const_scale_all!(stacked)) setup=(gen_data_tuple=generate_data(10_000, 10_000, 10_000); slices=gen_data_tuple[2]; stacked=stack(slices)) teardown=(CUDA.unsafe_free!.(gen_data_tuple[i] for i in (1, 3, 4)); CUDA.unsafe_free!(stacked)))
println("\n\nGPU v3 (GPU threads corresponding to data elements)")
display(@benchmark (CUDA.@sync soc_proj_const_scale_all!(data, slice_lengths)) setup=(gen_data_tuple=generate_data(10_000, 10_000, 10_000); data=gen_data_tuple[1]; slice_lengths=gen_data_tuple[3]) teardown=(CUDA.unsafe_free!.(gen_data_tuple[i] for i in (1, 3, 4))))
println("\n\nGPU v4 (GPU threads corresponding to slices)")
display(@benchmark (CUDA.@sync soc_proj_const_scale_all!(data, slice_lengths, slice_starts)) setup=(gen_data_tuple=generate_data(10_000, 10_000, 10_000); data=gen_data_tuple[1]; slice_lengths=gen_data_tuple[3]; slice_starts=gen_data_tuple[4]) teardown=(CUDA.unsafe_free!.(gen_data_tuple[i] for i in (1, 3, 4))))
println("\n\nCPU v1 (@threads, CPU threads corresponding to slices)")
display(@benchmark (CUDA.@sync soc_proj_const_scale_all!(slices)) setup=(gen_data_tuple=generate_data(10_000, 10_000, 10_000); slices=gen_data_tuple[6]) teardown=(CUDA.unsafe_free!.(gen_data_tuple[i] for i in (1, 3, 4))))
println("\n\n")

println("NON-UNIFORM DATA: 10_000 slices of length between 1 and 20_001")
println("-" ^ 62)
println("GPU v1 (@threads, kernel launch per slice)")
display(@benchmark (CUDA.@sync soc_proj_const_scale_all!(slices)) setup=(gen_data_tuple=generate_data(10_000, 1, 20_001); slices=gen_data_tuple[2]) teardown=(CUDA.unsafe_free!.(gen_data_tuple[i] for i in (1, 3, 4))))
println("\n\nGPU v3 (GPU threads corresponding to data elements)")
display(@benchmark (CUDA.@sync soc_proj_const_scale_all!(data, slice_lengths)) setup=(gen_data_tuple=generate_data(10_000, 1, 20_001); data=gen_data_tuple[1]; slice_lengths=gen_data_tuple[3]) teardown=(CUDA.unsafe_free!.(gen_data_tuple[i] for i in (1, 3, 4))))
println("\n\nGPU v4 (GPU threads corresponding to slices)")
display(@benchmark (CUDA.@sync soc_proj_const_scale_all!(data, slice_lengths, slice_starts)) setup=(gen_data_tuple=generate_data(10_000, 1, 20_001); data=gen_data_tuple[1]; slice_lengths=gen_data_tuple[3]; slice_starts=gen_data_tuple[4]) teardown=(CUDA.unsafe_free!.(gen_data_tuple[i] for i in (1, 3, 4))))
println("\n\nCPU v1 (@threads, CPU threads corresponding to slices)")
display(@benchmark (CUDA.@sync soc_proj_const_scale_all!(slices)) setup=(gen_data_tuple=generate_data(10_000, 1, 20_001); slices=gen_data_tuple[6]) teardown=(CUDA.unsafe_free!.(gen_data_tuple[i] for i in (1, 3, 4))))