Cub library functions in cuda.jl

nethaji8 · October 9, 2022, 11:51am

Hi everyone,
I am trying to implement block-level reduction operations in the Cuda Julia kernel.
Does Julia has “cub” kind of device-level functions…?
How can I use the cub library device intrinsics for the following Julia code…?

For example, consider the following example of adding an array of 64 elements in Cuda c using cub. How does the corresponding code look like in Cuda Julia below…?

#include <cub/cub.cuh>
#include <cuda.h>
#include <vector>
#include <iostream>
#include <numeric>
#include <cub/cub.cuh>

using namespace std;
using namespace cub;

const unsigned int N=64;

__global__ void kernel(int *array, int *solution)
{
    // Specialize BlockReduce type for our thread block where N/2 is number of threads launched
    typedef BlockReduce<int, N/2, BLOCK_REDUCE_RAKING> BlockReduceT;
    // Shared memory
    __shared__ typename BlockReduceT::TempStorage temp_storage;
    // Per-thread tile data
    int number_of_items_per_thread = 2; // (N/number of threads)
    int data[2];
    LoadDirectStriped<32>(threadIdx.x, array, data);

    // Compute sum
    int aggregate = BlockReduceT(temp_storage).Sum(data);

    // Store aggregate and elapsed clocks
    if (threadIdx.x == 0)
    {
        *solution = aggregate;
    }

}

int main()
{
    vector<int> input(N); 
    iota (std::begin(input), std::end(input), 0);

    for (auto i : input)
        std::cout << '\t' << i;
    std::cout << '\n';

    int *device, *sol;
    cudaMalloc((void**)&device,N*sizeof(int));
    cudaMallocManaged((void**)&sol,sizeof(int));

    cudaMemcpy(device,&input[0],N*sizeof(int),cudaMemcpyHostToDevice);
    kernel <<<1,N/2>>>(device, sol);

    int i = cudaDeviceSynchronize();

    cout << "solution\t" << *sol << "\n";
 
}

using CUDA

N = 64

function kernel(input, sol)

    return
end

function main()

    input = CuArray(collect(0:N-1))

    println(input)

    sol = 0

    @cuda blocks =1 threads=32 kernel(input, sol)
    synchronize()
    println(sol)

end

main()

maleadt · October 10, 2022, 10:55am

CUB is a C++ header library so not usable in Julia.

nethaji8 · October 10, 2022, 11:27am

Dear @maleadt , is there any workaround to use block-level, warp-level reduction intrinsics (CUB) in the Julia kernel…?

maleadt · October 10, 2022, 1:28pm

Implement it in Julia. For example, CUDA.jl has the (unexported) block_reduce function that may be useful.

nethaji8 · October 10, 2022, 2:36pm

Dear @maleadt , thank you for your answers can you please give one last response…?

Can we use cooperative_groups functions in Julia(directly or by using wrappers) and perform the reduction algorithm instead…?
I see that we can call Cuda C kernels from Julia by wrapping the code, can’t we use the same technique to call CUB library functions…?
Lastly, I’m unable to find the block_reduce function in the documentation, can you please refer me to the link…?

maleadt · October 11, 2022, 5:02pm

Yes, but few functions are wrapped: https://github.com/JuliaGPU/CUDA.jl/blob/master/src/device/intrinsics/cooperative_groups.jl. You’ll probably need to add more.

CUB doesn’t expose kernels, but tools for use within a kernel. That’s very different.

There’s nothing to link to, it’s an unexported function for internal use. You could probably repurpose it though, or even create a PR to make it more easily resusable / documented / exported.

Topic		Replies	Views
Using mapreduce on GPU with CUDA.jl GPU question , cuda	1	686	December 25, 2023
One example from `GPU programming in Julia \| Workshop \| JuliaCon 2021` GPU question , gpu	0	362	April 5, 2022
Tutorial on GPU programming on julia GPU	5	5860	March 19, 2019
Base function in Cuda kernels General Usage cudanative , cuda	8	3193	March 15, 2019
How do I to transform mapreduce function to work well with CUDA? GPU	5	1529	May 14, 2021

Cub library functions in cuda.jl

Related topics