I want to build a histogram from a large matrix of floats, for which a matrix of integers of the same size identifies the bin to which each float must be allocated. It can also be seen as a groupby aggregation, but with a different index for each column. The scatter_add! function developed her…

maleadt: You’re right though that there is something wrong with atomics and shared memory, I only tested them on global memory. I’ll file a bug. Should be fixed with "Avoid address space casts." (Pull Request #642, JuliaGPU/CUDAnative.jl).

I can’t add much, but a quick Google search uncovered this, which might help: "GPU Pro Tip: Fast Histograms Using Shared Atomics on Maxwell" | NVIDIA... Histograms are an important data representation with many applications in computer vision, data analytics and medical imaging. A histogra…

Thanks for pointing in a promising direction. I’ll roll up my sleeves to further explore the 2-step aggregation mentioned in the post. Might call again for help later :)

@maleadt Initial explorations of shared memory for aggregation resulted in poor performance relative to the naive atomic_add! on global memory. I followed the example presented on slide 16 as a toy example, which is a simple dot product. Below, the atomic_add performs roughly 2x faster than the s…

jeremiedb: "Or are there inefficiencies in its implementation?" You should use @inbounds if you know it to be safe – bounds checking is very expensive. jeremiedb: "CUDAnative.atomic_add!(pointer(shared, tid), shared[Cuint(tid+i)]) but resulted in CUDA error: an illegal memo…

Thanks for the tips. Adding @inbounds, fixing the type stability of i through ÷, and reducing the number of threads per block resulted in a 4x speedup compared to the atomic_add! version. I remain perplexed, however, that the operation @inbounds shared[tid] += shared[tid+i] works but the following does not: CUDAn…

jeremiedb: "using CUDAnative; shared = CUDAnative.@cuStaticSharedMem(Float32, 8); fill!(shared, 0); CUDAnative.atomic_add!(pointer(shared, 1), 2.2f0)" You can’t execute GPU code on the CPU, that will crash for sure. You’re right though that there is something wrong with atomics and shared…

Thanks for the quick fix, very appreciated!

Is it possible that an issue remains with atomic operations in shared memory following the fix in #642 in CUDAnative? Using a variation on the test that was introduced with the fix, the following works: using CUDA; function kernel3(x) tid = threadIdx().x; shared = @cuStaticS…

Further reduced to:

    using CUDA

    function kernel()
        tid = threadIdx().x
        shared = @cuStaticSharedMem(Float32, 4)
        CUDA.atomic_add!(pointer(shared, tid), shared[tid + 2])
        sync_threads()
        CUDA.atomic_add!(pointer(shared, tid), shared[tid + 2])
        return
    end

    function main()
        @cuda …

Kernel for building histogram on GPU


maleadt February 5, 2020, 7:06am 3

Doing global atomic additions from every thread, like that kernel does, is going to be very expensive (read: slow). The link above suggests doing so in shared memory first, which indeed is going to improve performance by a lot.
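To make the two-step idea concrete, here is a minimal sketch of what such a kernel could look like. It is not the kernel from this thread: the name `hist_kernel!`, the fixed upper bound of 256 bins, and the `Float32` element type are all assumptions made for illustration, and it assumes a recent CUDA.jl where `CuStaticSharedArray` and `CUDA.@atomic` are available (older versions used `@cuStaticSharedMem`). Each block first accumulates into its own shared-memory histogram, then issues only one global atomic per bin.

```julia
using CUDA

# Hypothetical kernel: h[b] accumulates the sum of x[i] over all i with bins[i] == b.
# Assumes length(h) <= 256 and every bins[i] lies in 1:length(h).
function hist_kernel!(h, x, bins)
    nbins = length(h)
    tid = threadIdx().x
    # Per-block partial histogram; 256 is an assumed compile-time upper bound.
    shared = CuStaticSharedArray(Float32, 256)

    # Step 0: zero the shared buffer cooperatively.
    i = tid
    while i <= nbins
        shared[i] = 0f0
        i += blockDim().x
    end
    sync_threads()

    # Step 1: grid-stride loop, atomics hit shared memory only.
    idx = (blockIdx().x - 1) * blockDim().x + tid
    stride = blockDim().x * gridDim().x
    while idx <= length(x)
        b = bins[idx]
        v = x[idx]
        CUDA.@atomic shared[b] += v
        idx += stride
    end
    sync_threads()

    # Step 2: one global atomic per bin per block.
    i = tid
    while i <= nbins
        v = shared[i]
        CUDA.@atomic h[i] += v
        i += blockDim().x
    end
    return nothing
end

# Example launch with made-up sizes.
x = CUDA.rand(Float32, 100_000)
bins = CuArray(rand(Int32(1):Int32(64), 100_000))
h = CUDA.zeros(Float32, 64)
@cuda threads=256 blocks=32 hist_kernel!(h, x, bins)
```

Bounds checks are left on here for safety; once the index ranges are known to be valid, adding `@inbounds` to the plain loads and stores is the kind of change reported above to matter for speed. For the matrix case with a different index per column, the same pattern would presumably be applied per column, for instance by mapping `blockIdx().y` to the column, but that is not shown here.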
