I’ve got something that works, but it needs to work faster. I’ve tested different methods outside of Julia that can do this much faster, and by comparison with peakflops.jl and a rudimentary calculation of how many operations are needed, this should be able to execute much faster
That something is a nearest neighbor interpolation summing method, here’s some pseudocode:
using CuArrays, CUDAnative, CUDAdrv function interpSum!(out,in,<variables used to index>) i = threadIdx().x j = blockIdx().x k = blockIdx().y for l <compute x1 and x2> for m = @inbounds out[i,j,k,m] += in[x1,x2,l,m] end end return nothing end @time CuArrays.@sync @cuda threads=512 blocks=(200,200) interpSum!(out,in,<variables used to index>)
the part I want to stress is:
@inbounds out[i,j,k,m] += in[x1,x2,l,m]
if I comment that out, the code takes ~50 ms to run, with it being used, code takes 1000 ms to run.
if I replace it with
@inbounds out[i,j,k,m] += 1, it takes 300 ms to run (barely acceptable)
Which tells me it’s a memory access problem, given that to the compiler, the indices x1 and x2 are random (and there aren’t any ways for me to make them ordered or predictable)
@inbounds map[i,j,k,m] += ldg(data,index) where I compute the 1 dimensional index, and it is exactly as slow (I’ve read in other forums that ldg is read only and should be faster, but here it is not)
I’ve also tried interpolation methods in arrayfire, I had trouble with the memory management, if I find that code again, I’ll come back here and post it.
Fundamentally, I’m interpolating and summing, so if there already exists a fast gpu interpolation method, that would solve my problem, and I’ve seen a lot of convincing benchmarks that indicate texture memory would work really well (and I’m pretty sure that’s what arrayfire uses, but it reallocates the memory every time you call interpolation method, so after a few calls I get a memory allocation error and I have to restart Julia)