My code runs a function that needs to do a crap ton of operations to fill a very long vector. Each position of the vector is independent of the next. I coded this both on the CPU and on the GPU, the latter with a custom kernel.
I need to run this function hundreds of times and for large cases each call can take a minute or two. The GPU version currently runs 30% or so faster than the CPU version.
I’m thinking I can cut the time almost in half by combining the CPU and GPU. So maybe at each call I initialize an empty vector on CPU and GPU, then start running the GPU from the top and the CPU from the bottom and when they meet, I can combine the vectors and stop.
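The "start from both ends and stop when they meet" idea can be sketched with two tasks that claim chunks of indices from opposite ends of the vector under a lock. This is only a minimal sketch under stated assumptions: both workers run on the CPU here, the task that would drive the GPU is a stand-in, and `compute_target`, `Cursor`, `claim_front!`, `claim_back!`, `worker!` and `fill_from_both_ends` are hypothetical names, not part of any library.

```julia
using Base.Threads: @spawn

# Hypothetical stand-in for the expensive per-target computation.
compute_target(i) = sum(1.0 / (i + j) for j in 1:1_000)

# Shared state: `lo` is the next unclaimed index from the top,
# `hi` is the next unclaimed index from the bottom.
mutable struct Cursor
    lo::Int
    hi::Int
    lock::ReentrantLock
end

# Claim up to `chunk` indices from the top; returns an empty range once the workers have met.
function claim_front!(c::Cursor, chunk)
    lock(c.lock) do
        stop = min(c.lo + chunk - 1, c.hi)
        r = c.lo:stop
        c.lo = max(c.lo, stop + 1)
        return r
    end
end

# Claim up to `chunk` indices from the bottom; returns an empty range once the workers have met.
function claim_back!(c::Cursor, chunk)
    lock(c.lock) do
        start = max(c.hi - chunk + 1, c.lo)
        r = start:c.hi
        c.hi = min(c.hi, start - 1)
        return r
    end
end

# Keep claiming chunks and filling them until nothing is left.
function worker!(out, c, claim!, chunk)
    while true
        r = claim!(c, chunk)
        isempty(r) && break
        for i in r
            out[i] = compute_target(i)
        end
    end
    return nothing
end

function fill_from_both_ends(n; gpu_chunk = 256, cpu_chunk = 64)
    out = zeros(n)
    c = Cursor(1, n, ReentrantLock())
    # Assumption: in a real setup this task would launch a GPU kernel per claimed
    # chunk and copy that slice back; here it is just a second CPU task.
    gpu_like = @spawn worker!(out, c, claim_front!, gpu_chunk)
    cpu_like = @spawn worker!(out, c, claim_back!, cpu_chunk)
    wait(gpu_like)
    wait(cpu_like)
    return out
end

out = fill_from_both_ends(10_000)
```

Because each worker only takes a new chunk when it has finished the previous one, the faster device ends up doing more of the vector without the split having to be guessed in advance. The chunk sizes are arbitrary here; a real GPU would likely want much larger chunks to amortize launch and copy overhead.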
Has anyone done something along these lines? Maybe I’m overcomplicating things and there’s a way to call my kernel on CPU and GPU cores more directly.
Thanks a lot!
It’s hard to say without more detail, but if all your computations are fully independent and your GPU is (only) 30% faster than the CPU, then I suspect your GPU code is just insufficiently taking advantage of the hardware. Additionally, the CPU+GPU scenario you describe seems like it might require a unified memory architecture to be worthwhile, as sending data to and fetching data from the GPU is not a particularly fast transaction. Can you provide an example of the vector-filling kernel, what CPU parallelization strategies you’ve tried, and tell us what sort of hardware you are using to compute?
I’m quite certain my GPU implementation is not great! =]
The memory doesn’t seem to be a major issue. In the beginning of my function I copy a bunch of arrays to the GPU and copy the array I build back at the end. This copying process seems very fast compared to the rest.
My CPU is a 20 core Xeon, my GPU is a Quadro RTX 4000.
My kernel is a Biot-Savart solution to a problem with hundreds of thousands of edges. So for every edge, I need to do some math on every other edge and sum their effects. This is what goes in the array I construct.
Does that give you some idea of what I’m doing, or is it still too vague?
Thanks a lot!
Oh, and regarding the parallel strategy, it is not great. To avoid race conditions, I have to start at the target point and then look at every edge and compute their summed effects to fill the array. So I parallelize the targets, both on the CPU (with a simple @threads) and the GPU (with a kernel call).
If I understand, you have three nested loops,
    for i in 1:length(out)
        for j in 1:length(edges)
            s = 0.0
            for k in 1:length(edges)
                if k != j
                    s += doSomeMath(edges[j], edges[k])
                end
            end
            out[i] = s
        end
    end
Where the outer loop is parallelized using Threads or a kernel call? And the inner loops range over 10^5 entries each? One way to diagnose the slowdown would be to replace the interior work with something trivial, to figure out whether your analogue of the doSomeMath function in my sketch is the bottleneck. It would be easier to diagnose what’s happening if we had some example code, a minimal working example. From my naive reconstruction of what you’ve described, it could be several issues related to memory access patterns or simply A Big Problem.
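The "replace the interior work with something trivial" check could look roughly like the sketch below. It assumes a two-loop structure (targets outside, edges inside) and uses made-up stand-ins: `real_math`, `trivial_math` and `fill_answer` are hypothetical names, and the edges are plain random numbers rather than real geometry.

```julia
# Hypothetical stand-ins: `real_math` mimics an expensive interaction, while
# `trivial_math` keeps the same loop structure and memory reads with almost no arithmetic.
real_math(a, b) = a == b ? 0.0 : sin(a) * cos(b) / (a - b)^2
trivial_math(a, b) = a + b

# Same threaded loop for both cases; only the interior function `f` changes.
function fill_answer(f, edges)
    answer = zeros(length(edges))
    Threads.@threads for i in eachindex(edges)
        s = 0.0
        for j in eachindex(edges)
            s += f(edges[i], edges[j])
        end
        answer[i] = s
    end
    return answer
end

edges = rand(2_000)

# Warm up once so compilation time is not included in the measurements.
fill_answer(real_math, edges)
fill_answer(trivial_math, edges)

t_real = @elapsed fill_answer(real_math, edges)
t_trivial = @elapsed fill_answer(trivial_math, edges)
println("real math: ", t_real, " s, trivial math: ", t_trivial, " s")
```

If the two timings are close, the loop and memory access pattern dominate and the math is not the main cost; if the trivial version is far faster, the interior math is where the time goes.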
Yeah, something like that. Let’s say:
    function BiotSavartAndOtherStuff(target,edges)
        ans=0.0
        for j in 1:N
            ans += mathAndMoreMath(target,edges[j])
        end
        return ans
    end

    N=100000
    answer=zeros(N)
    @threads for i in 1:N
        answer[i]=BiotSavartAndOtherStuff(edges[i],edges)
    end
And in the GPU:
    function kernelfunction!(answerGPU,edgesGPU)
        i=(blockIdx().x - 1) * blockDim().x + threadIdx().x
        if i <= N
            answerGPU[i] = BiotSavartAndOtherStuff(edgesGPU[i],edgesGPU)
        end
        return nothing;
    end

    edgesGPU=CuArray(edges)
    answerGPU=CUDA.zeros(Float64,N)
    setupkernel,config,threads,blocks
    CUDA.@sync kernel(answerGPU,edgesGPU; threads=threads, blocks=blocks);
    answer=Array(answerGPU)
I do think there’s a lot I can do to improve memory access and so on. There are also other methods to speed up these types of calculations (fast multipole). It is indeed A Big Problem, as there’s simply a lot of math and edges. Hence, I was just wondering if there was a simple way to leverage both my CPU and GPU, as when one is working the other one isn’t doing much and they have similar performance, so using both would cut my time almost in half.
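As a rough check on the "cut my time in half" estimate, the best-case gain from a static split can be worked out from the relative speeds. The sketch below assumes that "30% faster" means the GPU gets through 1.3 times as many targets per second as the CPU, that the two devices overlap perfectly, and that transfer costs are negligible; all variable names are made up for illustration.

```julia
# Assumption: "30% faster" is read as a 1.3x throughput ratio (targets per second).
cpu_rate = 1.0
gpu_rate = 1.3

# Ideal static split: work in proportion to speed, so both devices finish together.
gpu_share = gpu_rate / (cpu_rate + gpu_rate)

N = 100_000
n_gpu = round(Int, gpu_share * N)   # targets 1:n_gpu would go to the GPU
n_cpu = N - n_gpu                   # targets n_gpu+1:N would go to the CPU

# Best-case run time relative to running everything on the GPU alone:
# (N / (cpu_rate + gpu_rate)) / (N / gpu_rate) = gpu_rate / (cpu_rate + gpu_rate)
time_vs_gpu_only = gpu_rate / (cpu_rate + gpu_rate)

println("GPU share: ", round(100 * gpu_share; digits = 1), "% (", n_gpu, " targets)")
println("time relative to GPU only: ", round(time_vs_gpu_only; digits = 3))
```

Under these assumptions the GPU takes about 56.5% of the targets and the combined run needs about 0.565 of the GPU-only time, so roughly 43% is saved in the ideal case. Any scheduling or copy overhead eats into that.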
Any advice is appreciated!
Stupid question: did you measure with @btime and check with a profiler? I’m only halfway qualified to talk about the CPU part of the question, but would suspect @tturbo should do better for the CPU, see this thread for example.
And is an exemplary MWE really out of reach?
I did profile this quite a bit on the CPU side. For these big and embarrassingly parallel computations, @threads works quite well. I did notice some loops in other parts of the code that could likely benefit from Polyester, but they are not driving the cost right now.
I checked KernelAbstractions’ status before posting and from the docs it seems a kernel can be launched on the CPU or the GPU, but not both.
An MWE is not really doable, as the math part is enormous, as is the input and the time it takes for it to get expensive. I’d have to spend a lot of time rewriting the whole thing so that it could be an MWE. In any case, I’m not asking for help optimizing the code (although that would be fantastic), I was just trying to figure out if there was an easy way to not waste my CPU while my GPU was busy. I guess the answer is no?
Thanks, but you don’t tell us about the allocation side of things: if there’s something going on I’d suspect garbage collection to play a role at some point.
I don’t know. But if your code is embarrassingly parallel (I learned about this today, and I assume it is ;) I see two ways out:
- using GPU (then your GPU code probably should be a bit faster than it is currently) and asking for help about the CUDA kernels
- using Distributed, although here data independence could play a role
Yep, I optimized allocations a lot. There’s no GC issue. Again, it’s just a lot of math on a lot of points. =]
I’m contemplating Distributed and running on several CPUs or GPUs. I haven’t found too many examples out there, but maybe that’s the road I’ll have to take.
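For reference, a minimal sketch of the Distributed route for this kind of target-parallel sum might look like the following. It assumes two local worker processes and a toy interaction: `pair_effect`, `fill_chunk` and `distributed_fill` are hypothetical names, and the edges are plain numbers standing in for the real edge data.

```julia
using Distributed

# Assumption: two local worker processes; on a cluster these could be remote machines.
addprocs(2)

@everywhere begin
    # Hypothetical stand-in for the real per-pair math.
    pair_effect(target, edge) = target == edge ? 0.0 : 1.0 / (target - edge)^2

    # Summed effect of every edge on each target index in `targets`.
    fill_chunk(targets, edges) =
        [sum(pair_effect(edges[i], e) for e in edges) for i in targets]
end

function distributed_fill(edges)
    n = length(edges)
    # One contiguous block of target indices per worker.
    chunks = collect(Iterators.partition(1:n, cld(n, nworkers())))
    # Each call ships a copy of `edges` to a worker and returns that block's results.
    parts = pmap(chunk -> fill_chunk(chunk, edges), chunks)
    return reduce(vcat, parts)
end

edges = collect(1.0:500.0)
answer = distributed_fill(edges)
```

Since every target needs all of the edges, each worker receives the full `edges` array; only the output is split. That copy is the "data independence" cost mentioned above, and it is paid once per chunk in this sketch.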