GPU code has a high amount of CPU allocations?


More of a “meta” question.

I need some pointers on this. I have some Julia code which allocates zero times on the GPU, other than the initial construction of the arrays, memory, etc.

But I have such a high number of CPU allocations: 471 million allocations and 45 GiB after 150,000 calculation steps.

So each calculation step makes about 3140 allocations and 0.3 MB?

Isn’t that a bit excessive when the calculation is performed on the GPU?

Kind regards

Does it hurt performance?

Also, which GPU back-end are you talking about?

I am talking about CUDA, my bad for not making it clear initially.

To be honest, I am not sure if it hurts performance. It just “scares” me a bit to see millions of CPU allocs and so much memory being used when running on the GPU only. I think I found part of the issue, though.

I was launching GPU functions with `@cuda` directly inside the inner loop, multiple times per iteration.

This meant that it would reconfigure and recompile the kernel inside the hot loop on every iteration. Instead I should compile it once and then reuse the compiled kernel function, i.e. have the outer function `return kernel` and use that `kernel` in my hot loop.
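A minimal sketch of the two launch patterns (the `gpu_add!` kernel, array sizes, and loop counts are hypothetical stand-ins, not code from this thread):

```julia
using CUDA

# Toy element-wise kernel, standing in for the real GPU function
function gpu_add!(y, x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] += x[i]
    end
    return nothing
end

x = CUDA.rand(2^20)
y = CUDA.rand(2^20)

# Anti-pattern: the @cuda macro re-resolves the launch on every iteration
for step in 1:150_000
    @cuda threads=256 blocks=cld(length(y), 256) gpu_add!(y, x)
end

# Better: compile once with launch=false, then reuse the kernel object
kernel  = @cuda launch=false gpu_add!(y, x)
config  = launch_configuration(kernel.fun)
threads = min(length(y), config.threads)
blocks  = cld(length(y), threads)
for step in 1:150_000
    kernel(y, x; threads, blocks)
end
```

The second loop pays the configuration cost once up front; only the lightweight launch remains in the hot loop.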

This proved to increase speed by 20% and reduce allocs and memory a bit.

This is probably what was meant by “it is smarter to have all kernel calls happening inside of one `CUDA.@sync`”, which I believe I read somewhere.

Here I share an example: `Fρ` is the kernel returned by a top-level function, and `dRdtVersion0_Kernel` is that top-level function, similar to the `GPU_ADD` function from above.

By using the precompiled kernel `Fρ` directly whenever I needed the same GPU function more than once, I could get a 20% speed-up.

When I fix this in a few days, I think I should see lower allocs, faster code and a happier me :slight_smile:

Kind regards

I’ve had similar observations to what @Ahmed_Salih is describing here, and have been meaning to post some questions along these lines myself.

  • What are these allocations? In my small tests I found a kernel call (at least the kernel I was working with) to allocate approximately 4.5 KB of host memory, and each subsequent call to allocate slightly less than that on top. It seems like a more or less linear increase per call. It isn’t much, but when you scale up an iterative process it doesn’t take long to become a significant number.
  • Would it be possible to reuse the host memory a kernel call allocates for the similar allocations in subsequent kernel calls?
  • Are these allocations handled by the GC, such that it won’t be an issue if the benchmarked allocations went above the system memory? I admit I haven’t attempted to use up all the memory on my machine to test it.
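One way to measure the host-side bytes per launch, for anyone wanting to reproduce this (a sketch; the `gpu_add!` kernel and sizes are hypothetical stand-ins):

```julia
using CUDA

# Toy kernel standing in for the real one
function gpu_add!(y, x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    i <= length(y) && (@inbounds y[i] += x[i])
    return nothing
end

x = CUDA.rand(1024)
y = CUDA.rand(1024)
k = @cuda launch=false gpu_add!(y, x)
k(y, x; threads=256, blocks=4)          # warm up so compilation isn't counted

# Base.@allocated reports CPU (host) bytes allocated by the expression
host_bytes = @allocated k(y, x; threads=256, blocks=4)
println("host bytes per launch: ", host_bytes)
```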

I can report that using the inner kernel function instead of the outer kernel function, as described in my post above, reduced memory allocation by ~10–15%, down to 35.8 GB. The 471 M CPU allocations unfortunately stayed the same.

Parameters need to be boxed, and the callable kernel object needs to be allocated.

Maybe, but it would be complex, as we’d need to respect dependencies between kernels (essentially re-implementing CUDA’s stream dependency model). I’m not sure that’s worth it.

Yes, they are GC-managed.

All that said, there’s probably quite some low-hanging fruit left, so consider using Cthulhu to hunt type instabilities in CUDA.jl, or a profiler to find hotspots. It’s all just Julia code :slightly_smiling_face:
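For example, the host-side launch path can be profiled and inspected along these lines (a sketch; the `f!` kernel and loop counts are hypothetical):

```julia
using CUDA, Profile, Cthulhu

# Tiny stand-in kernel and a precompiled kernel object
f!(y, x) = (i = threadIdx().x; i <= length(y) && (@inbounds y[i] += x[i]); nothing)
x, y = CUDA.rand(256), CUDA.rand(256)
kernel = @cuda launch=false f!(y, x)

hot_loop!(kernel, y, x, n) = (for _ in 1:n; kernel(y, x; threads=256, blocks=1); end)

# Find CPU hotspots in the host-side launch code:
Profile.clear()
@profile hot_loop!(kernel, y, x, 10_000)
Profile.print()

# Interactively descend into the call to hunt type instabilities
# (unstable types are highlighted in Cthulhu's output):
@descend hot_loop!(kernel, y, x, 1)
```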

1 Like

What does it mean that the parameters need to be boxed?

Thank you

Memory needs to be allocated to put the kernel parameters in.

1 Like