I have two P5000s, and it did finish, but it required 16 GB of GPU RAM. The only option I can think of is unified memory. I know the ArrayFire library has support for unified memory, so I would look into it and see whether that has been implemented in Julia's ArrayFire.
If you use saved plans to do the transforms, there is less memory pressure. (It looks as if you were intending to do that at some point.) Also, putting the for loop in a function seems to give the garbage collector a better chance to clean up. With these changes I could run your problem on a small GPU, although Julia did grab all of its memory during the loop.
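A minimal sketch of what I mean, with placeholder names (`run_simulation!`, `kernel`, `nsteps` are illustrative, not from your code): the plans are created once, outside the loop, and the whole loop lives in a function.

```julia
using FFTW  # for GPU arrays the CUDA.jl CUFFT plans work the same way

function run_simulation!(x, kernel, nsteps)
    P  = plan_fft(x)    # forward plan, created once
    Pi = plan_ifft(x)   # inverse plan, created once
    for _ in 1:nsteps
        x = Pi * (kernel .* (P * x))   # reuse the saved plans every iteration
    end
    return x
end
```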
The problem is not strictly related to FFTs; just creating the temporaries for the `fftshift` calls leads to huge memory usage.
I just replaced your product expression with `ifftshift(cv.kernel .* fftshift(x))` and watched the memory status. On second thought, my last conclusion is probably wrong: garbage collection seems to run only when needed, and that works for the cases without repeated plan generation, but not with your original version.
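Incidentally, since `fftshift` is just a circular permutation, it commutes with elementwise multiplication, so you can shift the kernel once up front and drop both per-iteration shifts (and their temporaries) entirely:

```julia
using FFTW

# ifftshift(cv.kernel .* fftshift(x)) == ifftshift(cv.kernel) .* x,
# because permuting, multiplying, and permuting back elementwise is the
# same as multiplying by the inversely permuted kernel.
shifted_kernel = ifftshift(cv.kernel)   # computed once, outside the loop

# inside the loop, instead of ifftshift(cv.kernel .* fftshift(x)):
y = shifted_kernel .* x
```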
I don’t really understand that question…
You plan an in-place FFT and preallocate a buffer for it, so you can have allocation-free hot loops, like the one in `poisson_solve`; calling `poisson_solve` should have essentially zero allocations.
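A sketch of that pattern, not your actual code: `Solver`, `poisson_solve!`, and `invlap` are illustrative names, and the Fourier-space multiplier is assumed to be precomputed by the caller. The plans and the buffer are built once; after that, each solve reuses them.

```julia
using FFTW

struct Solver{P1,P2,A,B}
    fwd::P1        # in-place forward plan
    inv::P2        # in-place inverse plan
    buf::A         # preallocated complex work buffer
    invlap::B      # precomputed Fourier-space multiplier (e.g. inverse Laplacian)
end

function Solver(invlap)
    buf = zeros(ComplexF64, size(invlap))
    Solver(plan_fft!(buf), plan_ifft!(buf), buf, invlap)
end

function poisson_solve!(s::Solver, rhs)
    s.buf .= rhs          # copy input into the work buffer
    s.fwd * s.buf         # in-place FFT: no allocation
    s.buf .*= s.invlap    # multiply in Fourier space
    s.inv * s.buf         # in-place inverse FFT
    rhs .= real.(s.buf)   # overwrite rhs with the result
    return rhs
end
```

Because `plan_fft!` transforms its argument in place, applying `s.fwd * s.buf` touches no new GPU or CPU memory, which is what keeps the hot loop allocation-free.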