I found the solution:
There is no need to call CUDA.CUFFT.cufftPlanMany directly. Batched FFTs are already supported by Julia's AbstractFFTs interface.
E.g., if N FFTs of size 128^3 need to be calculated, one simply copies the data of the 128^3 arrays into a 3+1 dimensional array (of size 128 × 128 × 128 × N): the first one to newarray[:,:,:,1], the second one to newarray[:,:,:,2], and so forth up to newarray[:,:,:,N].
Having assembled newarray, the next step is simply to perform the FFT along the first 3 dimensions:
fft(newarray,[1,2,3]). This automatically computes all N FFTs, since the fourth dimension is left untransformed and acts as the batch index.
One then extracts the N individual 128^3 arrays from the returned 3+1 dimensional array by indexing its last dimension, in the same way as they were copied in. This works on the GPU as well!
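
For reference, here is a minimal sketch of the steps above on the CPU. It assumes FFTW.jl as the backend and uses small 8^3 arrays instead of 128^3 so it runs quickly; the names n, arrays and results are only for illustration.

```julia
using FFTW  # assumed CPU backend; provides fft through the AbstractFFTs interface

n, N = 8, 4  # illustrative sizes; the case described above has n = 128
arrays = [rand(ComplexF64, n, n, n) for _ in 1:N]

# Copy the N three-dimensional arrays into one 3+1 dimensional array.
newarray = Array{ComplexF64}(undef, n, n, n, N)
for k in 1:N
    newarray[:, :, :, k] .= arrays[k]
end

# Transform along dimensions 1 to 3 only; dimension 4 is the batch index.
result = fft(newarray, 1:3)

# Extract the individual transformed arrays again.
results = [result[:, :, :, k] for k in 1:N]
```

Each results[k] should agree with fft(arrays[k]) computed on its own. On the GPU the same pattern should apply with newarray stored as a CuArray from CUDA.jl, although I have only written this sketch for the CPU.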