Code snippet for multi-GPU FFT

Hi,

I would like to perform a 3D FFT on multiple GPUs. This can be done in CUDA, as explained here. However, the required functions do not seem to be ported to CuArrays. Has anybody done this in Julia?

A second question concerns vectorized operations. If my array x is spread across two GPUs, I would still like to be able to write things like x .= f.(x). But this might be too difficult to do for now…

Best,

Indeed, cufftXt is not wrapped, so we don’t have convenient multi-GPU FFT functionality right now. Adding wrappers (just the wrappers, not the high-level functionality to make this a proper Julian API) is a pretty easy task though, so if you need this functionality you could look into providing the necessary wrappers yourself. See e.g. the pull request “Initial work towards cublasXt wrappers” by kshyatt (JuliaGPU/CuArrays.jl, Pull Request #294), where similar cublasXt functionality was added. You can also open an issue on the CuArrays repo.
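To give an idea of what "just the wrappers" could look like, here is a minimal sketch using plain ccall. The C functions cufftCreate, cufftXtSetGPUs and cufftDestroy are part of cuFFT; everything else is an assumption for illustration: the library name "libcufft" being on the loader path, the helper check_cufft, and the Julia-side signatures. This is not how CuArrays organizes its wrappers, and it has not been run against real hardware.

```julia
# Assumption: the cuFFT shared library can be found under this name.
const libcufft = "libcufft"

# In cufft.h, cufftHandle is a typedef for int and cufftResult is an enum.
const cufftHandle = Cint
const CUFFT_SUCCESS = Cint(0)

# Hypothetical helper: turn a non-zero cufftResult into a Julia error.
function check_cufft(status::Cint)
    status == CUFFT_SUCCESS || error("cuFFT call failed with status $status")
    return nothing
end

# Wraps: cufftResult cufftCreate(cufftHandle *plan)
function cufftCreate()
    handle = Ref{cufftHandle}()
    check_cufft(ccall((:cufftCreate, libcufft), Cint,
                      (Ptr{cufftHandle},), handle))
    return handle[]
end

# Wraps: cufftResult cufftXtSetGPUs(cufftHandle plan, int nGPUs, int *whichGPUs)
function cufftXtSetGPUs(plan::cufftHandle, gpus::Vector{Cint})
    check_cufft(ccall((:cufftXtSetGPUs, libcufft), Cint,
                      (cufftHandle, Cint, Ptr{Cint}),
                      plan, length(gpus), gpus))
end

# Wraps: cufftResult cufftDestroy(cufftHandle plan)
function cufftDestroy(plan::cufftHandle)
    check_cufft(ccall((:cufftDestroy, libcufft), Cint, (cufftHandle,), plan))
end
```

On a machine with two GPUs one would presumably call plan = cufftCreate() followed by cufftXtSetGPUs(plan, Cint[0, 1]) before creating the actual 3D plan. The remaining cufftXt calls (memory allocation, copies, execution) would need the same treatment, plus a Julia definition of the descriptor struct they use.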

Similarly, we don’t have a convenient multi-GPU abstraction that allows multi-GPU broadcasting. There’s some initial support for using DistributedArrays.jl with CuArrays.jl, i.e. by using DArray{CuArray}, but there are many inefficiencies to be fixed there before this is a usable solution for multi-GPU computations.

I looked at it, but I could not find a simple example of how to use it. Do you have such an example? It seems to me that cublasXt would do the multi-GPU broadcasting.

I will try to provide some of the wrappers for cufftXt.

There are some _xt testsets in CuArrays/test/blas.jl, but nothing is actually documented. Those APIs aren’t sufficient to implement broadcasting, since we need to be able to compile and launch kernels for the general case (so we’d need to deal with multi-GPU indexing ourselves).
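To illustrate what "dealing with multi-GPU indexing ourselves" means, here is a CPU-only sketch of the bookkeeping. Plain Vectors stand in for per-device buffers, and the names chunk_ranges, ChunkedVector, chunked_map! and gather are all hypothetical; nothing like this exists in CuArrays as far as this thread is concerned. It only covers the purely elementwise case, and it assumes f returns values of the element type.

```julia
# Split 1:n into `nparts` contiguous ranges of nearly equal length.
function chunk_ranges(n::Integer, nparts::Integer)
    base, extra = divrem(n, nparts)
    ranges = UnitRange{Int}[]
    start = 1
    for p in 1:nparts
        len = base + (p <= extra ? 1 : 0)   # first `extra` chunks get one more
        push!(ranges, start:(start + len - 1))
        start += len
    end
    return ranges
end

# Hypothetical container: one buffer per "device", plus the global
# index range each buffer covers.
struct ChunkedVector{T}
    parts::Vector{Vector{T}}
    ranges::Vector{UnitRange{Int}}
end

function ChunkedVector(x::AbstractVector{T}, ndevices::Integer) where {T}
    ranges = chunk_ranges(length(x), ndevices)
    return ChunkedVector{T}([collect(x[r]) for r in ranges], ranges)
end

# Equivalent of `x .= f.(x)`: each chunk is updated independently.
function chunked_map!(f, x::ChunkedVector)
    for part in x.parts
        part .= f.(part)
    end
    return x
end

# Copy all chunks back into one ordinary vector.
gather(x::ChunkedVector) = reduce(vcat, x.parts)

x = collect(1.0:10.0)
cx = ChunkedVector(x, 3)
chunked_map!(sin, cx)
```

With real devices, each part would live on a different GPU and the loop body would have to select that device before launching the kernel. The elementwise case is the easy one; anything that reads neighbouring elements needs the stored ranges to translate between global and per-chunk indices and to exchange data between devices.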

I’d be willing to try this if I knew whether cufftXt allows computing the FFT of arrays that do not fit on a single GPU. So far, I have not found an example of this…

I’m interested in performing FFT on multiple GPUs and found this thread. I’m curious if there has been any progress on this issue since the OP was posted two years ago.

Not on CUDA.jl at least.