Metal.jl does not speed up FFT

I was hoping to explore using Metal.jl to speed up Fourier transforms without needing to copy data between the CPU and the GPU. However, my benchmark results show no speed improvement, which makes me suspect that the GPU on the M1 Max chip is not being used at all. I guess additional development is needed to eventually make this work, but I'm not sure whether the gap is in Metal.jl, FFTW.jl, or even Apple's own GPU libraries… Can anybody offer any advice?

               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.8.5 (2023-01-08)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using Metal

julia> using FFTW

julia> using BenchmarkTools

julia> dims = tuple(1024, 1024)
(1024, 1024)

julia> arr = rand(Float32, dims...);

julia> arr_mtl = MtlArray(arr);

julia> @benchmark fft(arr)
BenchmarkTools.Trial: 215 samples with 1 evaluation.
 Range (min … max):  21.841 ms …  26.391 ms  ┊ GC (min … max): 0.00% … 10.58%
 Time  (median):     22.913 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   23.308 ms ± 927.937 μs  ┊ GC (mean ± σ):  2.16% ±  3.32%

          ▄ ▆▄█▃▆▄▂
  ▃▁▃▃▃▁▅▇█████████▇██▇▄▄▅▄▁▄▃▁▄▄▆▄▄▅▅▃▅▄▆▃▆▅▇▄▄▄▃▅▄▁▃▃▁▁▃▁▁▁▃ ▄
  21.8 ms         Histogram: frequency by time         25.9 ms <

 Memory estimate: 16.00 MiB, allocs estimate: 30.

julia> @benchmark fft(arr_mtl)
BenchmarkTools.Trial: 210 samples with 1 evaluation.
 Range (min … max):  22.482 ms …  26.350 ms  ┊ GC (min … max): 0.00% … 11.00%
 Time  (median):     23.411 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   23.816 ms ± 960.537 μs  ┊ GC (mean ± σ):  2.67% ±  3.50%

          ▂▃█ ▂  ▂   ▂
  ▅▄▆▃▆▇▆▇███▇████▇▅▃█▃▆▁▃▅▃▁▃▃▄▄▅▃▅▆▃▇▄▅▃▃▄▄▅▆▇▃▆▄▅▃▄▆▄▄▃▁▃▁▄ ▄
  22.5 ms         Histogram: frequency by time         25.9 ms <

 Memory estimate: 20.00 MiB, allocs estimate: 43.

For comparison, CUDA.jl gives a large speedup for the same test (AMD 3990X, NVIDIA 3090):

               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.8.5 (2023-01-08)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using CUDA

julia> using FFTW

julia> using BenchmarkTools

julia> dims = tuple(1024, 1024)
(1024, 1024)

julia> arr = rand(Float32, dims...);

julia> arr_cu = CuArray(arr);

julia> @benchmark fft(arr)
BenchmarkTools.Trial: 92 samples with 1 evaluation.
 Range (min … max):  43.485 ms … 68.050 ms  ┊ GC (min … max): 0.00% … 0.91%
 Time  (median):     56.395 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   54.326 ms ±  6.909 ms  ┊ GC (mean ± σ):  0.46% ± 0.90%

  ▇                             ▄▁ █
  █▃▁▁▁▁▃▁▃▃▃▁▃▁▁▁▁▁▁▁▁▁▃▁▁▃▁▁▁▆██▄█▁▁▃▄▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▄▃▃ ▁
  43.5 ms         Histogram: frequency by time          68 ms <

 Memory estimate: 16.00 MiB, allocs estimate: 30.

julia> @benchmark fft(arr_cu)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   80.308 μs …  10.857 ms  ┊ GC (min … max): 0.00% … 21.22%
 Time  (median):      83.957 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   100.744 μs ± 386.268 μs  ┊ GC (mean ± σ):  2.91% ±  0.78%

    ▁▇█▇▄▂
  ▂▄███████▆▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▁▂▁▂▂▂▁▂▁▂▁▁▁▂▂▁▁▂▂▁▁▁▂▂▁▂ ▃
  80.3 μs          Histogram: frequency by time          121 μs <

 Memory estimate: 5.19 KiB, allocs estimate: 94.
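
One caveat on the ~80 μs figure: CUDA.jl launches GPU work asynchronously, so a plain @benchmark may mostly time the kernel launch rather than the full transform. A more conservative measurement would synchronize and interpolate the array; a sketch (not part of the original session):

using CUDA, FFTW, BenchmarkTools

arr_cu = CuArray(rand(Float32, 1024, 1024))

# CUDA.@sync blocks until the GPU has finished, so the whole transform is timed;
# interpolating with $ avoids also benchmarking global-variable access.
@benchmark CUDA.@sync fft($arr_cu)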

There isn't an FFT library for Metal.jl, so you are right that it isn't using the GPU. CUDA.jl wraps NVIDIA's cuFFT library, and as far as I know there isn't an equivalent for Metal.
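
For reference, the cuFFT route through CUDA.jl looks roughly like the sketch below: fft (or a plan from plan_fft) applied to a CuArray dispatches to the cuFFT wrapper and keeps the result on the GPU. A Metal backend would have to provide the same AbstractFFTs methods for MtlArray.

using CUDA, FFTW   # loading FFTW brings in the AbstractFFTs interface (fft, plan_fft, ...)

arr_cu = CuArray(rand(ComplexF32, 1024, 1024))

out  = fft(arr_cu)       # dispatches to CUDA.jl's cuFFT wrapper; result is a CuArray
p    = plan_fft(arr_cu)  # reusable cuFFT plan for arrays of this size and type
out2 = p * arr_cu        # also stays on the GPU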

Thanks for your reply. I found this in the Apple Developer Documentation. Do you think this has the potential to become the equivalent of cuFFT?

I also found out that Vulkan has added Metal support, and that it has an FFT library, VkFFT. I plan to give it a try later through Vulkan.jl. However, it's not obvious to me how I would be able to run the FFT on the GPU and still do CPU operations on the same array without copying…

What I really want to achieve is what’s described here, but doing fft() instead of cos().
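
For context, the cos() version of that workflow is just Metal.jl array broadcasting, which compiles to a GPU kernel and keeps the data in an MtlArray. A minimal sketch (the explicit copy at the end is only there to show where data would currently have to leave the GPU):

using Metal

a = MtlArray(rand(Float32, 1024, 1024))

b = cos.(a)    # elementwise cos runs as a Metal kernel via broadcasting
c = Array(b)   # explicit copy back to a CPU Array when needed

# The goal would be an fft(a) that likewise returns an MtlArray,
# without bouncing the data through host memory.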

That's not a Metal library though, so it should rather be added to AppleAccelerate.jl, the Julia interface to the macOS Accelerate framework. In fact, that package already has some commented-out FFT code at https://github.com/JuliaMath/AppleAccelerate.jl/blob/d46c7fa39fc07fc257d07d7bbe6c64d35046e0e9/src/DSP.jl#L387-L425, so maybe you could start there.
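
For anyone who wants to experiment before that code is revived, a direct ccall into vDSP's split-complex FFT is fairly small. The sketch below is only an illustration of the general shape (the constants and signatures come from the vDSP headers; the helper name vdsp_fft! is made up and is not part of AppleAccelerate.jl):

# Minimal sketch: in-place forward complex FFT of a power-of-two-length signal via vDSP.
const Accelerate = "/System/Library/Frameworks/Accelerate.framework/Accelerate"

struct DSPSplitComplex          # matches vDSP's split-complex layout
    realp::Ptr{Cfloat}
    imagp::Ptr{Cfloat}
end

function vdsp_fft!(re::Vector{Float32}, im::Vector{Float32})
    n = length(re)
    ispow2(n) || error("vDSP radix-2 FFT needs a power-of-two length")
    log2n = Culong(trailing_zeros(n))
    # kFFTRadix2 == 0, kFFTDirection_Forward == +1 in the vDSP headers
    setup = ccall((:vDSP_create_fftsetup, Accelerate), Ptr{Cvoid},
                  (Culong, Cint), log2n, 0)
    setup == C_NULL && error("vDSP_create_fftsetup failed")
    GC.@preserve re im begin
        split = Ref(DSPSplitComplex(pointer(re), pointer(im)))
        ccall((:vDSP_fft_zip, Accelerate), Cvoid,
              (Ptr{Cvoid}, Ptr{DSPSplitComplex}, Clong, Culong, Cint),
              setup, split, 1, log2n, 1)
    end
    ccall((:vDSP_destroy_fftsetup, Accelerate), Cvoid, (Ptr{Cvoid},), setup)
    return re, im   # transformed in place
end

# Example use: re = randn(Float32, 1024); im = zeros(Float32, 1024); vdsp_fft!(re, im)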

I just found a C library called VkFFT, which says it supports Metal as a backend. I'm wondering whether it would be possible to create a Julia wrapper for it, so that we could perform FFTs using Metal on Macs.

I also recently discovered VkFFTCUDA.jl, but it is not easy to install.

See also this issue.

Should we try to access the Apple Accelerate API from Julia?

Yes!