I was hoping to use Metal.jl to speed up Fourier transforms without having to copy data between the CPU and GPU. However, my benchmarks show no speedup, which makes me suspect that the FFT is not running on the M1 Max GPU at all. I assume further development is needed to make this work, but I'm not sure whether the missing piece is in Metal.jl, FFTW.jl, or Apple's own GPU libraries. Can anybody offer any advice?
Julia Version 1.8.5 (2023-01-08), official https://julialang.org/ release
julia> using Metal
julia> using FFTW
julia> using BenchmarkTools
julia> dims = tuple(1024, 1024)
(1024, 1024)
julia> arr = rand(Float32, dims...);
julia> arr_mtl = MtlArray(arr);
julia> @benchmark fft(arr)
BenchmarkTools.Trial: 215 samples with 1 evaluation.
Range (min … max): 21.841 ms … 26.391 ms ┊ GC (min … max): 0.00% … 10.58%
Time (median): 22.913 ms ┊ GC (median): 0.00%
Time (mean ± σ): 23.308 ms ± 927.937 μs ┊ GC (mean ± σ): 2.16% ± 3.32%
▄ ▆▄█▃▆▄▂
▃▁▃▃▃▁▅▇█████████▇██▇▄▄▅▄▁▄▃▁▄▄▆▄▄▅▅▃▅▄▆▃▆▅▇▄▄▄▃▅▄▁▃▃▁▁▃▁▁▁▃ ▄
21.8 ms Histogram: frequency by time 25.9 ms <
Memory estimate: 16.00 MiB, allocs estimate: 30.
julia> @benchmark fft(arr_mtl)
BenchmarkTools.Trial: 210 samples with 1 evaluation.
Range (min … max): 22.482 ms … 26.350 ms ┊ GC (min … max): 0.00% … 11.00%
Time (median): 23.411 ms ┊ GC (median): 0.00%
Time (mean ± σ): 23.816 ms ± 960.537 μs ┊ GC (mean ± σ): 2.67% ± 3.50%
▂▃█ ▂ ▂ ▂
▅▄▆▃▆▇▆▇███▇████▇▅▃█▃▆▁▃▅▃▁▃▃▄▄▅▃▅▆▃▇▄▅▃▃▄▄▅▆▇▃▆▄▅▃▄▆▄▄▃▁▃▁▄ ▄
22.5 ms Histogram: frequency by time 25.9 ms <
Memory estimate: 20.00 MiB, allocs estimate: 43.
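One detail in the two memory estimates above may be a hint. The following is only a rough sketch of the arithmetic. It assumes that the 16 MiB reported for the CPU run corresponds to two complex arrays, which the benchmark output itself does not confirm.

```julia
# Sizes of 1024×1024 arrays in MiB, to compare with the memory estimates above.
n = 1024 * 1024
mib(bytes) = bytes / 2^20

real_mib    = mib(n * sizeof(Float32))     # the Float32 input array
complex_mib = mib(n * sizeof(ComplexF32))  # one ComplexF32 array

println("Float32 array:    ", real_mib, " MiB")                    # 4.0
println("ComplexF32 array: ", complex_mib, " MiB")                 # 8.0
println("two complex:      ", 2 * complex_mib, " MiB")             # 16.0, matches the CPU estimate
println("plus one real:    ", 2 * complex_mib + real_mib, " MiB")  # 20.0, matches the Metal estimate
```

So the `MtlArray` run reports exactly one Float32-sized array more than the CPU run. As far as I understand, BenchmarkTools counts allocations made through Julia's own allocator, so this is at least consistent with the transform working in ordinary host memory, although it does not prove it.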
For comparison, CUDA.jl gives a large speedup on the same test (AMD 3990X, NVIDIA 3090): the median time drops from about 56 ms on the CPU to about 84 μs on the GPU.
Julia Version 1.8.5 (2023-01-08), official https://julialang.org/ release
julia> using CUDA
julia> using FFTW
julia> using BenchmarkTools
julia> dims = tuple(1024, 1024)
(1024, 1024)
julia> arr = rand(Float32, dims...);
julia> arr_cu = CuArray(arr);
julia> @benchmark fft(arr)
BenchmarkTools.Trial: 92 samples with 1 evaluation.
Range (min … max): 43.485 ms … 68.050 ms ┊ GC (min … max): 0.00% … 0.91%
Time (median): 56.395 ms ┊ GC (median): 0.00%
Time (mean ± σ): 54.326 ms ± 6.909 ms ┊ GC (mean ± σ): 0.46% ± 0.90%
▇ ▄▁ █
█▃▁▁▁▁▃▁▃▃▃▁▃▁▁▁▁▁▁▁▁▁▃▁▁▃▁▁▁▆██▄█▁▁▃▄▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▄▃▃ ▁
43.5 ms Histogram: frequency by time 68 ms <
Memory estimate: 16.00 MiB, allocs estimate: 30.
julia> @benchmark fft(arr_cu)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 80.308 μs … 10.857 ms ┊ GC (min … max): 0.00% … 21.22%
Time (median): 83.957 μs ┊ GC (median): 0.00%
Time (mean ± σ): 100.744 μs ± 386.268 μs ┊ GC (mean ± σ): 2.91% ± 0.78%
▁▇█▇▄▂
▂▄███████▆▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▁▂▁▂▂▂▁▂▁▂▁▁▁▂▂▁▁▂▂▁▁▁▂▂▁▂ ▃
80.3 μs Histogram: frequency by time 121 μs <
Memory estimate: 5.19 KiB, allocs estimate: 94.
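A caveat on the GPU numbers: GPU calls are generally asynchronous, so a benchmark that does not wait for the device may partly measure how long it takes to queue the work. The sketch below shows one way to include the wait in the timing. `median_seconds` is a hypothetical helper of my own, not part of BenchmarkTools, CUDA.jl, or Metal.jl. Passing `CUDA.synchronize` or `Metal.synchronize` as the `sync` argument is an assumption I have not verified on either machine.

```julia
using Statistics  # standard library, provides `median`

# Hypothetical helper: time `f()` followed by `sync()`, so that any work
# still queued on a device is included in the measurement.
function median_seconds(f; sync = () -> nothing, samples = 20)
    f(); sync()                          # warm-up run, keeps compilation out of the samples
    times = Vector{Float64}(undef, samples)
    for i in 1:samples
        t0 = time_ns()
        f()
        sync()                           # wait for asynchronous work to finish
        times[i] = (time_ns() - t0) / 1e9
    end
    return median(times)
end

# CPU stand-in so the sketch runs anywhere; the default `sync` is a no-op.
x = rand(Float32, 256, 256)
t_cpu = median_seconds(() -> sum(abs2, x))
println("median: ", t_cpu, " s")

# On the GPU machines one would pass the package's synchronize function, e.g.
#   median_seconds(() -> fft(arr_cu);  sync = CUDA.synchronize)
#   median_seconds(() -> fft(arr_mtl); sync = Metal.synchronize)
```

If the synchronized CUDA time stays close to the 84 μs above, the comparison stands as posted. If it turns out much larger, the real speedup is smaller than the benchmark suggests.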