Flux with Metal backend slower than on CPU (Apple M2 Pro)

This might be a very silly question (so apologies in advance!), but I have found little discussion on this topic anywhere.

I have yet to actually make use of my Mac’s M2 GPU with Flux and the Metal.jl backend. But every time I come back to this, I find that my CPU is much faster. I’m a complete newb when it comes to GPUs, so perhaps I’m just doing something silly.

Here’s an MWE, run first on the CPU and then on the GPU.

CPU

julia> using Flux, Metal, Plots
       
       to_device(x) = cpu(x)
       
       n = 1000
       bs = 64
       
       X = f32(rand(1, n)) |> to_device
       actual(x) = 4x + 2 
       Y = actual.(X) |> to_device
       Y .+= 0.1f0 * randn(size(Y)) |> to_device
       n_hidden = 100
       model = f32(Chain(Dense(1, n_hidden), Dense(n_hidden, n_hidden), Dense(n_hidden, 1))) |> to_device
       
       loss(x, y) = Flux.Losses.mse(x, y)
       rule = Descent()
       opt_state = Flux.setup(rule, model)
       
       train_loader = Flux.DataLoader((X, Y), batchsize=bs, shuffle=true)
       epochs = 10
       
       println(Metal.versioninfo())
       
       t = @elapsed for epoch in 1:epochs
           @info "Epoch $epoch"
           for (x, y) in train_loader
               grads = gradient(m -> loss(m(x), y), model)
               Flux.update!(opt_state, model, grads[1])
           end
       end
       
       @info "Time taken: $t"
macOS 14.4.1, Darwin 23.4.0

Toolchain:
- Julia: 1.10.2
- LLVM: 15.0.7

Julia packages: 
- Metal.jl: 1.1.0
- LLVMDowngrader_jll: 0.1.0+1

1 device:
- Apple M2 Pro (6.219 MiB allocated)
nothing
[ Info: Epoch 1
[ Info: Epoch 2
[ Info: Epoch 3
[ Info: Epoch 4
[ Info: Epoch 5
[ Info: Epoch 6
[ Info: Epoch 7
[ Info: Epoch 8
[ Info: Epoch 9
[ Info: Epoch 10
[ Info: Time taken: 0.158244458

GPU

julia> using Flux, Metal, Plots
       
       to_device(x) = gpu(x)
       
       n = 1000
       bs = 64
       
       X = f32(rand(1, n)) |> to_device
       actual(x) = 4x + 2 
       Y = actual.(X) |> to_device
       Y .+= 0.1f0 * randn(size(Y)) |> to_device
       n_hidden = 100
       model = f32(Chain(Dense(1, n_hidden), Dense(n_hidden, n_hidden), Dense(n_hidden, 1))) |> to_device
       
       loss(x, y) = Flux.Losses.mse(x, y)
       rule = Descent()
       opt_state = Flux.setup(rule, model)
       
       train_loader = Flux.DataLoader((X, Y), batchsize=bs, shuffle=true)
       epochs = 10
       
       println(Metal.versioninfo())
       
       t = @elapsed for epoch in 1:epochs
           @info "Epoch $epoch"
           for (x, y) in train_loader
               grads = gradient(m -> loss(m(x), y), model)
               Flux.update!(opt_state, model, grads[1])
           end
       end
       
       @info "Time taken: $t"
macOS 14.4.1, Darwin 23.4.0

Toolchain:
- Julia: 1.10.2
- LLVM: 15.0.7

Julia packages: 
- Metal.jl: 1.1.0
- LLVMDowngrader_jll: 0.1.0+1

1 device:
- Apple M2 Pro (6.266 MiB allocated)
nothing
[ Info: Epoch 1
[ Info: Epoch 2
[ Info: Epoch 3
[ Info: Epoch 4
[ Info: Epoch 5
[ Info: Epoch 6
[ Info: Epoch 7
[ Info: Epoch 8
[ Info: Epoch 9
[ Info: Epoch 10
[ Info: Time taken: 1.794397708

Am I doing something wrong? I’ve also tried playing with the parameters and noticed that the CPU quickly gets slower for larger models, while the GPU seems less affected by model size (but still generally slower than the CPU).

The problem is probably too small for the GPU to help.
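
With a batch of 64 and 100-wide layers, each gradient step is only a handful of tiny matmuls, so the time is dominated by kernel-launch overhead rather than compute. A rough, untested sketch to check this (sizes are illustrative, and it assumes gpu actually targets Metal; see the gpu_backend! note further down):

using Flux, Metal, BenchmarkTools

# Time one gradient step through a single Dense layer, small
# (like the MWE) vs. large. Sizes are illustrative.
function step_times(n_hidden, bs)
    m = Dense(n_hidden, n_hidden)
    x = rand(Float32, n_hidden, bs)
    m_gpu, x_gpu = gpu(m), gpu(x)
    t_cpu = @belapsed gradient(f -> sum(abs2, f($x)), $m)
    # Metal.@sync waits for the GPU; without it we would only be
    # timing the asynchronous kernel launches.
    t_gpu = @belapsed Metal.@sync gradient(f -> sum(abs2, f($x_gpu)), $m_gpu)
    return (; t_cpu, t_gpu)
end

@show step_times(100, 64)     # launch overhead should dominate here
@show step_times(2048, 8192)  # here the GPU has real work per launch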

Metal.jl has also not seen the level of optimization that e.g. CUDA.jl has.
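
One way to separate raw kernel performance from Flux/Zygote overhead is to benchmark a bare matmul, which Metal.jl (as far as I know) dispatches to Apple’s MPS library. Untested sketch, sizes illustrative:

using Metal, BenchmarkTools

# Raw Float32 matmul: CPU BLAS vs. Metal.
A = rand(Float32, 4096, 4096)
B = rand(Float32, 4096, 4096)
dA, dB = MtlArray(A), MtlArray(B)

@btime $A * $B                 # CPU
@btime Metal.@sync $dA * $dB   # GPU; @sync waits for the result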

When I tried to run the above, I got

┌ Info: The CUDA functionality is being called but
│ `CUDA.jl` must be loaded to access it.
└ Add `using CUDA` or `import CUDA` to your code.

So I added CUDA, and then it asked me to add cuDNN:

  24 dependencies successfully precompiled in 46 seconds. 139 already precompiled.
┌ Warning: Package cuDNN not found in current path.
│ - Run `import Pkg; Pkg.add("cuDNN")` to install the cuDNN package, then restart julia.
│ - If cuDNN is not installed, some Flux functionalities will not be available when running on the GPU.
└ @ FluxCUDAExt ~/.julia/packages/Flux/Wz6D4/ext/FluxCUDAExt/FluxCUDAExt.jl:57

I added that as well and restarted Julia, but I still get

┌ Info: The CUDA function is being called but CUDA.jl is not functional.
└ Defaulting back to the CPU. (No action is required if you want to run on the CPU).

What am I missing here?

I am on

macOS 14.4.1, Darwin 23.4.0

Toolchain:
- Julia: 1.10.0
- LLVM: 15.0.7

Julia packages: 
- Metal.jl: 1.1.0
- LLVMDowngrader_jll: 0.1.0+1

1 device:
- Apple M2 Max (64.000 KiB allocated)

This example does seem too small for any GPU to speed up. But in general, I’ve yet to see an example of a Flux model that Metal speeds up compared to the CPU.

You currently need something like Flux.gpu_backend!("Metal"), followed by a restart of Julia, to make gpu look for Metal instead of CUDA. Docs here: GPU Support · Flux
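
For reference, the switch itself is just the following (the preference is persisted via Preferences.jl, which is why the restart is needed):

using Flux

Flux.gpu_backend!("Metal")
# Restart Julia; afterwards gpu(x) should return an MtlArray
# instead of falling back to (or warning about) CUDA.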

Ok, thanks all, that makes sense.