Flux with Metal backend slower than on CPU (Apple M2 Pro)

This might be a very silly question (so apologies in advance!), but I have found little discussion on this topic anywhere.

I have yet to actually make use of my Mac’s M2 GPU with Flux and the Metal.jl backend. But every time I come back to this, I find that my CPU is much faster. I’m a complete newb when it comes to GPUs, so perhaps I’m just doing something silly.

Here’s an MWE, run first on the CPU and then on the GPU.

CPU

julia> using Flux, Metal, Plots
       
       to_device(x) = cpu(x)
       
       n = 1000
       bs = 64
       
       X = f32(rand(1, n)) |> to_device
       actual(x) = 4x + 2 
       Y = actual.(X) |> to_device
       Y .+= 0.1f0 * randn(size(Y)) |> to_device
       n_hidden = 100
       model = f32(Chain(Dense(1, n_hidden), Dense(n_hidden, n_hidden), Dense(n_hidden, 1))) |> to_device
       
       loss(x, y) = Flux.Losses.mse(x, y)
       rule = Descent()
       opt_state = Flux.setup(rule, model)
       
       train_loader = Flux.DataLoader((X, Y), batchsize=bs, shuffle=true)
       epochs = 10
       
       println(Metal.versioninfo())
       
       t = @elapsed for epoch in 1:epochs
           @info "Epoch $epoch"
           for (x, y) in train_loader
               grads = gradient(m -> loss(m(x), y), model)
               Flux.update!(opt_state, model, grads[1])
           end
       end
       
       @info "Time taken: $t"
macOS 14.4.1, Darwin 23.4.0

Toolchain:
- Julia: 1.10.2
- LLVM: 15.0.7

Julia packages: 
- Metal.jl: 1.1.0
- LLVMDowngrader_jll: 0.1.0+1

1 device:
- Apple M2 Pro (6.219 MiB allocated)
nothing
[ Info: Epoch 1
[ Info: Epoch 2
[ Info: Epoch 3
[ Info: Epoch 4
[ Info: Epoch 5
[ Info: Epoch 6
[ Info: Epoch 7
[ Info: Epoch 8
[ Info: Epoch 9
[ Info: Epoch 10
[ Info: Time taken: 0.158244458

GPU

julia> using Flux, Metal, Plots
       
       to_device(x) = gpu(x)
       
       n = 1000
       bs = 64
       
       X = f32(rand(1, n)) |> to_device
       actual(x) = 4x + 2 
       Y = actual.(X) |> to_device
       Y .+= 0.1f0 * randn(size(Y)) |> to_device
       n_hidden = 100
       model = f32(Chain(Dense(1, n_hidden), Dense(n_hidden, n_hidden), Dense(n_hidden, 1))) |> to_device
       
       loss(x, y) = Flux.Losses.mse(x, y)
       rule = Descent()
       opt_state = Flux.setup(rule, model)
       
       train_loader = Flux.DataLoader((X, Y), batchsize=bs, shuffle=true)
       epochs = 10
       
       println(Metal.versioninfo())
       
       t = @elapsed for epoch in 1:epochs
           @info "Epoch $epoch"
           for (x, y) in train_loader
               grads = gradient(m -> loss(m(x), y), model)
               Flux.update!(opt_state, model, grads[1])
           end
       end
       
       @info "Time taken: $t"
macOS 14.4.1, Darwin 23.4.0

Toolchain:
- Julia: 1.10.2
- LLVM: 15.0.7

Julia packages: 
- Metal.jl: 1.1.0
- LLVMDowngrader_jll: 0.1.0+1

1 device:
- Apple M2 Pro (6.266 MiB allocated)
nothing
[ Info: Epoch 1
[ Info: Epoch 2
[ Info: Epoch 3
[ Info: Epoch 4
[ Info: Epoch 5
[ Info: Epoch 6
[ Info: Epoch 7
[ Info: Epoch 8
[ Info: Epoch 9
[ Info: Epoch 10
[ Info: Time taken: 1.794397708

Am I doing something wrong? I’ve also tried playing with the parameters and noticed that the CPU quickly gets slower for larger models, while the GPU seems less affected by model size (but still generally slower than the CPU).

The problem is probably too small for the GPU to help.
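
With a batch of 64 and 100-wide layers, each gradient step is only a handful of tiny matmuls, so the time is dominated by kernel-launch overhead rather than compute. A rough, untested sketch to check this (sizes are illustrative, and it assumes gpu actually targets Metal; see the gpu_backend! note further down):

using Flux, Metal, BenchmarkTools

# Time one gradient step through a single Dense layer, small
# (like the MWE) vs. large. Sizes are illustrative.
function step_times(n_hidden, bs)
    m = Dense(n_hidden, n_hidden)
    x = rand(Float32, n_hidden, bs)
    m_gpu, x_gpu = gpu(m), gpu(x)
    t_cpu = @belapsed gradient(f -> sum(abs2, f($x)), $m)
    # Metal.@sync waits for the GPU; without it we would only be
    # timing the asynchronous kernel launches.
    t_gpu = @belapsed Metal.@sync gradient(f -> sum(abs2, f($x_gpu)), $m_gpu)
    return (; t_cpu, t_gpu)
end

@show step_times(100, 64)     # launch overhead should dominate here
@show step_times(2048, 8192)  # here the GPU has real work per launch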

Metal.jl has also not seen the level of optimization that e.g. CUDA.jl has.
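
One way to separate raw kernel performance from Flux/Zygote overhead is to benchmark a bare matmul, which Metal.jl (as far as I know) dispatches to Apple’s MPS library. Untested sketch, sizes illustrative:

using Metal, BenchmarkTools

# Raw Float32 matmul: CPU BLAS vs. Metal.
A = rand(Float32, 4096, 4096)
B = rand(Float32, 4096, 4096)
dA, dB = MtlArray(A), MtlArray(B)

@btime $A * $B                 # CPU
@btime Metal.@sync $dA * $dB   # GPU; @sync waits for the result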

When I tried to run the above, I got

┌ Info: The CUDA functionality is being called but
│ `CUDA.jl` must be loaded to access it.
└ Add `using CUDA` or `import CUDA` to your code.

So I added CUDA, and then it asked me to add cuDNN:

  24 dependencies successfully precompiled in 46 seconds. 139 already precompiled.
┌ Warning: Package cuDNN not found in current path.
│ - Run `import Pkg; Pkg.add("cuDNN")` to install the cuDNN package, then restart julia.
│ - If cuDNN is not installed, some Flux functionalities will not be available when running on the GPU.
└ @ FluxCUDAExt ~/.julia/packages/Flux/Wz6D4/ext/FluxCUDAExt/FluxCUDAExt.jl:57

I added that as well and restarted Julia, but I still get

┌ Info: The CUDA function is being called but CUDA.jl is not functional.
└ Defaulting back to the CPU. (No action is required if you want to run on the CPU).

What am I missing here?

I am on

macOS 14.4.1, Darwin 23.4.0

Toolchain:
- Julia: 1.10.0
- LLVM: 15.0.7

Julia packages: 
- Metal.jl: 1.1.0
- LLVMDowngrader_jll: 0.1.0+1

1 device:
- Apple M2 Max (64.000 KiB allocated)

This example does seem too small for any GPU to speed up. But in general, I’ve yet to see an example of a Flux model that Metal speeds up compared to the CPU.

You currently need something like Flux.gpu_backend!("Metal"), followed by a restart of Julia, to make gpu look for Metal instead of CUDA. Docs here: GPU Support · Flux
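
For reference, the switch itself is just the following (the preference is persisted via Preferences.jl, which is why the restart is needed):

using Flux

Flux.gpu_backend!("Metal")
# Restart Julia; afterwards gpu(x) should return an MtlArray
# instead of falling back to (or warning about) CUDA.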

Ok, thanks all, that makes sense.