This might be a very silly question (so apologies in advance!), but I have found little discussion of this topic anywhere.
I have yet to actually make use of my Mac's M2 GPU with Flux and the Metal.jl backend: every time I come back to this, I find that my CPU is much faster. I'm a complete newbie when it comes to GPUs, so perhaps I'm just doing something silly.
Here's an MWE, first running on the CPU and then on the GPU.
CPU
using Flux, Metal, Plots

to_device(x) = cpu(x)

n = 1000
bs = 64
X = f32(rand(1, n)) |> to_device
actual(x) = 4x + 2
Y = actual.(X) |> to_device
Y .+= 0.1f0 * randn(size(Y)) |> to_device

n_hidden = 100
model = f32(Chain(Dense(1, n_hidden), Dense(n_hidden, n_hidden), Dense(n_hidden, 1))) |> to_device
loss(x, y) = Flux.Losses.mse(x, y)
rule = Descent()
opt_state = Flux.setup(rule, model)
train_loader = Flux.DataLoader((X, Y), batchsize=bs, shuffle=true)
epochs = 10

println(Metal.versioninfo())

t = @elapsed for epoch in 1:epochs
    @info "Epoch $epoch"
    for (x, y) in train_loader
        grads = gradient(m -> loss(m(x), y), model)
        Flux.update!(opt_state, model, grads[1])
    end
end
@info "Time taken: $t"
macOS 14.4.1, Darwin 23.4.0
Toolchain:
- Julia: 1.10.2
- LLVM: 15.0.7
Julia packages:
- Metal.jl: 1.1.0
- LLVMDowngrader_jll: 0.1.0+1
1 device:
- Apple M2 Pro (6.219 MiB allocated)
nothing
[ Info: Epoch 1
[ Info: Epoch 2
[ Info: Epoch 3
[ Info: Epoch 4
[ Info: Epoch 5
[ Info: Epoch 6
[ Info: Epoch 7
[ Info: Epoch 8
[ Info: Epoch 9
[ Info: Epoch 10
[ Info: Time taken: 0.158244458
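One caveat I'm aware of: `@elapsed` around the first run of the loop also measures JIT compilation, so these timings include compile time. A minimal sketch of how to warm up first and time only steady-state training (same variables as in the MWE above):

```julia
# Wrap one pass over the data so we can call it repeatedly.
train_once!() = for (x, y) in train_loader
    grads = gradient(m -> loss(m(x), y), model)
    Flux.update!(opt_state, model, grads[1])
end

train_once!()                  # warm-up: triggers compilation of all kernels
t = @elapsed for _ in 1:epochs # now measure steady-state time only
    train_once!()
end
@info "Time taken (after warm-up): $t"
```

Even so, I see the same gap every time I come back to this, so I don't think compilation alone explains it.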
GPU
using Flux, Metal, Plots

to_device(x) = gpu(x)

n = 1000
bs = 64
X = f32(rand(1, n)) |> to_device
actual(x) = 4x + 2
Y = actual.(X) |> to_device
Y .+= 0.1f0 * randn(size(Y)) |> to_device

n_hidden = 100
model = f32(Chain(Dense(1, n_hidden), Dense(n_hidden, n_hidden), Dense(n_hidden, 1))) |> to_device
loss(x, y) = Flux.Losses.mse(x, y)
rule = Descent()
opt_state = Flux.setup(rule, model)
train_loader = Flux.DataLoader((X, Y), batchsize=bs, shuffle=true)
epochs = 10

println(Metal.versioninfo())

t = @elapsed for epoch in 1:epochs
    @info "Epoch $epoch"
    for (x, y) in train_loader
        grads = gradient(m -> loss(m(x), y), model)
        Flux.update!(opt_state, model, grads[1])
    end
end
@info "Time taken: $t"
macOS 14.4.1, Darwin 23.4.0
Toolchain:
- Julia: 1.10.2
- LLVM: 15.0.7
Julia packages:
- Metal.jl: 1.1.0
- LLVMDowngrader_jll: 0.1.0+1
1 device:
- Apple M2 Pro (6.266 MiB allocated)
nothing
[ Info: Epoch 1
[ Info: Epoch 2
[ Info: Epoch 3
[ Info: Epoch 4
[ Info: Epoch 5
[ Info: Epoch 6
[ Info: Epoch 7
[ Info: Epoch 8
[ Info: Epoch 9
[ Info: Epoch 10
[ Info: Time taken: 1.794397708
Am I doing something wrong? I've also tried varying the parameters and noticed that the CPU quickly gets slower for larger models, while the GPU seems less affected by model size (though it still generally remains slower than the CPU).
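For reference, this is the kind of parameter sweep I mean: a sketch (not my exact script) that times one warmed-up epoch for several hidden sizes on each device, reusing the setup from the MWE above:

```julia
using Flux, Metal

# Time a single epoch of training for a given device-transfer function
# (cpu or gpu), after a warm-up pass to exclude compilation.
function time_epoch(to_device; n_hidden=100, n=1000, bs=64)
    X = f32(rand(1, n)) |> to_device
    Y = f32(4 .* rand(1, n) .+ 2) |> to_device
    model = f32(Chain(Dense(1, n_hidden), Dense(n_hidden, n_hidden), Dense(n_hidden, 1))) |> to_device
    opt_state = Flux.setup(Descent(), model)
    loader = Flux.DataLoader((X, Y), batchsize=bs, shuffle=true)
    train_once!() = for (x, y) in loader
        grads = gradient(m -> Flux.Losses.mse(m(x), y), model)
        Flux.update!(opt_state, model, grads[1])
    end
    train_once!()          # warm-up
    @elapsed train_once!() # timed epoch
end

for n_hidden in (100, 1000, 4000)
    t_cpu = time_epoch(cpu; n_hidden)
    t_gpu = time_epoch(gpu; n_hidden)
    @info "n_hidden=$n_hidden" cpu=t_cpu gpu=t_gpu
end
```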