Flux model on CPU runs slowly

I find that my Flux model runs much more slowly on CPU than on GPU:

julia> m = Chain(
           Dense(2250, 500, σ),
           Dense(500, 50, tanh),
           Dense(50, 7, σ));

julia> X = Float32.(X);

julia> size(X)
(2250, 484)

julia> @btime m(X);
  10.718 ms (10 allocations: 2.06 MiB)

julia> X_gpu = X |> gpu;

julia> m_gpu = m |> gpu;

julia> @btime m_gpu(X_gpu);
  21.864 μs (134 allocations: 3.80 KiB)
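One thing worth checking before comparing these numbers: CUDA.jl launches kernels asynchronously, so an unsynchronized `@btime` can measure only the kernel-launch overhead rather than the full forward pass. A sketch of a fairer GPU timing, assuming the `m_gpu` and `X_gpu` defined above:

```julia
using CUDA, Flux, BenchmarkTools

# CUDA.@sync blocks until the GPU work actually finishes, so the
# timing covers the computation, not just the asynchronous launch.
@btime CUDA.@sync m_gpu($X_gpu);
```

If the synchronized time is much larger than 21 μs, the original GPU number was mostly launch overhead.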

And training the model on CPU was intolerably slow, and I am really confused.

That’s a really big matmul for a GPU :man_shrugging:. It should cost about the same in any CPU implementation, since it’s all going to end up in a BLAS kernel.
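If the matmul does dominate, one common cause of a slow CPU forward pass is BLAS running on a single thread. A minimal check, assuming a standard Julia install with OpenBLAS (the thread count of 4 below is just an example, not a recommendation):

```julia
using LinearAlgebra

# How many threads is the BLAS library using for matrix multiplies?
BLAS.get_num_threads()

# If it reports 1, try matching your number of physical cores
# (4 here is an assumed core count):
BLAS.set_num_threads(4)
```

Re-running the `@btime m(X)` benchmark after raising the thread count shows whether threading was the bottleneck.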


And when I trained the same PyTorch model on CPU, it was much faster than the Flux model on CPU.

Julia’s `tanh` is fairly slow, but the bottleneck really should be the matmul.
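To see which part actually dominates, one can time the big matmul and the activation broadcast separately. A sketch, assuming the `m` and `X` from the original post (the `randn` input for the `tanh` timing is a stand-in with the same shape as the second hidden layer's output):

```julia
using Flux, BenchmarkTools

# The first layer's weight matrix (500×2250) drives the largest matmul.
W = m.layers[1].weight

# Time the big matmul alone...
@btime $W * $X;

# ...and the tanh broadcast alone, on a 50×484 stand-in array.
@btime tanh.(randn(Float32, 50, 484));
```

If `tanh` turns out to be a significant fraction, NNlib (which Flux uses) provides a cheaper `tanh_fast` that may be worth trying.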