# Flux: GPU slower than CPU?

I have a loss function which I’ve defined using Flux. On my boring old CPU on one batch, it performs like this:

```
julia> @time loss(x, y)
0.071089 seconds (402 allocations: 20.235 MiB, 4.37% gc time)
1.3092503770133925 (tracked)
```

When running on a machine with a fancy GPU, it performs like this:

```
julia> @time loss(x, y)
1.685316 seconds (1.24 M allocations: 61.493 MiB, 0.59% gc time)
2.6197212f0 (tracked)
```

(And yes, I have run the precompiler before testing in both cases.) What's going on here? Why does the GPU version make roughly 3,000x as many allocations and take 25x longer?
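One thing worth ruling out before digging deeper: `@time` on a first call includes JIT compilation, and timing a single call is noisy. A common pattern is to call the function once as a warm-up and then benchmark it repeatedly with BenchmarkTools (assumed to be installed; `f` and `x` here are stand-ins for the thread's `loss`, `x`, `y`):

```julia
using BenchmarkTools  # assumption: BenchmarkTools.jl is available

# Stand-in for the real loss function; on a GPU you would also want to
# synchronize before trusting wall-clock numbers, since kernels launch
# asynchronously.
f(x) = sum(abs2, x)
x = rand(50, 1000)

f(x)          # warm-up call so compilation is excluded from the timing
@btime f($x)  # $ interpolates the global, avoiding benchmarking artifacts
```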

What is the loss function here?

I’m using the logitcrossentropy function that comes with Flux
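For reference, logitcrossentropy takes raw logits and fuses the log-softmax into the cross entropy, averaging over the batch (columns). A hand-rolled sketch of what it computes (assumption: this mirrors Flux's definition; the `dims` keyword syntax below is modern Julia, not the 0.6 used in this thread):

```julia
# Column-wise log-softmax: subtract log-sum-exp of each column.
logsoftmax_cols(x) = x .- log.(sum(exp.(x), dims=1))

# Cross entropy on logits, averaged over the batch (one sample per column).
my_logitcrossentropy(ŷ, y) = -sum(y .* logsoftmax_cols(ŷ)) / size(y, 2)

ŷ = [2.0 0.0; 1.0 0.0; 0.0 3.0]  # 3 classes, 2 samples (raw logits)
y = [1.0 0.0; 0.0 0.0; 0.0 1.0]  # one-hot targets
my_logitcrossentropy(ŷ, y)       # ≈ 0.2513
```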

Can you post a reproducible example?

Here goes:

```julia
using Flux
# using CuArrays  # if you are on the GPU

d = Dense(50, 10) |> gpu
n = 50000
x = rand(50, n) |> gpu
y = hcat([[i == j for i = 1:10] for j = rand(1:10, n)]...) .* 1.0 |> gpu
loss(a, b) = Flux.logitcrossentropy(d(a), b)
```

CPU:

```
julia> @time loss(x, y)
0.331867 seconds (139.06 k allocations: 77.120 MiB, 35.57% gc time)
2.4550202681078717 (tracked)
```

GPU:

```
julia> @time loss(x, y)
24.348809 seconds (17.95 M allocations: 892.661 MiB, 0.62% gc time)
2.5094972f0 (tracked)
```

[edited to remove the prompt to allow for easier copying]

Also for what it’s worth:

| thing    | version I am using |
|----------|--------------------|
| Julia    | 0.6.4              |
| Flux     | 0.5.4              |
| CuArrays | 0.6.2              |
| Ubuntu   | 16.04              |
| GPU      | Tesla K80          |
| CUDA     | 9.2                |

Still stumped. Is it an issue with my setup? Can anyone else reproduce?

Flux is great, but many of its operations have not yet been optimized for the GPU. See, for example, this issue: https://github.com/FluxML/Flux.jl/issues/189