I have a loss function which I’ve defined using Flux. On my boring old CPU on one batch, it performs like this:
```
julia> @time loss(x, y)
  0.071089 seconds (402 allocations: 20.235 MiB, 4.37% gc time)
1.3092503770133925 (tracked)
```
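For context, the kind of setup I mean looks roughly like this; the real model, loss, and batch are more involved, so take every name and size below as a placeholder:

```julia
using Flux

# Placeholder model and loss -- the real ones are bigger, but this is the shape.
model = Chain(Dense(784, 128, relu), Dense(128, 10), softmax)
loss(x, y) = Flux.crossentropy(model(x), y)

# One dummy batch of 64 samples.
x = rand(Float32, 784, 64)
y = Flux.onehotbatch(rand(0:9, 64), 0:9)

loss(x, y)        # warm-up call so compilation isn't included in the timing
@time loss(x, y)
```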
When running on a machine with a fancy GPU, it performs like this:
```
julia> @time loss(x, y)
  1.685316 seconds (1.24 M allocations: 61.493 MiB, 0.59% gc time)
2.6197212f0 (tracked)
```
(And yes, I have run the precompiler before timing in both cases.) What's going on here? Why does the GPU version make roughly 3000x as many allocations and take about 25x longer?
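In case it's relevant, the GPU version is just the same loss with the model and batch moved over via `gpu`, roughly like this (again a placeholder sketch built on the toy model above; `gpu_model` and `gpu_loss` are names I'm using here for illustration):

```julia
using Flux, CUDA   # or `using CuArrays` on older, Tracker-era Flux versions

# Move the (placeholder) model and batch from the sketch above onto the GPU.
gpu_model = gpu(model)
gpu_loss(x, y) = Flux.crossentropy(gpu_model(x), y)
gx, gy = gpu(x), gpu(y)

gpu_loss(gx, gy)        # warm-up call so kernel compilation isn't included
@time gpu_loss(gx, gy)
```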