I’m trying to run a straightforward regression problem through Flux. I’ve modified the loss function from the example, from a simple mse to something like a chi-square. The problem has to do with the calibration of a calorimeter, and the error term of the chi-square is proportional to the sqrt of the energy.
The following code exemplifies the problem:

using Flux
model = Chain(
Dense(250, 64),
Dense(64, 1)
)
x = rand(250, 1000)
y = rand(1000)
function loss(x, y)
error = sqrt.(y)
sqrt(sum(((model(x) .- y).^2)./error))
end
@time Flux.Tracker.gradient(()->loss(x, y), params(model))

This code results in the following output on my machine.

15.787468 seconds (211.02 M allocations: 5.623 GiB, 4.70% gc time)

Removing the error term from the loss function gives me:

0.117279 seconds (15.27 k allocations: 59.252 MiB, 14.24% gc time)

In my actual example (where my input vector is of size 200k), I just can’t train with the loss function with the error term at all, but without it, my training is suboptimal, and I think I need this term.
Does anybody have a suggestion for how to get this loss function to work?

Not a complete answer, but note that in your current loss function,

error is allocated as a temporary variable.

If instead, the sqrt.(y) is simply included on the same line:

function loss(x, y)
sqrt(sum(((model(x) .- y).^2)./sqrt.(y)))
end

the allocations are reduced, and for me computation time is cut by more than half:

Before:

16.608458 seconds (211.00 M allocations: 5.623 GiB, 3.29% gc time)

After:

6.484304 seconds (135.00 M allocations: 3.269 GiB, 6.02% gc time)

My understanding is that when the broadcast dots (.) are all part of the same evaluation, the function calculations are fused and avoid temporaries (although it would be nice if the same could be recognized in your original function).

This obviously still isn’t as fast as if the error term were not there, and there are still quite a few allocations. Part of the slowdown might simply be that division is a slow operation, but there may be further changes that would help.

Now I feel stupid.
This fixes both the computational performance as well as the problem with my training not converging.
(not surprisingly, if I used the wrong loss function).
I always spend inane amounts of time fixing dimension problems like this in numpy, but I had thought my extension of the linear regression example from the model zoo was straightforward enough…
Many thanks!