Lifting a Julia function into a Flux "layer"

Ok, so the docs have this example:

opt = SGD([W, b], 0.1) # Gradient descent with learning rate 0.1

opt() # Carry out the update, modifying `W` and `b`.

So let’s try it for the case above:

using Flux
using Flux.Tracker
using Flux.Optimise

W = param(rand(2, 5))
b = param(rand(2))

predict(x) = W*x .+ b
loss(x, y) = sum((predict(x) .- y).^2)

x, y = rand(5), rand(2) # Dummy data

par = Params([W, b])
grads = Tracker.gradient(() -> loss(x, y), par)

opt = SGD(par, 0.1)
opt()

This fails with

MethodError: no method matching length(::Params)
Closest candidates are:
  length(!Matched::Core.SimpleVector) at essentials.jl:576
  length(!Matched::Base.MethodList) at reflection.jl:728
  length(!Matched::Core.MethodTable) at reflection.jl:802
  ...

But the call doesn’t quite make sense anyway: it only references the parameters and never mentions the loss function (or the gradients just computed).

It seems that for a proper Flux model, [W, b] or params(m) must somehow carry the loss function under the hood. But it’s not clear how to make that connection for a hand-rolled loss function.
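
For comparison, here is the workflow I would have expected to work, going by the docs example above: pass a plain array of tracked params to SGD, run the forward pass, call back! on the loss, and then call opt(). (This is just my reading of the old Tracker-style API, so I may well be missing something.)

opt = SGD([W, b], 0.1)   # plain array of tracked params, as in the docs snippet

l = loss(x, y)           # forward pass builds the tracked graph
Tracker.back!(l)         # backprop fills in W.grad and b.grad
opt()                    # applies the SGD update from the accumulated grads

But even if that runs, it still leaves the original question: how are Params and Tracker.gradient meant to be hooked up to the optimiser for a hand-rolled loss like this?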