I want to understand the meaning of Δ as used in the different methods (i.e. for the different optimizers) in Flux.
For example, in Descent, the meaning of Δ seems to be:
Looking at Δ in AdaGrad, however, AdaGrad takes time into account as well (if I understood AdaGrad correctly), so Δ would mathematically be a diagonal matrix whose ith element is:
, assuming my understanding is correct. How can Δ be given such different (types of) values? It is part of the API that is usually called by the update! function, which eventually performs x -= Δ. I can't wrap my head around how that operation (even when vectorised) is going to be a valid one.
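To make the question concrete, here is a minimal sketch of how I currently picture the two cases. It is plain Julia and does not use Flux. The names MyDescent, MyAdaGrad, my_apply! and my_update! are hypothetical stand-ins of my own, not Flux's actual source. The sketch also assumes that the "diagonal matrix" is only ever stored as an array with the same shape as x (one accumulator entry per parameter), which is exactly the part I would like confirmed.

```julia
# Hypothetical stand-ins for illustration only; not Flux's actual implementation.
struct MyDescent
    eta::Float64
end

struct MyAdaGrad
    eta::Float64
    acc::IdDict{Any,Any}   # per-parameter accumulator of squared gradients
end
MyAdaGrad(eta) = MyAdaGrad(eta, IdDict())

const EPS = 1e-8

# Descent: Δ arrives as the gradient and leaves as eta * gradient.
function my_apply!(o::MyDescent, x, Δ)
    Δ .*= o.eta
    return Δ
end

# AdaGrad: Δ again arrives as the gradient. The diagonal of the matrix is kept
# as an array shaped like x, so the rescaling is an elementwise operation.
function my_apply!(o::MyAdaGrad, x, Δ)
    acc = get!(() -> fill!(similar(x), EPS), o.acc, x)
    @. acc += Δ^2
    @. Δ *= o.eta / (sqrt(acc) + EPS)
    return Δ
end

# Shared update: whatever my_apply! returns has the same shape as x,
# so the in-place subtraction is well defined for both optimizers.
function my_update!(o, x, gradient)
    x .-= my_apply!(o, x, gradient)
    return x
end

x1 = [1.0, 2.0]
x2 = [1.0, 2.0]
g  = [0.5, -1.0]

my_update!(MyDescent(0.1), x1, copy(g))   # each entry moves by eta * g[i]
my_update!(MyAdaGrad(0.1), x2, copy(g))   # first step moves each entry by roughly eta
```

If this picture is right, Δ would always be an array shaped like x, and only the way it is rescaled before the subtraction would differ between optimizers.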