Hi,
I want to understand the meaning of Δ used in different methods (i.e. for different optimizers) in Flux.
For example, in Descent, the meaning of Δ seems to be the gradient: [∂f/∂θᵢ], i = 1, 2, 3, …
In AdaGrad, however, the update takes time into account as well (if I understood AdaGrad correctly), so Δ mathematically means a diagonal matrix whose ith element depends on the gradient history, assuming my understanding is correct.

How is Δ given different (types of) values? It's part of the API, usually called by the update! function, which eventually calls x -= Δ. I'm not able to wrap my head around how that operation (even when vectorised) is going to be a valid one.
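
To make the question concrete, here is a minimal sketch of how I picture the two cases, in plain Julia without Flux. The names descent_step and adagrad_step! are hypothetical, made up for illustration, and the values of η and ϵ are assumptions; this is not Flux's actual code. The idea is that if the diagonal matrix is stored as just its diagonal (an array shaped like x), multiplying by it is an element-wise product, so Δ would come out with the same shape as x in both cases.

```julia
# Hypothetical stand-ins for the optimiser rules; NOT Flux's own definitions.

# Plain gradient descent: the step is the gradient scaled by the learning rate η.
descent_step(grad; η = 0.1) = η .* grad

# AdaGrad-style step: `acc` holds the running sum of squared gradients, one
# entry per parameter, i.e. the diagonal of the matrix stored as an array.
function adagrad_step!(acc, grad; η = 0.1, ϵ = 1e-8)
    acc .+= grad .^ 2                      # accumulate history element-wise
    return η .* grad ./ (sqrt.(acc) .+ ϵ)  # same shape as grad
end

x    = [1.0, 2.0, 3.0]    # parameters
grad = [0.5, -1.0, 2.0]   # pretend gradient of the loss w.r.t. x

Δ_descent = descent_step(grad)
acc = zeros(size(x))      # AdaGrad state, starts empty
Δ_adagrad = adagrad_step!(acc, grad)

# Both steps are arrays shaped like x, so the same in-place update applies.
x_descent = copy(x); x_descent .-= Δ_descent
x_adagrad = copy(x); x_adagrad .-= Δ_adagrad
```

If this picture is right, x -= Δ would be valid for either optimizer, because Δ always has the shape of x and only the way it is computed differs. Is that what happens in Flux?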