Why is SUM always used in differentiation?

I’ve seen a lot of examples where the output of a neural network (or any other function) is wrapped in a sum before differentiation. For example this:

julia> hessian(x -> sum(x.^3), [1 2; 3 4])  # uses linear indexing of x
4×4 Array{Int64,2}:
 6   0   0   0
 0  18   0   0
 0   0  12   0
 0   0   0  24

Is this a Julia thing, or is there some math behind it?

It’s just a short way to make a function which returns a scalar. You will get an error with functions which return an array:

julia> gradient(x -> x.^3, [1 2; 3 4])
ERROR: output an array, so the gradient is not defined. Perhaps you wanted jacobian.
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33

julia> jacobian(x -> x.^3, [1 2; 3 4])[1]
4×4 Matrix{Int64}:
 3   0   0   0
 0  27   0   0
 0   0  12   0
 0   0   0  48

julia> gradient(x -> sum(x.^3), [1 2; 3 4])[1]
2×2 Matrix{Int64}:
  3  12
 27  48
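
To see how the two relate (a quick check, assuming these functions come from Zygote): the gradient of sum(f(x)) is the vector of column sums of the Jacobian of f, reshaped to the size of x. In the example above the Jacobian is diagonal, so the column sums are just its diagonal:

julia> using Zygote

julia> J = jacobian(x -> x.^3, [1 2; 3 4])[1];

julia> reshape(sum(J, dims=1), 2, 2)   # column sums of J, reshaped to the size of x
2×2 Matrix{Int64}:
  3  12
 27  48

which is exactly what gradient(x -> sum(x.^3), x) returned.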

Great answer, thank you!

To expand a bit, training a neural network (or solving most other optimization problems) means minimizing a scalar objective function, so you usually want to define the problem with a scalar in the first place. In fact, the term gradient usually refers to the partial derivatives of a scalar-valued function with respect to one or more variables, while jacobian (in @mcabbott’s example) refers to the partial derivatives of a vector-valued function. There are plenty of uses for Jacobians, but gradients are more common in optimization, which is one of the most popular uses for AD.
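
A typical training setup looks something like this (a minimal sketch, assuming Zygote; the names W, b, x, y, and loss are made up for illustration). The loss already returns a scalar, so no extra sum is needed when taking the gradient:

using Zygote

W = randn(2, 3); b = randn(2)        # parameters to optimize
x = randn(3);    y = randn(2)        # one (input, target) pair

loss(W, b) = sum(abs2, W * x .+ b .- y)   # scalar objective: squared error

∇W, ∇b = gradient(loss, W, b)             # gradients with the same shapes as W and b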

I consider sum to be not just a way to keep Julia from erroring, but a key part of defining a sensible problem to solve.
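
The choice of reduction is part of that problem definition, too. Replacing sum with a mean, for example, gives an objective whose gradient is scaled by 1/length(x) (again assuming Zygote; the values work out to 3x.^2 ./ 4 here):

julia> using Statistics

julia> gradient(x -> mean(x.^3), [1 2; 3 4])[1]
2×2 Matrix{Float64}:
 0.75   3.0
 6.75  12.0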
