 # Why is SUM always used in differentiation?

I’ve seen a lot of examples where the output of a NN, or of any function being differentiated, is preprocessed with the `sum` operator. For example:

```
julia> hessian(x -> sum(x.^3), [1 2; 3 4])  # uses linear indexing of x
4×4 Array{Int64,2}:
 6   0   0   0
 0  18   0   0
 0   0  12   0
 0   0   0  24
```

Is this a Julia thing, or is there some math behind it?

It’s just a short way to make a function which returns a scalar. You will get an error with functions which return an array:

```
julia> gradient(x -> x.^3, [1 2; 3 4])
ERROR: output an array, so the gradient is not defined. Perhaps you wanted jacobian.
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33

julia> jacobian(x -> x.^3, [1 2; 3 4])
4×4 Matrix{Int64}:
 3   0   0   0
 0  27   0   0
 0   0  12   0
 0   0   0  48

julia> gradient(x -> sum(x.^3), [1 2; 3 4])
2×2 Matrix{Int64}:
  3  12
 27  48
```
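The two outputs are consistent, because `sum` is linear: the gradient of `sum∘f` is just the column sums of the Jacobian of `f`, reshaped back to the input’s size. A quick check of that (a sketch assuming these examples come from Zygote.jl, whose `gradient` and `jacobian` return tuples, hence the `[1]`):

```julia
using Zygote   # assumption: Zygote.jl is the AD package behind these examples

f(x) = x.^3
x = [1 2; 3 4]

J = jacobian(f, x)[1]               # 4×4 Jacobian of the array-valued f
g = gradient(x -> sum(f(x)), x)[1]  # 2×2 gradient of the scalar sum∘f

# Summing the Jacobian's columns and reshaping to size(x) recovers the gradient:
reshape(vec(sum(J, dims=1)), size(x)) == g   # true
```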

To expand a bit: a neural network, like most optimization problems, usually needs to be trained to minimize a scalar objective function, so you usually want to define the problem with a scalar in the first place. In fact, the term `gradient` usually refers to the vector of partial derivatives of a scalar-valued function with respect to one or more variables, and `jacobian` (in @mcabbott’s example) to the matrix of partial derivatives of an array-valued function. There are plenty of uses for Jacobians, but gradients are more common in optimization, which is one of the most popular uses for AD.
I consider `sum` to be more than just a way to get Julia not to error: it is a key part of defining a sensible problem to be solved.
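For instance, here is a minimal sketch of that scalar-first setup (assuming Zygote.jl and a made-up least-squares loss; any AD package exposing `gradient` would look much the same):

```julia
using Zygote   # assumption: substitute your AD package of choice

# Toy least-squares problem: the loss collapses everything into one scalar,
# so `gradient` is well defined and returns one entry per parameter.
X = randn(10, 3)
y = randn(10)
loss(w) = sum(abs2, X*w .- y)   # scalar objective

w  = zeros(3)
∇w = gradient(loss, w)[1]       # 3-element vector, same shape as w

w -= 0.01 .* ∇w                 # one step of plain gradient descent
```

Here `sum(abs2, ...)` plays the same role as the `sum` in the examples above: it turns an array of residuals into the single number that the optimizer actually minimizes.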