To expand a bit: training a neural network, like most optimization problems, means minimizing a scalar objective function, so you usually want to define the problem with a scalar in the first place. In fact, the term *gradient* usually refers to the partial derivatives of a scalar function with respect to one or more variables, while *Jacobian* (as in @mcabbott's example) refers to the partial derivatives of a vector-valued function. There are plenty of uses for Jacobians, but gradients are more common in optimization, which is one of the most popular uses for AD.
This makes `sum` more than just a way to get Julia not to error: it's a key part of defining a sensible problem to be solved.
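To make that concrete, here's a minimal sketch assuming Zygote.jl as the AD backend (the functions `f` and `g` and the input `x` are made-up names for illustration): `gradient` wants a function that returns a single number, `jacobian` handles the vector-valued case, and wrapping the output in `sum` is one way to turn the latter into the former.

```julia
using Zygote

f(x) = x .^ 2           # vector -> vector: no single "gradient", only a Jacobian
g(x) = sum(x .^ 2)      # vector -> scalar: a well-posed objective to minimize

x = [1.0, 2.0, 3.0]

Zygote.jacobian(f, x)   # ([2 0 0; 0 4 0; 0 0 6],)  one row per output component
Zygote.gradient(g, x)   # ([2.0, 4.0, 6.0],)        one sensitivity per input component
# Zygote.gradient(f, x) # errors, because the output is an array rather than a scalar
```

Note that the gradient of the summed function is just the column sums of the Jacobian, which is why `sum` is such a natural way to reduce a vector-valued output to something an optimizer can work with.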