ForwardDiff propagates sensitivities wrt all the parameters, so while it’s technically a single function call, inside this function call is the equivalent of many function calls. To compute gradients in about the same time as a single function call (in theory) for functions with scalar output, use ReverseDiff.
There are many resources on how forward and reverse differentiation work, but I was confused about it for a long time before I found a way of thinking about it that spoke to me: for a function f with jacobian Jf computed as the composition of many functions (say f = f1 o f2 for simplicity), forward-mode differentiation is a way of efficiently computing J*(delta_x) for any given delta_x. By the chain rule, Jf*delta_x = Jf1*(Jf2*delta_x), and so all you have to do is propagate the sensitivities, taking about the same time as calling f. For a function R^n -> R though, you have to do this for N different delta_x to build the gradient. Instead, reverse-mode differentiation is a way of computing Jf^T * (delta_f) for any given delta_f, again via the chain rule, but this time transposed: J^T delta_y = Jf2^T*(Jf1^T * delta_f). Now it’s enough to do it only once to compute a gradient, which is why it’s much more efficient for gradients. However, because the transpose changes the order of application of the inner functions, you have to go through the computation tree in reverse, which can cause overhead.