ForwardDiff chunk size: for what kinds of problems does it make a large difference?

question

#1

How much have you guys noticed the chunk size mattering for ForwardDiff.jl timings? I don’t see anything more than like 2% of a timing difference in the functions I am testing on (I am testing it through NLsolve as well), so I am just fixing the chunk size to be 1 (fixed so that it is type-stable, and set to 1 since it must be less than or equal to the length of the input vector). But if the developers chose to adapt the chunk size to the input vector, does that mean it does have a larger effect in some cases?

Maybe I am just missing the case where this is important? In which case, what’s a good example function that can differentiate the timings based on the chosen chunk size?


#2
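
```julia
using ForwardDiff
using BenchmarkTools   # provides @benchmark
using LinearAlgebra    # provides norm

# In-place test function: y = x .* norm(x)
function ff!(y, x)
    v = norm(x)
    for i in 1:length(y)
        y[i] = x[i] * v
    end
    return nothing
end

# Time the Jacobian of ff! for every chunk size from 1 to N
function bench_jacobian(N)
    x = rand(N)
    y = zeros(N)
    J = zeros(N, N)
    for c in 1:N
        # Fix the chunk size to c for this configuration
        ycfg = ForwardDiff.JacobianConfig(ff!, y, x, ForwardDiff.Chunk{c}())
        res = @benchmark ForwardDiff.jacobian!($J, $ff!, $y, $x, $ycfg)
        println(minimum(res.times))  # minimum time in nanoseconds
    end
end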
```

Results for N = 10 (minimum time in nanoseconds, one line per chunk size from 1 to 10):

```
5473.9
3706.25
3753.75
3463.4444444444443
3041.0
3421.2
3848.875
1026.3
1140.4
878.5333333333333
```

What code did you use? How did you benchmark?

#3

@kristoffer.carlsson already provided a good concrete example, but maybe I can help provide a theoretical perspective for how changes in the chunk size affect performance.

Generally, raising the chunk size will reduce the number of calls that need to be made to the objective function at the cost of performing additional multiplications at each intermediate operation in the objective function. Raising the chunk size also changes the stack layout (as the extra epsilon components of the Dual numbers are stack-allocated) and can thus increase stack pressure.
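To make the first half of that trade-off concrete, here is a minimal sketch that counts how often the objective function is evaluated for a given chunk size. The names `CALLS`, `counted_sum_of_squares`, and `count_calls` are made up for illustration; the sketch assumes a recent ForwardDiff version where the chunk size is passed as `ForwardDiff.Chunk{c}()` to `ForwardDiff.GradientConfig`.

```julia
using ForwardDiff

# Hypothetical global counter, incremented on every evaluation of the objective
const CALLS = Ref(0)

# Hypothetical objective function that records how often it is called
function counted_sum_of_squares(x)
    CALLS[] += 1
    return sum(abs2, x)
end

# Number of objective evaluations for one gradient of an n-vector with chunk size c
function count_calls(n, c)
    x = rand(n)
    cfg = ForwardDiff.GradientConfig(counted_sum_of_squares, x, ForwardDiff.Chunk{c}())
    CALLS[] = 0
    ForwardDiff.gradient(counted_sum_of_squares, x, cfg)
    return CALLS[]
end

for c in (1, 2, 5, 10)
    println("chunk size ", c, ": ", count_calls(10, c), " calls")
end
```

The counts only show the "fewer calls" side; the per-operation overhead and stack effects have to be measured with a benchmark like the one in the earlier post.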

The chunk size “sweet spot” for any given function is the one that minimizes the number of objective function calls without incurring “too much” multiplication overhead or thrashing the stack.

For example, a function composed of a few very cheap operations may benefit from a relatively lower chunk size, since the additional multiplications might be expensive relative to the cost of just calling the function.

Also keep in mind that the number of function calls is inversely proportional to the chunk size, so the savings diminish as the chunk size grows. Writing N for the chunk size: halving the number of function calls at N = 1 requires jumping to N = 2 (costing one additional multiply-per-operation), while halving the number of function calls at N = 5 requires jumping to N = 10 (incurring 5 extra multiplies-per-operation for the same benefit).
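As a back-of-the-envelope version of this argument (a rough cost model, not a description of ForwardDiff's internals): let $n$ be the input length, $N$ the chunk size, $t_f$ the cost of one plain function evaluation, and $\alpha$ an assumed relative cost of updating one extra partial per operation.

$$
\text{calls}(N) = \left\lceil \frac{n}{N} \right\rceil,
\qquad
T(N) \approx \left\lceil \frac{n}{N} \right\rceil \,(1 + \alpha N)\, t_f
$$

For $n = 10$, $\text{calls}(1) = 10$, $\text{calls}(2) = 5$, $\text{calls}(5) = 2$, and $\text{calls}(10) = 1$. Doubling $N$ always halves the first factor at best, but the extra per-operation work $\alpha N$ it adds grows with $N$, which is why the optimum depends on how expensive the function is relative to $\alpha$.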

Benchmarking the test functions in DiffBase with different chunk sizes might help build intuition for how different functions respond to different chunk sizes; some of those functions should be more sensitive than others.


#4

Also, when running with ./julia -O3, the extra additions and multiplications for the operations on the dual part seem to vectorize well (use SIMD instructions). This means that unless the function itself is vectorized, the extra operations on the dual parts are quite cheap, and that might push up the threshold for what the optimal chunk size is for a given function.