ForwardDiff chunk size: for what kinds of problems does it make a large difference?

ChrisRackauckas · January 1, 2017, 9:45am

How much have you guys noticed the chunk size mattering for ForwardDiff.jl timings? I don’t see more than about a 2% timing difference in the functions I am testing (I am also testing it through NLsolve), so I am just fixing the chunk size at 1 (fixed for type stability, and 1 because it must be less than or equal to the length of the input vector). But if the developers chose to adapt the chunk size to the input vector, does that mean it has a larger effect in some cases?

Maybe I am just missing the case where this is important? In which case, what’s a good example function that can differentiate the timings based on the chosen chunk size?

kristoffer.carlsson · January 1, 2017, 12:07pm

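```
using ForwardDiff
using BenchmarkTools   # provides @benchmark
using LinearAlgebra    # provides norm on Julia 0.7 and later

# In-place test function: y = x .* norm(x), so every output depends on every input.
function ff!(y, x)
    v = norm(x)
    for i in 1:length(y)
        y[i] = x[i] * v
    end
    return nothing
end

# Time the Jacobian for every chunk size from 1 to N and print the minimum time (ns).
function bench_jacobian(N)
    x = rand(N)
    y = zeros(N)
    J = zeros(N, N)
    for c in 1:N
        # Constructor as of this post; newer ForwardDiff releases take the chunk as an argument:
        # ForwardDiff.JacobianConfig(ff!, y, x, ForwardDiff.Chunk{c}())
        ycfg = ForwardDiff.JacobianConfig{c}(zeros(N), x)
        res = @benchmark ForwardDiff.jacobian!($J, $ff!, $y, $x, $ycfg)
        println(minimum(res.times))
    end
end
```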

Results for N = 10 (minimum time in nanoseconds, one line per chunk size from 1 to 10)

```
5473.9
3706.25
3753.75
3463.4444444444443
3041.0
3421.2
3848.875
1026.3
1140.4
878.5333333333333
```

What code did you use? How did you benchmark?

jrevels · January 2, 2017, 2:40am

@kristoffer.carlsson already provided a good concrete example, but maybe I can help provide a theoretical perspective for how changes in the chunk size affect performance.

Generally, raising the chunk size will reduce the number of calls that need to be made to the objective function at the cost of performing additional multiplications at each intermediate operation in the objective function. Raising the chunk size also changes the stack layout (as the extra epsilon components of the Dual numbers are stack-allocated) and can thus increase stack pressure.
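To make the per-operation cost concrete, here is a minimal sketch of a dual number that carries N partials, where N plays the role of the chunk size. `ToyDual` is a hypothetical type written for illustration only; it is not ForwardDiff's actual `Dual` implementation, but it shows why every multiplication in the objective function gets more expensive as the chunk grows.

```julia
# Toy dual number carrying N partial derivatives (N plays the role of the chunk size).
# Illustration only: this is NOT ForwardDiff's Dual type.
struct ToyDual{N}
    value::Float64
    partials::NTuple{N,Float64}
end

# Product rule: one multiply for the value, plus 2N multiplies and N adds for the partials.
function Base.:*(a::ToyDual{N}, b::ToyDual{N}) where {N}
    ToyDual{N}(a.value * b.value,
               ntuple(i -> a.value * b.partials[i] + b.value * a.partials[i], Val(N)))
end

# Seed x = 3 along the first of two directions and y = 4 along the second.
x = ToyDual{2}(3.0, (1.0, 0.0))
y = ToyDual{2}(4.0, (0.0, 1.0))
z = x * y   # value 12.0, partials (4.0, 3.0): d(xy)/dx = y and d(xy)/dy = x
```

A single evaluation with `ToyDual{2}` yields both partial derivatives at once, where `ToyDual{1}` would need two evaluations; the price is the longer tuple that every operation has to update.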

The chunk size “sweet spot” for any given function is going to strike a balance between minimizing the number of objective function calls without incurring “too much” multiplication overhead or thrashing the stack.

For example, a function composed of a few very cheap operations may benefit from a relatively lower chunk size, since the additional multiplications might be expensive relative to the cost of just calling the function.

Also keep in mind that the number of saved function calls is inversely related to the chunk size; halving the number of function calls at chunk size N = 1 requires jumping to N = 2 (costing one additional multiply per operation), while halving it at N = 5 requires jumping to N = 10 (incurring 5 extra multiplies per operation for the same benefit).
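The diminishing returns can be tabulated with a few lines of plain Julia. This is a rough sketch that assumes the number of objective-function evaluations is `ceil(n / chunk)` for an input of length `n`; the helper name `passes` is made up for this example.

```julia
# Number of objective-function evaluations needed to cover n input directions
# when each evaluation carries `chunk` partials (assumed model: ceil(n / chunk)).
passes(n, chunk) = cld(n, chunk)

n = 10
for chunk in 1:n
    # Evaluations saved by growing the chunk by one (nothing to compare against at chunk = 1).
    saved = chunk == 1 ? 0 : passes(n, chunk - 1) - passes(n, chunk)
    println("chunk = ", chunk, "  passes = ", passes(n, chunk), "  saved vs. previous = ", saved)
end
```

For n = 10 the pass counts are 10, 5, 4, 3, 2, 2, 2, 2, 2, 1: going from chunk size 1 to 2 saves five evaluations, while chunk sizes 5 through 9 all need the same two evaluations and only add per-operation work.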

Benchmarking the test functions in DiffBase with different chunk sizes might help build intuition for how different functions respond to different chunk sizes; some of those functions should be more sensitive than others.

kristoffer.carlsson · January 2, 2017, 2:48am

Also, when running with ./julia -O3, the extra additions and multiplications for the operations on the dual part seem to vectorize well (use SIMD instructions). This means that unless the function itself is vectorized, the extra operations on the dual parts are quite cheap, and that might push up the threshold for what the optimal chunk size is for a given function.
