# Are computations with Dual numbers twice as slow as non Duals?

After reading this I got the impression that computations with dual numbers are typically expected to be about twice as slow as computations with non-duals. The reasoning was that both the normal and the dual parts have to be calculated at each operation, much like in complex arithmetic.

However, I fail to observe this in the following minimal example. The example is motivated by the computation of parts of the gradient of a function `g(x::AbstractVector)`. Assuming forward-mode AD, I can either split the function’s input into two vectors (one operating on dual numbers and the other not) or use dual numbers for everything. When doing the latter, I don’t observe anywhere near the `(2*5 + 2*20)/(2*5 + 20) ≈ 1.667x` slowdown that I would expect.
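For concreteness, that ratio comes from a back-of-the-envelope cost model (my own assumption that a dual operation costs roughly twice a plain one; the element counts 5 and 20 match `x_values` and `y_values` below):

```julia
# Assumed cost model: a dual operation costs ~2x a plain one.
# f: only the 5 elements of x carry partials, the 20 of y do not.
# g: all 25 elements carry partials.
cost_f = 2 * 5 + 20        # mixed dual/plain loop
cost_g = 2 * 5 + 2 * 20    # all-dual loop
cost_g / cost_f            # ≈ 1.667 expected slowdown
```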

```julia
using ForwardDiff
using BenchmarkTools
using Base: Iterators

function f(x::AbstractVector{T}, y::AbstractVector) where {T}
    result = zero(T)
    for j in 1:1000  # 1000 "computations" to simulate a situation where f or g dominates allocation costs in override
        for value in x
            result += (0.9 * value)^j
        end
        for value in y
            result += (0.9 * value)^j
        end
    end
    return result
end

const x_values = ones(5)
const y_values = ones(20)

@btime ForwardDiff.gradient(x -> f(x, y_values), x_values)

const combined_values = vcat(x_values, y_values)

function g(combined::AbstractVector{T}) where {T}
    result = zero(T)
    for j in 1:1000  # 1000 "computations" to simulate a situation where f or g dominates allocation costs in override
        for value in combined
            result += (0.9 * value)^j
        end
    end
    return result
end

function override(x, y::T) where {T}
    result = T(x)  # convert x to the (dual) element type of y
    for i in eachindex(y)
        result[i] = y[i]  # overwrite the leading entries with the dual values
    end
    return result
end

override(combined_values, x_values)
@btime ForwardDiff.gradient(x -> g(override(combined_values, x)), x_values)
```

Which gives only about a 1.12x slowdown, not the expected 1.667x:

```
  587.081 μs (4 allocations: 848 bytes)
  657.692 μs (5 allocations: 2.16 KiB)
```

Normally you can SIMD the primal and dual parts, because there are a lot of related computations.
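As an illustration of that point, here is a minimal hand-rolled dual type (hypothetical `MiniDual`, not ForwardDiff's actual implementation): the primal and partials updates in a multiply are independent muladds over a small tuple, which the compiler can vectorize together.

```julia
# Minimal sketch of dual multiplication. The product rule turns into a
# handful of independent muladds, which LLVM can turn into SIMD
# instructions alongside the primal multiply.
struct MiniDual{N,T}
    value::T
    partials::NTuple{N,T}
end

function Base.:*(a::MiniDual{N,T}, b::MiniDual{N,T}) where {N,T}
    # (a * b)' = a' * b + a * b'
    MiniDual(a.value * b.value,
             ntuple(i -> muladd(a.partials[i], b.value, a.value * b.partials[i]), N))
end
```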


For `^`, the slow part is calculating the power of the values. The fast part is updating the partials via multiplication.

```julia
if y == zero(y) || iszero(partials(x))
    new_partials = zero(partials(x))
else
    new_partials = partials(x) * y * ($f)(v, y - 1)
end
```

If you have non-zero partials with respect to `x` in `x ^ y`, it calculates `^` twice; I assume this is for accuracy reasons, instead of computing `x ^ (y - 1)` once and multiplying by `x` to get `x ^ y`.
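You can see the two power evaluations in the derivative rule itself with a small runnable check (my illustration, not from the thread):

```julia
using ForwardDiff: Dual, value, partials

x = Dual(0.9, 1.0)          # value 0.9 with a single unit partial
j = 7
z = x^j
value(z)                    # 0.9^7      -- the primal needs one power
partials(z, 1)              # 7 * 0.9^6  -- the derivative needs a second one
partials(z, 1) ≈ 7 * 0.9^6  # true: the rule d/dx x^j = j * x^(j-1)
```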

In your chunk-mode gradient, `partials(x)` will sometimes be zero, letting it skip the second `^`.
Take a linear combination before looping and applying `^` (e.g. multiply by a dense square matrix) to ensure none of the partials are zero, and you’ll probably get closer to the expected 2x; a sketch of that experiment follows.
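Here is a minimal sketch of that experiment (the matrix `A` and wrapper `h` are names I'm introducing; it reuses `g` and `combined_values` from the code above):

```julia
using ForwardDiff, BenchmarkTools

# Mix all 25 inputs with a dense square matrix so every element of the
# mixed vector carries non-zero partials in every chunk, defeating the
# iszero(partials(x)) short-circuit in the `^` rule.
const A = rand(25, 25)
h(v) = g(A * v)

@btime ForwardDiff.gradient(h, combined_values)
```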