Are computations with Dual numbers twice as slow as with non-Duals?

After reading this I got the impression that computations with dual numbers are typically expected to be about twice as slow as with non-duals. The reasoning was that both the primal and the dual parts have to be calculated at each operation, much like in complex arithmetic.
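
To see where that intuition comes from, here is a minimal sketch of dual-number multiplication (not ForwardDiff's actual implementation; the struct D and its fields are made up for illustration). Each operation does the primal work plus comparable extra work for the derivative part:

# Minimal single-partial dual number, for illustration only.
struct D
    v::Float64   # primal value
    p::Float64   # derivative (partial) part
end

# Product rule: one primal multiply plus a multiply-add for the partials,
# i.e. roughly twice the floating-point work of a plain Float64 multiply.
Base.:*(a::D, b::D) = D(a.v * b.v, muladd(a.v, b.p, a.p * b.v))

D(2.0, 1.0) * D(3.0, 0.0)  # D(6.0, 3.0): value 2*3, derivative 1*3 + 2*0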

However, I fail to observe this in the following minimal example. The example is motivated by computing part of the gradient of a function g(x::AbstractVector). Assuming forward-mode AD, I can either split the function's input into two vectors (one carrying dual numbers, the other plain floats) or use dual numbers for everything. Counting each dual operation as twice the work of a plain one, the all-dual version should be about (2*5 + 2*20)/(2*5 + 20) ≈ 1.67x slower than the split version, but I observe nothing close to that.

using ForwardDiff
using BenchmarkTools

function f(x::AbstractVector{T}, y::AbstractVector) where {T}
    result = zero(T)
    for j in 1:1000  # 1000 "computations" to simulate a situation where f or g dominates allocation costs in override
        for value in x
            result += (0.9 * value)^j
        end
        for value in y
            result += (0.9 * value)^j
        end
    end
    return result
end

const x_values = ones(5)
const y_values = ones(20)

@btime ForwardDiff.gradient(x -> f(x, y_values), x_values)

const combined_values = vcat(x_values, y_values)

function g(combined::AbstractVector{T}) where {T}
    result = zero(T)
    for j in 1:1000  # 1000 "computations" to simulate a situation where f or g dominates allocation costs in override
        for value in combined
            result += (0.9 * value)^j
        end
    end
    return result
end

function override(x, y::T) where {T}
    # Convert x to the container type of y (when y holds duals, this promotes
    # the Float64 entries to duals with zero partials), then overwrite the
    # leading entries with y so that only those carry seed partials.
    result = T(x)
    for i in eachindex(y)
        result[i] = y[i]
    end
    return result
end

override(combined_values, x_values)  # sanity check with plain Float64 inputs
@btime ForwardDiff.gradient(x -> g(override(combined_values, x)), x_values)

Which gives (the split f version first, the all-dual g version second):

  587.081 μs (4 allocations: 848 bytes)
  657.692 μs (5 allocations: 2.16 KiB)

Normally you can SIMD the primal and dual parts together, because there are a lot of related computations.
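
As a rough illustration of that point (a sketch, not ForwardDiff internals; scale! is a made-up helper): the value and partial updates in a dual multiply are independent multiply-adds, which the compiler can often vectorize together across an array instead of paying the full 2x.

using ForwardDiff: Dual

# Scale every dual in place; the value and partial updates are
# independent multiply-adds that the compiler can vectorize.
function scale!(xs::Vector{<:Dual}, a::Float64)
    @inbounds for i in eachindex(xs)
        xs[i] *= a
    end
    return xs
end

# Inspect the generated code with, e.g.:
# @code_llvm debuginfo=:none scale!([Dual(1.0, 1.0)], 0.9)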


In x ^ y, the slow part is calculating the power of the values; the fast part is updating the partials via multiplication. The branching comes from ForwardDiff's rule for ^ on duals (quoted from inside a macro, so ($f) stands for ^):

if y == zero(y) || iszero(partials(x))
    new_partials = zero(partials(x))                  # cheap path: skip the second ^
else
    new_partials = partials(x) * y * ($f)(v, y - 1)   # second ^ evaluated here
end

If you have non-zero partials with respect to x in x ^ y, it calculates ^ twice, once for the value and once for the derivative. I assume this is for accuracy reasons, rather than computing x ^ (y - 1) once and multiplying by x to get x ^ y.
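
For reference, a self-contained, runnable version of the same branch (a sketch under the assumptions above; pow_dual is a made-up name, and the real ForwardDiff rule handles tags and more cases):

using ForwardDiff: Dual, value, partials

function pow_dual(x::Dual{T}, y::Real) where {T}
    v = value(x)
    expv = v^y                                      # first ^: the primal value
    if y == zero(y) || iszero(partials(x))
        new_partials = zero(partials(x))            # cheap path: skip the second ^
    else
        new_partials = partials(x) * y * v^(y - 1)  # second ^: the derivative factor
    end
    return Dual{T}(expv, new_partials)
end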

In your chunk-mode gradient, partials(x) will sometimes be all zero (here, for the 20 entries copied from combined_values, which never receive seed partials), letting it skip the second ^.
Take a linear combination before the loop over ^ (e.g. multiply by a dense square matrix) to ensure that none of the partials are zero, and you'll probably get closer to the expected 2x.
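
A sketch of that suggestion against the example above (M and h are made-up names): mixing the combined vector with a dense matrix gives every entry non-zero partials, so the expensive branch of ^ runs for all 25 entries.

using LinearAlgebra

const M = rand(25, 25)  # dense mixing matrix (illustrative)

# Every entry of M * combined is a linear combination that includes the
# dual-seeded entries, so none of the partials inside g's ^ loop are zero.
h(x) = g(M * override(combined_values, x))
@btime ForwardDiff.gradient(h, x_values)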